auriglyph — The compression industry is optimising the wrong number

Compression benchmarks compete on a single axis — ratio. Who squeezes a gigabyte into the fewest bytes. For blockchain archives, rollup data-availability layers, and on-chain analytics, that axis is a distraction. Those systems do not sit on the data; they query it — a sender here, a recipient there, a value, a selector — across an archive that only ever grows.

On that axis, every byte-stream compressor in production shares the same defect.

The defect: reading one field costs you the whole stream

To read transaction i out of a gzip / zstd / brotli blob, you must rehydrate the stream up to position i. That is O(n) in the archive size. Columnar formats (Parquet/ORC) are smarter: they seek to a row-group in O(1), but they still decompress an entire column chunk to surface a single cell. That is O(block): a large, fixed cost paid on every point read.
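The rehydration cost is easy to reproduce with the Python standard library. The sketch below is a toy, not the benchmark harness: the 64-byte record size, the `make_records` helper and the pseudo-random payloads are assumptions made for illustration. It streams a gzip blob from its first byte until record i is available and reports how many bytes had to be inflated to get there.

```python
import gzip
import hashlib
import zlib

RECORD_SIZE = 64  # hypothetical fixed record size, chosen for illustration


def make_records(n: int) -> list[bytes]:
    """Build n pseudo-random 64-byte records (stand-ins for transactions)."""
    return [hashlib.sha512(str(i).encode()).digest() for i in range(n)]


def read_record_from_gzip(blob: bytes, i: int, chunk: int = 4096) -> tuple[bytes, int]:
    """Return (record i, bytes inflated), streaming the blob from its first byte."""
    inflater = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    need = (i + 1) * RECORD_SIZE
    out = bytearray()
    pos = 0
    # There is no way to jump ahead: every byte before record i is inflated first.
    while len(out) < need and pos < len(blob):
        out += inflater.decompress(blob[pos:pos + chunk])
        pos += chunk
    return bytes(out[i * RECORD_SIZE:need]), len(out)


records = make_records(1000)
blob = gzip.compress(b"".join(records))
early, cost_early = read_record_from_gzip(blob, 10)
late, cost_late = read_record_from_gzip(blob, 900)
print(cost_early, cost_late)  # the later record costs far more inflated bytes
```

The bytes inflated track the position of the record, which is the O(n) behaviour described above.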

Tau reads a stored field directly out of the compressed archive, touching only that transaction's packet, decompressing nothing. That is O(1) — constant in the size of the archive — in its per-transaction random-access mode.
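Tau's packet layout and handle scheme are not described in this article, so the following is only a generic illustration of why per-record packets allow a constant-cost read. It assumes a hypothetical archive made of concatenated packets plus a table of 8-byte offsets; `build_archive` and `read_packet` are names invented for this sketch, not Tau's API.

```python
import struct


def build_archive(packets: list[bytes]) -> tuple[bytes, bytes]:
    """Return (index, body): one 8-byte offset per packet plus an end sentinel."""
    offsets, pos = [], 0
    for packet in packets:
        offsets.append(pos)
        pos += len(packet)
    offsets.append(pos)  # sentinel marks the end of the last packet
    index = struct.pack(f"<{len(offsets)}Q", *offsets)
    return index, b"".join(packets)


def read_packet(index: bytes, body: bytes, handle: int) -> tuple[bytes, int]:
    """Return (packet, bytes touched) for one handle."""
    # Two index entries give the packet bounds; nothing else is read.
    start, end = struct.unpack_from("<2Q", index, handle * 8)
    return body[start:end], 16 + (end - start)


packets = [bytes([i % 256]) * (10 + i % 5) for i in range(1000)]
index, body = build_archive(packets)
packet, touched = read_packet(index, body, 777)
```

In this toy layout the bytes touched are two index entries plus the packet itself, whether the archive holds a thousand packets or a billion.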

This is not a multiplier. It is a different complexity class. And complexity classes diverge without bound.

We measured it. On real data. Losslessly.

Workload. 74,583 real Ethereum transactions (canonical mainnet stream). For each archive size N from 1,000 to 74,583, read the recipient (to) of a transaction at a random handle, and record both latency and bytes touched per read — the latter being a property of the algorithm, immune to cache, CPU, or compiler.

Result — the exponent of N, read straight off the log–log regression (O(1) ⇒ slope 0, O(n) ⇒ slope 1):

Codec	Read of one stored field	Measured slope	At N = 74,583
Tau (per-tx random access)	O(1) — zero decompression	latency +0.11, bytes −0.13	40 ns · 15 bytes touched
gzip blob	O(n) — rehydrate to position	latency +1.01 (R² 0.994)	60.7 ms · 46 MB inflated
Parquet / zstd	O(block) — decompress column chunk	flat, high floor	216 µs · 183 KB decompressed
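For readers who want to reproduce the slope column, here is a minimal sketch of an ordinary least-squares fit in log–log space. The function name and the sample points are illustrative assumptions; the inputs below are synthetic, not the measured data.

```python
import math


def loglog_slope(sizes, costs):
    """Fit log(cost) = a + slope * log(N); return (slope, r_squared)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # R^2 is undefined when the cost never varies (a perfectly flat line).
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else float("nan")
    return slope, r_squared


sizes = [1000, 2000, 4000, 8000, 16000]
linear_slope, linear_r2 = loglog_slope(sizes, [3.0 * n for n in sizes])  # O(n) shape
flat_slope, _ = loglog_slope(sizes, [40.0] * len(sizes))                 # O(1) shape
```

A cost proportional to N fits a slope of 1, and a cost that ignores N fits a slope of 0, which is how the table should be read.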

Tau's bytes-touched slope is negative: as the archive grows, a point read touches fewer bytes, not more. A fitted slope is evidence rather than proof, but a slope at or below zero is exactly what O(1) looks like, with margin to spare.

A note on method: latency is measured warm-cache on a single core. But the load-bearing evidence is the bytes-touched column, which is cache-, CPU-, and compiler-independent by construction — it counts the bytes the algorithm must decode. So the O(1) claim does not rest on cache state; cold cache changes the constant, never the slope.

Correctness gate. Every one of the 74,583 fields read out of the compressed archive — in random order — was byte-identical to source. Zero mismatches. Random order matters: it proves the read needs no sequential replay of the stream.
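The shape of such a gate can be sketched in a few lines. This is an assumption about how a check like this might be written, not the harness used for the measurement; `count_mismatches` and the `read_field` callable are hypothetical names, and the reader would be whatever function returns field i from the compressed archive.

```python
import random


def count_mismatches(source, read_field, seed=0):
    """Read every field in shuffled order and count those that differ from source."""
    order = list(range(len(source)))
    random.Random(seed).shuffle(order)  # a fixed seed keeps the run repeatable
    return sum(1 for i in order if read_field(i) != source[i])


source = [bytes([i % 256]) * 20 for i in range(500)]
good = count_mismatches(source, lambda i: source[i])
bad = count_mismatches(source, lambda i: source[0])  # a deliberately broken reader
```

A reader that depended on sequential replay would fail as soon as the shuffled order asked for a field out of sequence.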

[Figure: latency and bytes-touched charts. Tau point-read is O(1) in archive size; Tau is the flat line at the bottom of both charts.]

Why this is the whole game

Blockchain history is append-only and unbounded. It only grows. So the gap between O(1) and O(n) is not a number you quote once — it widens forever as the chain advances.

Versus a byte-stream blob (gzip/zstd), the gap grows without limit with history. At today's 74,583-tx slice it is already 60.7 ms against 40 ns per read; next year's archive makes it larger, by construction.
Versus columnar (Parquet/ORC), the gap is a large constant — Tau decompresses nothing per read; the columnar reader still inflates a whole row-group chunk for one cell. (We state this boundary ourselves, because it is true, and because a fixed factor is not the same as an unbounded one.)
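The two kinds of gap can be put in numbers using the latencies from the results table. The sketch below is plain arithmetic on those figures; the projection function is an extrapolation that assumes the fitted gzip slope of 1.01 keeps holding and that Tau's latency stays flat, neither of which has been measured beyond N = 74,583.

```python
TAU_NS = 40.0        # Tau point read at N = 74,583
GZIP_NS = 60.7e6     # 60.7 ms expressed in nanoseconds
PARQUET_NS = 216e3   # 216 µs expressed in nanoseconds

gzip_ratio = GZIP_NS / TAU_NS        # about 1.5 million at today's slice
parquet_ratio = PARQUET_NS / TAU_NS  # 5,400, a fixed factor


def projected_gzip_ratio(growth: float, slope: float = 1.01) -> float:
    """Extrapolated gzip/Tau ratio after the archive grows by `growth` times.

    Assumption: gzip latency keeps following N**slope and Tau stays constant.
    """
    return gzip_ratio * growth ** slope


print(gzip_ratio, parquet_ratio, projected_gzip_ratio(10.0))
```

Under those assumptions a tenfold larger archive multiplies the blob gap by a little more than ten, while the columnar gap stays where it is.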

A compressor that wins on ratio but reads in O(n) is solving a problem an archive does not have. Tau is built for the operation that actually runs at scale: query the compressed data in place.

What this is — and what it is not

Plain about the boundaries, so you do not have to find them:

O(1) applies to stored fields: to, value, nonce, the selector, the calldata. Computed fields (the transaction hash, the sender recovered from the signature) are an O(1) locate plus a constant recompute. We make that line explicit.
It is per-transaction random-access mode. Whole-buffer archival reaches higher density — by giving up exactly the random access that is the point here. Different regime, different tool.
Tau is not a general compressor. It is calibrated for Ethereum-class transaction shapes; adversarial random input falls through to opaque encoding with no gain.
Pilot, not "production-ready." The accumulator that backs Tau's constant-size inclusion proofs is a novel construction and has not had an external cryptographic audit yet. Treat it as a pilot capability.

The line that matters

Tau returns a query result from a compressed archive before a general-purpose compressor completes its first decompression step — and the lead grows for the entire life of the chain.

Ratio is a one-time win you bank when you write. Constant-time query is a win you collect on every read, forever. For an archive that never stops growing, those are not the same trade — and only one of them is on the right side of a complexity class.