More

terrelln · 2026-01-22T18:25:39 1769106339

Zstd should not be slower than gzip to decompress here. Given that it has inflated the files to be bigger than the uncompressed data, it has to do more work to decompress. This seems like a bug, or somehow measuring the wrong thing, and not the expected behavior.

mort96 · 2026-01-22T19:07:15 1769108835

It seems like zstd is somehow compressing really badly when many zstd processes are run in parallel, but works as expected when run sequentially: https://news.ycombinator.com/item?id=46723158

Regardless, this does not make a significant difference. I ran hyperfine again against a 37M folder of .pdf.zst files, and the results are virtually identical for zstd and gzip:

    +-------+-------+--------+-------+
    | gzip  | zstd  | brotli | xz    |
    +-------+-------+--------+-------+
    | 142ms | 165ms | 269ms  | 994ms |
    +-------+-------+--------+-------+

Raw hyperfine output:

    ~/tmp/pdfbench $ du -h zst2 gz xz br
     37M    zst2
     38M    gz
     38M    xz
     37M    br
    
    ~/tmp/pdfbench $ hyperfine ...
    Benchmark 1: for x in zst2/*; do zstd -d >/dev/null <"$x"; done
      Time (mean ± σ):     164.5 ms ±   2.3 ms    [User: 83.5 ms, System: 72.3 ms]
      Range (min … max):   162.3 ms … 172.3 ms    17 runs
    
    Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
      Time (mean ± σ):     142.2 ms ±   0.9 ms    [User: 87.4 ms, System: 43.1 ms]
      Range (min … max):   140.8 ms … 143.9 ms    20 runs
    
    Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
      Time (mean ± σ):     993.9 ms ±   9.2 ms    [User: 896.7 ms, System: 99.1 ms]
      Range (min … max):   981.4 ms … 1007.2 ms    10 runs
    
    Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
      Time (mean ± σ):     269.1 ms ±   8.8 ms    [User: 176.6 ms, System: 75.8 ms]
      Range (min … max):   261.8 ms … 287.6 ms    10 runs

terrelln · 2026-01-22T20:20:22 1769113222

Ah I understand. In this benchmark, Zstd's decompression time is 284 MB/s, and Gzip's is 330 MB/s. This benchmark is likely dominated by file IO for the faster decompressors.

On the incompressible files, I'd expect decompression of any algorithm to approach the speed of `memcpy()`. And would generally expect zstd's decompression speed to be faster. For example, on a x86 core running at 2GHz, Zstd is decompressing a file at 660 MB/s, and on my M1 at 1276 MB/s.

You could measure locally either using a specialized tool like lzbench [0], or for zstd by just running `zstd -b22 --ultra /path/to/file`, which will print the compression ratio, compression speed, and decompression speed.

[0] https://github.com/inikep/lzbench

terrelln · 2026-01-22T18:23:49 1769106229

> | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |

Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?

If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result.

mort96 · 2026-01-22T18:35:51 1769106951

Okay now this is weird.

I can reproduce it just fine ... but only when compressing all PDFs simultaneously.

To utilize all cores, I ran:

    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22 & done; wait

(and similar for the other formats).

I ran this again and it produced the same 2M file from the source 1.1M file. However when I run without paralellization:

    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22; done

That one file becomes 1.1M, and the total size of *.zst is 37M (competitive with Brotli, which is impressive given how much faster it is to decompress).

What's going on here? Surely '-22' disables any adaptive compression stuff based on system resource availability and just uses compression level 22?

terrelln · 2026-01-22T20:01:42 1769112102

Yeah, `--adaptive` will enable adaptive compression, but it isn't enabled by default, so shouldn't apply here. But even with `--adaptive`, after compressing each block of 128KB of data, zstd checks that the output size is < 128KB. If it isn't, it emits an uncompressed block that is 128KB + 3B.

So it is very central to zstd that it will never emit a block that is larger than 128KB+3B.

I will try to reproduce, but I suspect that there is something unrelated to zstd going on.

What version of zstd are you using?

mort96 · 2026-01-22T20:26:21 1769113581

'zstd --version' reports: "** Zstandard CLI (64-bit) v1.5.7, by Yann Collet **". This is zstd installed through Homebrew on macOS 26 on an M1 Pro laptop. Also of interest, I was able to reproduce this with a random binary I had in /bin: https://floss.social/@mort/115940378643840495

I was completely unable to reproduce it on my Linux desktop though: https://floss.social/@mort/115940627269799738

terrelln · 2026-01-22T20:54:57 1769115297

I've figured out the issue. Use `wc -c` instead of `du`.

I can repro on my Mac with these steps with either `zstd` or `gzip`:

    $ rm -f ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    1.2M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    2.0M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    
    $ rm -f ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    1.2M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    2.1M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz

When a file is overwritten, the on-disk size is bigger. I don't know why. But you must have ran zstd's benchmark twice, and every other compressor's benchmark once.

I'm a zstd developer, so I have a vested interest in accurate benchmarks, and finding & fixing issues :)

mort96 · 2026-01-22T21:24:39 1769117079

Interesting!

It doesn't seem to be only about overwriting, I can be in a directory without any .zst files and run the command to compress 55 files in parallel and it's still 45M according to 'du -h'. But you're right, 'wc -c' shows 38809999 bytes regardless of whether 'du -h' shows 45M after a parallel compression or 38M after a sequential compression.

My mental model of 'du' was basically that it gives a size accurate to the nearest 4k block, which is usually accurate enough. Seems I have to reconsider. Too bad there's no standard alternative which has the interface of 'du' but with byte-accurate file sizes...

terrelln · 2026-01-22T21:54:47 1769118887

Yeah, it isn't quite that simple. E.g. `/bin/ksh` reports 1.4MB, but it is actually 2.4MB. Initially, I thought it was because the file was sparse, but there are only 493KB of zeros. So something else is going on. Perhaps some filesystem-level blocks are deduped from other files? Or APFS has transparent compression? I'm not sure.

It does still seem odd that APFS is reporting a significantly larger disk-size for these files. I'm not sure why that would ever be the case, unless there is something like deferred cleanup work.

mort96 · 2026-01-22T23:26:49 1769124409

Ross Burton on Mastodon suggests that it might be deduplication; when writing sequentially, later files can re-use blocks from earlier files, while that isn't the case as much when writing sequentially. That seems plausible enough to me.

mort96 · 2026-01-23T00:06:47 1769126807

I've concluded that this can't be the reason. It'd only result in an error where the size reported by 'du' is smaller than the apparent size (aka number of bytes reported by 'wc -c') of the file. What we see here is that the size reported by 'du' is almost twice as large as the number of bytes. That can't be the result of dedpulication.

I'll chalk it up to "some APFS weirdness".

Zekio · 2026-01-22T19:12:11 1769109131

doesn't zstd cap out at compression level 19?

mort96 · 2026-01-22T19:25:00 1769109900

From the man page:

    --ultra: unlocks high compression levels 20+ (maximum 22), using a lot more memory.

Regardless, this reproduces with random other files and with '-9' as the compression level. I made a mastodon post about it here: https://floss.social/@mort/115940378643840495

terrelln · 2025-10-30T20:11:56 1761855116

Its slang that has made its way from queer culture into mainstream. It is not meant to satisfy anyones ego, other than it just has vague positive connotations. It is used in a pretty equivalent way as "take that money, dude".

terrelln · 2025-10-07T16:17:19 1759853839

Out of curiosity, what was the input file format?

We actually worked on a demo WAV compressor a while back. We are currently missing codecs to run the types of predictors that FLAC runs. We expect to add this kind of functionality in the future, in a generic way that isn't specific to audio, and can be used across a variety of domains.

But, generally we wouldn't expect to generally beat FLAC. But, be able to offer specialized compressors for many types of data that previously weren't important enough to spawn a whole field of specialized compressors, by significantly lowering the bar for entry.

p1mrx · 2025-10-07T20:30:33 1759869033

The input was just CD audio, "One More Time" by Daft Punk.

test.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz

terrelln · 2025-10-06T22:34:30 1759790070

You could have an LLM generate the SDDL description [0] for you, or even have it write a C++ or Python tokenizer. If compression succeeds, then it is guaranteed to round trip, as the LLM-generated logic lives only on the compression side, and the decompressor is agnostic to it.

It could be a problem that is well-suited to machine learning, as there is a clear objective function: Did compression succeed, and if so what is the compressed size.

[0] https://openzl.org/api/c/graphs/sddl/

terrelln · 2025-10-06T22:31:15 1759789875

The charts in the "Results With OpenZL" section compare against all levels of zstd, xz, and zlib.

On highly structured data where OpenZL is able to understand the format, it blows Zstandard and Xz out of the water. However, not all data fits this bill.

terrelln · 2025-10-06T22:28:55 1759789735

We left it out of the paper because it is an implementation detail that is absolutely going to change as we evolve the format. This is the function that actually does it [0], but there really isn't anything special here. There are some bit-packing tricks to save some bits, but nothing crazy.

Down the line, we expect to improve this representation to shrink it further, which is important for small data. And to allow to move this representation, or parts of it, into a dictionary, for tiny data.

[0] https://github.com/facebook/openzl/blob/d1f05d0aa7b8d80627e5...

yubblegum · 2025-10-06T22:33:46 1759790026

Thanks! (Super cool idea btw.)

terrelln · 2025-10-06T22:19:22 1759789162

Do you happen to have a pointer to a good open source dataset to look at?

Naively and knowing little about CRAM, I would expect that OpenZL would beat Zstd handily out of the box, but need additional capabilities to match the performance of CRAM, since genomics hasn't been a focus as of yet. But it would be interesting to see how much we need to add is generic to all compression (but useful for genomics), vs. techniques that are specific only to genomics.

We're planning on setting up a blog on our website to highlight use cases of OpenZL. I'd love to make a post about this.

bede · 2025-10-06T22:55:36 1759791336

For BAM this could be a good place to start: https://www.htslib.org/benchmarks/CRAM.html

Happy to discuss further

terrelln · 2025-10-06T23:04:08 1759791848

Amazing, thank you!

I will take a look as soon as I get a chance. Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting.

fwip · 2025-10-07T03:45:55 1759808755

Another format that might be worth looking at in the bioinformatics world is hdf5. It's sort of a generic file format, often used for storing multiple related large tables. It has some built-in compression (gzip IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the hdf5 format with the self-describing decompression routines of openZL.

felixhandte · 2025-10-07T18:06:57 1759860417

Wanna hop over to https://github.com/facebook/openzl/issues/76?

terrelln · 2025-10-06T21:52:23 1759787543

You'd have to tell OpenZL what your format looks like by writing a tokenizer for it, and annotating which parts are which. We aim to make this easier with SDDL [0], but today is not powerful enough to parse JSON. However, you can do that in C++ or Python.

Additionally, it works well on numeric data in native format. But JSON stores it in ASCII. We can transform ASCII integers into int64 data losslessly, but it is very hard to transform ASCII floats into doubles losslessly and reliably.

However, given the work to parse the data (and/or massage it to a more friendly format), I would expect that OpenZL would work very well. Highly repetitive, numeric data with a lot of structure is where OpenZL excels.

[0] https://openzl.org/api/c/graphs/sddl/

kstenerud · 2025-10-07T01:08:47 1759799327

I've done a binary representation of JSON-structured data that uses unary coding for variable length length fields: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...

This tends to confuse generic compressors, even though the sub-byte data itself usually clusters around the smaller lengths for most data and thus can be quite repetitive (plus it's super efficient to encode/decode). Could this be described such that OpenZL can capitalize on it?

terrelln · 2025-10-06T21:40:30 1759786830

> Is openzl indexable

Not today. However, we are considering this as we are continuing to evolve the frame format, and it is likely we will add this feature in the future.