Once you have a general file-system representation format (with all its complexities), adding compression of the blobs seems like a minor addition.
On a related note, I was surprised to discover that the Windows 10 backup tool was able to store about 240GB of various data in barely 80GB of backup; I believe it must have sliced most files looking for common fragments to deduplicate (with some NTFS magic behind it, maybe). .tar.xz will never be able to do that: if I try to compress 10 copies of the entire Firefox codebase, it will never be able to recognize the duplicated files; only something like a file-system+compression could do that.
> only something like a file-system+compression could do that
borg handles this just fine. I put all kinds of stuff into borg repositories: raw MySQL/PostgreSQL data directories, tar archives (both compressed and uncompressed), or just / recursively. You can do stuff like:
$ tar -caf - / | borg create …
or even
$ borg create … </dev/sda
and your repository grows by the amount of data changed since the last backup (or by a couple of kilobytes if nothing has changed).
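For a sense of how that looks in practice, a minimal sketch (repository path, archive names and the compression choice are just examples, not something borg mandates):

$ borg init --encryption=repokey /backup/repo
$ borg create --stats --compression zstd /backup/repo::home-{now} /home    # first run: everything is chunked, deduplicated, compressed
$ borg create --stats --compression zstd /backup/repo::home-{now} /home    # later runs: only new/changed chunks are added

The --stats output includes a "Deduplicated size" figure, which is roughly how much the repository actually grew for that run.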
> once you have a general file-system representation format (with all its complexities), adding compression of the blobs seems like a minor addition
That's a vacuous statement, since there will never be a "general file-system representation format"; that's my point. Even if someone collected all the features of every filesystem ever developed, that would still ignore the filesystems which haven't been invented yet.
Further, it requires choosing which compression algorithm to bake in. What about those that haven't been invented yet?
These problems only arise if we want to define "one true archiver+compressor". If we keep these concerns separate, there's no problem: we choose/create a format for our data, and choose/create a compressor appropriate to our requirements (speed, size, ratio, availability, etc.).
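For example (file and directory names are arbitrary), the archiving step stays identical and the compressor is whatever fits today's constraints:

$ tar -cf project.tar ./project    # archiving: one concern
$ gzip -k project.tar              # maximum availability
$ zstd -k project.tar              # speed
$ xz -9 -k project.tar             # ratio

Swap the last step at will; nothing about the .tar needs to change.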
> .tar.xz will never be able to do that: if I try to compress 10 copies of the entire Firefox codebase, it will never be able to recognize the duplicated files; only something like a file-system+compression could do that
This seems to miss my point, in several ways:
Firstly, xz has a relatively small dictionary (64 MiB even at -9), so your use-case would benefit from an algorithm which detects long-range patterns. Choosing a different compression algorithm for a .tar file is trivial, since it's a separate step, whereas formats like 7zip, zip, etc. lock us in to a meagre handful of hard-coded algorithms. That's the point I'm trying to make.
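A rough sketch of what I mean (directory names made up): the same .tar, three different compressors, and only the long-range ones will see the duplication:

$ tar -cf copies.tar firefox-copy-*/
$ xz -9 -k copies.tar             # 64 MiB dictionary: duplicates further apart than that go unnoticed
$ zstd --long=31 -k copies.tar    # 2 GiB match window (64-bit): whole duplicated trees can be matched
$ lrzip copies.tar                # rzip-style long-range pass (window scales with RAM), then LZMA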
Secondly, .tar is designed for storing what it's given "as is": giving it hardlinked copies of the Firefox source will produce an archive with one copy and some links, as expected; giving it multiple separate copies will produce an archive with multiple copies, as expected. That's not appropriate for your use-case, so you would benefit from a different archive format that performs deduplication. Again, you're only free to do this if you don't conflate archiving with compression!
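You can see that "as is" behaviour directly (made-up paths, GNU cp/tar assumed):

$ cp -a  firefox firefox-2                  # independent copy: tar stores the data again
$ cp -al firefox firefox-3                  # hard-linked copy: tar stores link entries only
$ tar -cf test.tar firefox firefox-2 firefox-3
$ tar -tvf test.tar | grep -c 'link to'     # members recorded as hardlinks: a header each, rather than a second copy of the data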
In your case, it looks like a WIM archive compressed with lrzip (a .wim.lrz) would be the best combination: deduplicating files where possible, and compressing any long-range redundancies that remain. This should give better compression, and scale to larger sizes, than either .tar.xz or .7z.
(Note that WIM seems to also make the mistake of hard-coding a handful of compression algorithms, so you'd want to ignore that option and use its raw, uncompressed mode. My brief Googling didn't find any alternative formats which avoid such hard-coding :( )
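Something along these lines, if you want to try it (wimlib-imagex is the usual CLI for WIM on Linux; the paths are made up and I haven't benchmarked this myself):

$ wimlib-imagex capture --compress=none ./ten-firefox-trees backup.wim    # per-file dedup, no built-in compression
$ lrzip backup.wim                                                        # long-range compression as a separate step -> backup.wim.lrz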
I might have made stronger statements than warranted...
What I was trying to say is that .tar IS a filesystem description format, used to convert the filesystem into a stream that is then compressed separately.
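For instance (any directory will do), the tar stream itself carries names, permissions, owners and timestamps, while whatever sits after the pipe only ever sees opaque bytes:

$ tar -cf - ./project | tar -tvf -                 # the stream is a listable description of the tree
$ tar -cf - ./project | xz | xz -d | tar -tvf -    # a compressor in the middle changes nothing about that description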