Please add (I believe this should be trivial) all the tar's metadata-related fea...

c0l0 · on June 23, 2022

This is kinda naive to wish for - tar has applications that reach beyond stuffing a BLOB containing other BLOBs and some metadata into a seekable file somewhere, such as archiving data to magnetic tape/LTO.

And even if you develop the perfect archive format that happens to nail every possible use-case 100% (which you won't, because there are just too many), you will STILL have to deal with the compressed and archived artifacts that have accumulated over the last six or so decades in various places.

qwerty456127 · on June 23, 2022

> tar has applications... magnetic tape/LTO.

That's a parallel universe far, far away: ~99% of the people who have to use tar every day (just because it's the standard and the only format supporting links/access metadata) will never have to use a tape drive. I don't see a reason why we can't use 2 separate formats - one for everyday packaging, a different one for tape drives.

> And even if you develop the perfect archive format

There is a perfect format already - it's 7z. Just add file access rights and links support to it. No need to invent anything really new.

kelnos · on June 23, 2022

> There is a perfect format already - it's 7z. Just add file access rights and links support to it.

These two sentences seem to contradict each other.

qwerty456127 · on June 23, 2022

Not really. Adding support for a particular kind of metadata is a change so minuscule it doesn't qualify as a change in the format. Apple just stores its filesystem metadata in a special sub-directory in ZIP files, and the only problem with Apple's solution is that nobody else respects it. 7zip is a format developed and maintained by a specific author who is alive and active, so he can just build the same thing into the standard 7zip implementations, and chances are everybody will accept it.

By the way, I have just found an actual imperfection in 7zip: it doesn't let you choose the order in which archived files are stored, nor choose different compression parameters for specific files. This limits its applicability. E.g. the EPUB standard says the first file in an archive must be "mimetype" and it must not be compressed. But I believe this can be fixed with reasonable ease (and probably without breaking changes) as well.
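
For reference, EPUB containers are ZIP files, and the ZIP format does allow both per-entry ordering and per-entry compression settings. A minimal sketch with Python's standard zipfile module; the function name and the container.xml content are placeholders for illustration, not a valid EPUB:

```python
import io
import zipfile


def build_epub_skeleton() -> bytes:
    """Return a tiny ZIP whose first entry is an uncompressed 'mimetype'."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Entries are written in call order, so 'mimetype' comes first,
        # and ZIP_STORED means "no compression" for this entry only.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # Later entries may use a different method per file.
        # (Placeholder content, not a real container.xml.)
        zf.writestr("META-INF/container.xml", "<container/>",
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


if __name__ == "__main__":
    with zipfile.ZipFile(io.BytesIO(build_epub_skeleton())) as zf:
        for info in zf.infolist():
            print(info.filename, info.compress_type)
```

This is the behaviour the comment is asking 7zip to offer: caller-controlled entry order plus a compression choice per file.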

chriswarbo · on June 23, 2022

7z makes the same mistake as zip: implementing both a filesystem and a compression algorithm. There's no way a single tool can implement all the bells and whistles of the various filesystems in use today. For example, skimming through the 7zip format ( https://py7zr.readthedocs.io/en/latest/archive_format.html ) I can see specific support for file attributes from (current versions of) Windows and Unix, but no AmigaOS protect bits ( http://www.jaruzel.com/amiga/amiga-os-command-reference-help... )

Keeping these two tasks separate allows swapping-out the implementation of each (e.g. I tend to use .tar.lz these days, since I'm mostly on Unix)
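
A rough sketch of what "swapping out" looks like in practice, using only Python's standard library: the archive is built once, and any byte-stream compressor can be applied afterwards. The helper names are made up for illustration, and lzip (.lz) is not in the standard library, so gzip, bzip2 and xz stand in for it here:

```python
import bz2
import gzip
import io
import lzma
import tarfile


def make_tar(files: dict[str, bytes]) -> bytes:
    """Archiving step: pack named blobs into an uncompressed tar stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mode = 0o640  # metadata lives in the archive layer
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Compression step: any bytes -> bytes pair works, the tar layer doesn't care.
COMPRESSORS = {
    "gz": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def compressed_sizes(files: dict[str, bytes]) -> dict[str, int]:
    """Compress the same tar stream with each compressor; return the sizes."""
    raw = make_tar(files)
    sizes = {}
    for ext, (compress, decompress) in COMPRESSORS.items():
        packed = compress(raw)
        assert decompress(packed) == raw  # the archive survives unchanged
        sizes[ext] = len(packed)
    return sizes


if __name__ == "__main__":
    print(compressed_sizes({"hello.txt": b"hello\n" * 1000}))
```

Adding a new compressor is one more entry in the table; nothing in the archiving code changes.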

afiori · on June 23, 2022

once you have a general file-system representation format (with all its complexities) adding compression of the blobs seems a minor addition.

On a related note, I was surprised to discover that the Windows 10 backup tool was able to store about 240GB of various data in barely 80GB of backup; I believe it must have split most files into pieces to look for common fragments to deduplicate (maybe with some NTFS magic behind it). .tar.xz will never be able to do that: if I try to compress 10 copies of the entire Firefox codebase, it will never be able to recognize the duplicated files; only something like a file-system+compression could do that.

5e92cb50239222b · on June 23, 2022

> only something like a file-system+compression could do that

borg handles this just fine. I put all kinds of stuff into borg repositories: raw MySQL/PostgreSQL data directories, tar archives (both compressed and uncompressed), or just / recursively. You can do stuff like:

  $ tar -caf - / | borg create …

or even

  $ borg create … </dev/sda

and your repository grows by the amount of data changed since the last backup (or by a couple of kilobytes if nothing has changed).

https://github.com/borgbackup/borg
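
To illustrate why the repository only grows by the changed data, here is a toy chunk store in Python. It is a simplification and not borg's implementation: borg uses content-defined chunking, whereas this sketch assumes fixed-size chunks, and `ChunkStore` and `CHUNK_SIZE` are hypothetical names:

```python
import hashlib
import os

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


class ChunkStore:
    """Toy deduplicating store: each distinct chunk is kept exactly once."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}  # sha256 hex digest -> chunk

    def add(self, data: bytes) -> list[str]:
        """Store data and return its 'recipe' (ordered chunk digests)."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Already-known chunks cost nothing but a recipe entry.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def restore(self, recipe: list[str]) -> bytes:
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[digest] for digest in recipe)

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


if __name__ == "__main__":
    store = ChunkStore()
    first = os.urandom(10 * CHUNK_SIZE)
    store.add(first)
    before = store.stored_bytes()
    store.add(first)  # an unchanged second "backup"
    print("growth after identical backup:", store.stored_bytes() - before)
```

Fixed-size chunks stop lining up as soon as bytes are inserted or removed in the middle of the input, which is the case content-defined chunking is meant to handle.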

chriswarbo · on June 23, 2022

> once you have a general file-system representation format (with all its complexities) adding compression of the blobs seems a minor addition

That's a vacuous statement, since there will never be a "general file-system representation format"; that's my point. Even if someone collected together all the features of every filesystem ever developed, that would still ignore those which haven't been invented yet.

Further, it requires choosing a compression algorithm: which one? What about those that haven't been invented yet?

These problems only arise if we want to define "one true archiver+compressor". If we keep these concerns separate, there's no problem: we choose/create a format for our data, and choose/create a compressor appropriate to our requirements (speed, size, ratio, availability, etc.)

> .tar.xz will never be able to do that, if I try to compress 10 copies of the entire firefox codebase, it will never be able to recognize the duplicated files; only something like a file-system+compression could do that

This seems to miss my point, in several ways:

Firstly, xz has a relatively small dictionary, so your use-case would benefit from an algorithm which detects long-range patterns. Choosing a different compression algorithm for a .tar file is trivial, since it's a separate step; whereas formats like 7zip, zip, etc. lock us in to a meagre handful of hard-coded algorithms. That's the point I'm trying to make.
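
A small-scale sketch of the window effect, in Python. Here zlib's 32 KiB window plays the part of the "too small" dictionary, and xz's default 8 MiB dictionary plays the long-range one; with multi-gigabyte source trees, xz ends up on the losing side of the same comparison. The function name and block size are arbitrary choices for the demo:

```python
import lzma
import os
import zlib


def compare_windows(block_size: int = 200_000) -> dict[str, int]:
    """Compress two identical random blocks with a small and a large window."""
    block = os.urandom(block_size)  # incompressible on its own
    data = block + block            # second half repeats the first exactly
    return {
        "input": len(data),
        # The repeat starts 200 kB back, far outside zlib's 32 KiB window,
        # so zlib cannot refer to the first copy.
        "zlib": len(zlib.compress(data, 9)),
        # xz's default dictionary covers the whole input, so the second
        # copy can be encoded as references to the first.
        "xz": len(lzma.compress(data)),
    }


if __name__ == "__main__":
    for name, size in compare_windows().items():
        print(f"{name:>5}: {size} bytes")
```

The zlib output should stay close to the full input size, while the xz output should be roughly the size of a single block.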

Secondly, .tar is designed for storing what it's given "as is": giving it hardlinked copies of the Firefox source will produce an archive with one copy and some links, as expected; giving it multiple separate copies will produce an archive with multiple copies, as expected. That's not appropriate for your use-case, so you would benefit from a different archive format that performs deduplication. Again, you're only free to do this if you don't conflate archiving with compression!
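
The "as is" behaviour can be checked with Python's tarfile module. This sketch assumes a filesystem that supports hard links (`os.link`), and the file names are made up for the demo:

```python
import io
import os
import tarfile
import tempfile


def tar_member_types() -> dict[str, str]:
    """Archive an original, a hard link to it, and a separate copy."""
    buf = io.BytesIO()
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")
        os.mkdir(src)
        original = os.path.join(src, "a_original.txt")
        with open(original, "wb") as f:
            f.write(b"same bytes\n")
        # Second name for the same inode.
        os.link(original, os.path.join(src, "b_hardlink.txt"))
        # Same content, but an independent file.
        with open(os.path.join(src, "c_copy.txt"), "wb") as f:
            f.write(b"same bytes\n")
        with tarfile.open(fileobj=buf, mode="w") as tar:
            tar.add(src, arcname="src")

    # Read the archive back and classify each member.
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return {
            m.name: "hardlink" if m.islnk() else "file" if m.isreg() else "other"
            for m in tar.getmembers()
        }


if __name__ == "__main__":
    for name, kind in tar_member_types().items():
        print(kind, name)
```

One of the two hard-linked names is stored as a link entry pointing at the other, while the separate copy is stored in full even though its bytes are identical.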

In your case, it looks like a .wim.lrzip file would be the best combination: deduplicating files where possible, and compressing any long-range redundancies that remain. This should give better compression, and scale to larger sizes, than either .tar.xz or .7z

(Note that WIM seems to also make the mistake of hard-coding a handful of compression algorithms, so you'd want to ignore that option and use its raw, uncompressed mode. My brief Googling didn't find any alternative formats which avoid such hard-coding :( )

afiori · on June 23, 2022

I might have made stronger statements than warranted...

What I was trying to say is that .tar IS a filesystem description format; used to convert the filesystem into a stream that is then compressed separately.

qwerty456127 · on June 23, 2022

I understand your point, yet my whole point is the exact opposite.