
Once you have a general file-system representation format (with all its complexities), adding compression of the blobs seems a minor addition.

On a related note, I was surprised to discover that the Windows 10 backup tool was able to store about 240GB of various data in barely 80GB of backup; I believe it must have split most files into chunks to look for common fragments to deduplicate (perhaps with some NTFS magic behind it). .tar.xz will never be able to do that: if I try to compress 10 copies of the entire Firefox codebase, it will never recognize the duplicated files; only something like a file-system+compression could do that.



> only something like a file-system+compression could do that

borg handles this just fine. I put all kinds of stuff into borg repositories: raw MySQL/PostgreSQL data directories, tar archives (both compressed and uncompressed), or just / recursively. You can do stuff like:

  $ tar -caf - / | borg create …
or even

  $ borg create … </dev/sda
and your repository grows by the amount of data changed since the last backup (or by a couple of kilobytes if nothing has changed).

https://github.com/borgbackup/borg
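
For example (a rough sketch; the repository path, archive names, and firefox directories are placeholders), adding a second, identical copy of a source tree barely grows the repository:

  $ borg init --encryption=none /path/to/repo
  $ borg create /path/to/repo::one firefox/
  $ cp -r firefox firefox-2
  $ borg create /path/to/repo::two firefox/ firefox-2/
  $ borg info /path/to/repo    # deduplicated size grows only marginally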


> once you have a general file-system representation format (with all its complexities) adding compression of the blobs seems a minor addition

That's a vacuous statement, since there will never be a "general file-system representation format"; that's my point. Even if someone collected together all the features of every filesystem ever developed, that would still ignore those which haven't been invented yet.

Further, it requires choosing a compression algorithm: which one? What about those that haven't been invented yet?

These problems only arise if we want to define "one true archiver+compressor". If we keep these concerns separate, there's no problem: we choose/create a format for our data, and choose/create a compressor appropriate to our requirements (speed, size, ratio, availability, etc.)
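
As a purely illustrative sketch of that separation (file and directory names made up), the same tar stream can be piped through whichever compressor suits the job:

  $ tar -cf - data/ | xz -9    > data.tar.xz
  $ tar -cf - data/ | zstd -19 > data.tar.zst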

> .tar.xz will never be able to do that: if I try to compress 10 copies of the entire Firefox codebase, it will never recognize the duplicated files; only something like a file-system+compression could do that

This seems to miss my point, in several ways:

Firstly, xz has a relatively small dictionary, so your use-case would benefit from an algorithm which detects long-range patterns. Choosing a different compression algorithm for a .tar file is trivial, since it's a separate step; whereas formats like 7zip, zip, etc. lock us in to a meagre handful of hard-coded algorithms. That's the point I'm trying to make.
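
To illustrate (with hypothetical directory and file names; a sketch, not a benchmark), switching to a long-range matcher is just a different pipe or post-processing step:

  $ tar -cf - ten-firefox-copies/ | zstd -19 --long=27 > copies.tar.zst   # 128 MiB match window
  $ tar -cf copies.tar ten-firefox-copies/ && lrzip copies.tar            # long-range pre-processing stage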

Secondly, .tar is designed for storing what it's given "as is": giving it hardlinked copies of the Firefox source will produce an archive with one copy and some links, as expected; giving it multiple separate copies will produce an archive with multiple copies, as expected. That's not appropriate for your use-case, so you would benefit from a different archive format that performs deduplication. Again, you're only free to do this if you don't conflate archiving with compression!
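
For instance (a sketch with hypothetical directory names), tar records hardlinks as links rather than duplicate data:

  $ cp -al firefox firefox-hl     # hardlinked copy, no duplicate data on disk
  $ tar -cf copies.tar firefox firefox-hl
  $ tar -tvf copies.tar | grep 'link to' | head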

In your case, it looks like a .wim.lrzip file would be the best combination: deduplicating files where possible, and compressing any long-range redundancies that remain. This should give better compression, and scale to larger sizes, than either .tar.xz or .7z

(Note that WIM seems to also make the mistake of hard-coding a handful of compression algorithms, so you'd want to ignore that option and use its raw, uncompressed mode. My brief Googling didn't find any alternative formats which avoid such hard-coding :( )
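
If you want to try that route, something along these lines should work with wimlib (I haven't verified the exact flags, so treat it as a sketch; the paths are placeholders):

  $ wimlib-imagex capture ten-firefox-copies/ copies.wim --compress=none
  $ lrzip copies.wim              # produces copies.wim.lrz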


I might have made stronger statements than warranted...

What I was trying to say is that .tar IS a filesystem description format, used to convert the filesystem into a stream that is then compressed separately.




