As far as I know, the standard tool for this is rdfind. This new tool claims to be "blazingly fast", so it should provide something to back that claim up. Ideally a comparison with rdfind, but even a basic benchmark would make it less dubious. https://github.com/pauldreik/rdfind
But the main problem is not the suspicious performance, it's the lack of explanation. The tool is supposed to "find duplicate files (photos, videos, music, documents)". Does it mean it is restricted to some file types? Does it find identical photos with different metadata to be duplicates? Compare this with rdfind which clearly describes what it does, provides a summary of its algorithm, and even mentions alternatives.
Overall, it may be a fine toy/hobby project (only 3 commits, 3 months ago); I didn't read the code (except to find the command-line options). I don't get why it got so much attention.
I use and test assorted duplicate finders regularly.
fdupes is the classic (going way way back) but it's really very slow, not worth using anymore.
The four I know of that are worth trying these days (depending on data set, hardware, file arrangement and other factors, any one of them might be fastest for a specific use case) are:
afaik fdupes is super slow because it checksums entire files in order to find duplicates. This causes a ton of unnecessary IO if you have a lot of size collisions.
The efficient way to do things is to just read files in parallel and break once they diverge. Basically how `cmp` works.
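A minimal sketch of that idea in bash (GNU find and coreutils assumed; it only compares neighbouring same-size files, whereas a real tool would handle whole size groups, and filenames with tabs or newlines will break it):

  # list files as "size<TAB>path", sort by size, pair up neighbours of equal size
  find . -type f -printf '%s\t%p\n' | sort -n |
  awk -F'\t' '$1 == prev_size { print prev_file "\t" $2 }
              { prev_size = $1; prev_file = $2 }' |
  while IFS=$'\t' read -r a b; do
      # cmp stops reading as soon as the files diverge
      cmp -s "$a" "$b" && printf 'possible dup: %s == %s\n' "$a" "$b"
  done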
Rdfind is the logical evolution of fdupes. Not only faster but also more clever. This does not mean that fdupes is a bad tool at all, but rdfind can do things that fdupes can't.
Example. For file-X.txt with X = 1 to 10000, you have three copies of each file in:
mydir/file-X.txt
mydir/subdir/file-X.txt and
mydir/subdir/copy/file-X.txt.
Fdupes would delete random files in mydir/, mydir/subdir/ and mydir/subdir/copy/. You would end up with the remaining files scattered all over the directory tree. A mess with three incomplete copies.
Rdfind correctly guesses that what most people want is to remove all the files in two of the directories entirely and keep one copy (files and directory tree) intact. So it wipes the inner subdirs in a predictable way and keeps the outer dir intact. This is a terrific feature, able to disentangle a directory tree that has been cloned and nested inside the original copy without destroying it, like in this case.
Instead of trying to be clever, fclones gives the user a choice. It can select files to remove by their nesting level, creation/update/access time or glob expression (include/exclude). It also allows the user to modify the list of found files before deleting.
jdupes, rdfind, fclones and possibly many others compute checksums of prefixes (and sometimes suffixes) of the files. That reduces the number of files to be fully hashed.
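Roughly, the prefix trick looks like this (a hedged shell sketch, not how any of those tools actually implement it; GNU coreutils assumed, and filenames containing newlines will break it):

  # prefix pass: hash only the first 4 KiB of every file
  find . -type f | while read -r f; do
      printf '%s  %s\n' "$(head -c 4096 "$f" | md5sum | cut -d' ' -f1)" "$f"
  done | sort > /tmp/prefix_hashes.txt

  # full pass: fully hash only files whose 4 KiB prefix hash repeats
  uniq -w 32 -D /tmp/prefix_hashes.txt | sed 's/^[0-9a-f]*  //' | xargs -d '\n' md5sum

Only the second pass reads whole files, which is where the savings come from on large collections.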
Computing incremental checksums in parallel and breaking off once checksums diverge does not work very well on HDDs, because the files can be in distant physical locations, and that would cause a lot of seeks (seeks are terribly slow on rotational drives).
That depends on the distribution of offsets at which a difference is found (if any).
If all equally-sized files on your disk are duplicates, it makes you read both files in their entirety in a scattered fashion. That's (expected to be) slow, indeed.
However, if most of them aren’t, chances are you find a difference in the first block read, which means you have to read only two blocks to decide they’re different.
So, what’s that distribution? Also, where is one most likely to find differences in equally-sized files? First block? Last?
> However, if most of them aren’t, chances are you find a difference in the first block read,
But just to read the first block, you need to open both files and move the heads from one to another at least once. If the files are placed in distant cylinders, this will be a significant amount of time (a few ms).
At the beginning, fclones also processed files in each group matching by size together. But files matching by size do not necessarily live physically close to each other. Later fclones got a great speedup on HDD when I switched to processing the files in their physical order, regardless of their logical grouping. It is a bit more complex to do in terms of programming because in each grouping pass you have to "ungroup" the files to a flat list, sort the list by locations and group again, but it is really worth it (on rotational drives). You can read about this here: https://pkolaczk.github.io/disk-access-ordering/
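A crude shell approximation of that reordering, using inode numbers as a rough stand-in for physical placement (fclones itself uses real physical offsets, as the post explains; /tmp/candidates.txt is a hypothetical list of files to hash):

  # hash candidate files in (roughly) on-disk order instead of group order
  xargs -d '\n' stat -c '%i %n' < /tmp/candidates.txt |
  sort -n | cut -d' ' -f2- |
  while read -r f; do md5sum "$f"; done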
Well yes, that's all true. The approach of alternating reads of blocks is slow only if you need to do it. If there's a diff in the first block (or so) then there's not much to read.
But it's a good example of why there can't be one optimal strategy for finding duplicates. Depending on the data set composition, various implementations turn out to be the fastest. It depends heavily on where most of the files differ (or not).
I think we need a lookup table of marketing speech to real-world performance metrics. Blazingly fast has been showing up a lot lately.
The cynical side of me wants to know what features and safety checks a "blazingly fast" tool has not implemented that the older "glacially slow" tool it is replacing ended up implementing after all the edge conditions were uncovered.
Do you know of any tool that does a good job of finding files that differ only in their metadata or even better can use a perceptual hash to find possible matches? Geeqie's find duplicates seems to do the latter, but afaict you can't run that function from the command line.
Over the years I've used many, many tools intended to solve this problem. In the end, after much frustration, I just use existing tools, glued together in a un*x manner.
find * -type f -exec md5sum '{}' ';' \
  | tee /tmp/index_file.txt \
  | gawk '{print $1}' \
  | sort | uniq -c \
  | gawk '$1 > 1 { print $2 }' \
  > /tmp/duplicates.txt
# hashes that occur more than once are the duplicates; show each group
for m in $( cat /tmp/duplicates.txt )
do
  grep $m /tmp/index_file.txt
  echo ========
done \
  | less
Tweak as necessary. I do have a comparison executable that only compares sizes and sub-portions to save time, but I generally find it's not worth it.
It takes less time to type this than it does to remember what some random other tool is called, or how to use it. I also have saved a variant that identifies similar files, and another that identifies directory structures with lots of shared files, but those are (understandably) more complex (and fragile).
Most programs I tested have very simple basic usage - just the program name and a list of directories. I doubt typing the above would be faster, and figuring it out for the first time definitely wouldn't be. Also, executing that on a million files would take ages, even compared to the slowest proper duplicate finders.
Anyway, thanks for sharing - it is always very exciting to see how far you can go with a few unix utilities and a bit of scripting :)
I haven't used them for quite some time, so I'll have to dig them out. I'll also have to remember their limitations so you can better assess whether they're useful as they stand, or need to be used purely as inspiration.
A shameless plug, but it is a simple (and probably badly written) tool I made many years ago to scratch an itch, and I still use it.
It finds duplicate files and replaces them with hard links, saving you space. Just make sure you provide it with paths in the same filesystem.
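The gist of it is something like this (a simplified sketch rather than duphard's actual code; $orig and $dup stand for two duplicate paths on the same filesystem):

  # only if the contents really match, replace the copy with a hard link;
  # ln will refuse anyway if the two paths are on different filesystems
  cmp -s "$orig" "$dup" && ln -f "$orig" "$dup"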
I originally wrote it to save some space from personal files (videos, photos, etc), but it turned out very useful for tar files, docker images, websites, and more.
For example, I maintain a tar file and a docker image with Kafka connectors which share many jar files. Using duphard I can save hundreds of megabytes, or even more than a gigabyte! For a documentation website with many copies of the same image (let's just say some static generators favor this practice for maintaining multiple versions), I can reduce the website size by 60%+, which then makes ssh copies, docker pulls, etc. way faster, speeding up deployment times.
Unfortunately fdupes does not do this, or at least nothing in its docs points to this functionality.
I say unfortunately because before writing duphard, I tested fdupes and a couple other utilities (duff and duperemove) but none offered the functionality I needed.
This program uses CRC32 to compute hashes. This is a terrible idea - a 32-bit hash is just too short and the probability of collisions is way too high. Around 77,000 files are already enough to get a 50% probability of at least one collision. Even though this is reduced by the additional matching on extension and size, I wouldn't trust it to delete any files.
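That figure is just the birthday bound for a 32-bit hash; assuming uniformly distributed hashes, a quick sanity check:

  # files needed for ~50% chance of at least one 32-bit collision: sqrt(2 * 2^32 * ln 2)
  awk 'BEGIN { printf "%d\n", sqrt(2 * 2^32 * log(2)) }'    # prints ~77163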
Use fclones, fslint, jdupes, rdfind instead, which either use much stronger hashes (128-bit) or even verify files by direct byte-to-byte comparison.
AFAIK CRCs are not the fastest "hashes" you can get. Some non-cryptographic hashes outperform CRCs by a large factor and provide longer checksums and much better statistical properties.
I've been using 'czkawka' since its earliest inception. It seems to do a similar task using file hashes, but it can also do similarity matching on pictures and the like.
https://github.com/qarmin/czkawka
As the author of the tool, thanks a lot for the wonderful input! Many comments are actionable; I'll incorporate them into the code soon.
Now to address a few concerns:
# The tool doesn't delete anything -- As the name suggests, it just finds duplicates. Check it out.
# File uniqueness is determined by file extension + file size + CRC32 of the first 4 KiB, middle 2 KiB and last 2 KiB (a rough sketch of this is below)
# That may not seem like much, but on my portable hard drive with >172K files (a mix of video, audio, pics and source code), I got the same number of collisions as with a SHA-256 of the entire file (by the way, I'm planning to add an option in the tool to do this)
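A rough shell equivalent of that fingerprint (just an illustration, not the tool's actual code; cksum's CRC stands in for the CRC32 implementation used, and the path is only an example):

  f=some/video.mp4                                   # hypothetical example file
  size=$(stat -c %s "$f")
  first=$(head -c 4096 "$f" | cksum | cut -d' ' -f1)
  middle=$(tail -c +$(( size / 2 )) "$f" | head -c 2048 | cksum | cut -d' ' -f1)
  last=$(tail -c 2048 "$f" | cksum | cut -d' ' -f1)
  echo "${f##*.} $size $first $middle $last"         # extension + size + three CRCs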
This sounds interesting and should probably be able to run on a whole system. What if you run it on the files of the OS itself, e.g. the whole C drive or wherever the Linux system files live? Will there be any collisions?
FWIW if people are interested, I wrote https://github.com/karteum/kindfs for the purpose of indexing the hard drive, with the following goals:
* being able to detect not only duplicate files but also duplicate dirs (without returning all their sub-contents as duplicates)
* being able to query multiple times without having to re-scan, and to do other types of queries (i.e. I am computing a hash on all files, not only of those with duplicate sizes. This makes scanning slower but enables other use-cases. N.b. beware that I only hash fixed portions of files for files>3MB, which is enough for my use-case considering that I always triple-check the results and is a reasonable tradeoff for performance, but it might not be OK for everyone !)
* being able to tell whether all files in dir1/ are included in dir2/ (regardless of file/dir structure)
* being able to mount the sqlite index as a FUSE FS (which is convenient for e.g. diff -r or qdirstat...)
Still work-in-progress, yet it works for several of my use-cases
The first one I found, and still use now that it's obvious fslint is EOL, is czkawka [0] (meaning hiccup in Polish). Its speed is an order of magnitude higher than fslint's, and its memory use is 20%-75% of it.
<;)> Satisfied customer, would buy it again. </;)>
On Windows if you download the same file more than once you will have foo.doc, "foo (1).doc", "foo (2).doc" etc. A script that just looked for files with such names, compared them to foo.doc, and deleted them if they are the same would be useful.
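Something along these lines would cover the common case (a hedged sketch for a POSIX-ish shell such as WSL or Git Bash, not a polished Windows solution; filenames containing newlines will break it):

  # for every "name (N).ext", delete it if it is byte-identical to "name.ext"
  find . -type f -name '* ([0-9]*)*' | while read -r dup; do
      orig=$(printf '%s\n' "$dup" | sed 's/ ([0-9][0-9]*)//')
      [ -f "$orig" ] && cmp -s "$orig" "$dup" && rm -v "$dup"
  done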
Does it only find duplicate files or will it also find duplicate directory hierarchies?
Example:
/some/location/one/January/Photos
/some/location/two/January/Photos
I need a tool that would return a match on January directory.
It would be great to be able to filter things. So for example, if I have backups of my dev folder, I want to filter out all the virtual envs (venv below):
/home/HumblyTossed/dev/venv/bin
/home/HumblyTossed/backups/dev/venv/bin
If you want something that scales horizontally (mostly), dcmp from https://github.com/hpc/mpifileutils is an option. It can chunk up files and do the comparison in parallel on multiple servers.
A better implementation could be to perform a full hash (not CRC32 though; maybe even a byte-by-byte comparison) when the fuzzy hashes match, which happens with small probability anyway.
No. CRC32 doesn't care what you throw at it. It is related to the speed of building a "database", for lack of a better word, of the file. Instead of computing CRC32 over the entire file, you just take chunks of it, increasing the speed. However, this approach is definitely flawed, as there are plenty of file types that have identical beginnings and ends, so only the reads/CRC32 of the middle section might actually be useful. And CRC32 has a smaller output space, hence collisions have a higher chance of happening.
A better approach might be, for same-size files, to just Seek(FileSize div 2) and read 32 bytes from there. If those are identical to another file's, start a full file comparison and stop as soon as one byte diverges. If multiple files share these same middle bytes, then maybe compute a full SHA-256 for each file and compare those.
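A rough sketch of that scheme for two same-size files (hypothetical names $a and $b; dd with bs=1 is slow in general, but fine for just 32 bytes):

  size=$(stat -c %s "$a")
  mid_a=$(dd if="$a" bs=1 skip=$(( size / 2 )) count=32 2>/dev/null | od -An -tx1)
  mid_b=$(dd if="$b" bs=1 skip=$(( size / 2 )) count=32 2>/dev/null | od -An -tx1)
  if [ "$mid_a" = "$mid_b" ]; then
      # full comparison; cmp stops at the first differing byte
      cmp -s "$a" "$b" && echo "duplicates"
  fi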
Also, as other commenters pointed out, you might have the same content but different metadata (videos, pictures, etc.), so that needs to be handled as well.
I wrote my own duplicate file finder way back in the days.
I did the obvious trick of binning by size before trying to compute any hashes, and was mildly surprised to find how few out of my ~million files had exactly the same size.
For multiple files with identical size I just did the full file MD5, we only had HDD's back then and we all know how much they like random access.
I wrote one too, over 20 years ago. That .exe still works, even today. Unsurprisingly, I was using CRC32 too. When I look at the code now I cringe at what a mess it is. Oh well, everyone has to start somewhere.