As far as I know, the standard tool for this is rdfind. This new tool claims to be "blazingly fast", so it should provide something to back that claim up. Ideally a comparison with rdfind, but even a basic benchmark would make it less dubious. https://github.com/pauldreik/rdfind
But the main problem is not the suspicious performance, it's the lack of explanation. The tool is supposed to "find duplicate files (photos, videos, music, documents)". Does it mean it is restricted to some file types? Does it find identical photos with different metadata to be duplicates? Compare this with rdfind which clearly describes what it does, provides a summary of its algorithm, and even mentions alternatives.
Overall, it may be a fine toy/hobby project (only 3 commits, 3 months ago); I didn't read the code (except to find the command-line options). I don't get why it got so much attention.
I use and test assorted duplicate finders regularly.
fdupes is the classic (going way way back) but it's really very slow, not worth using anymore.
The four I know of that are worth trying these days (depending on data set, hardware, file arrangement and other factors, any one of them might be fastest for a specific use case) are:
afaik fdupes is super slow because it checksums entire files in order to find duplicates. This causes a ton of unnecessary IO if you have a lot of size collisions.
The efficient way to do things is to just read files in parallel and break once they diverge. Basically how `cmp` works.
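A minimal sketch of that idea in bash (GNU find and coreutils assumed; it only compares neighbouring same-size files, whereas a real tool would handle whole size groups, and filenames with tabs or newlines will break it):

  # list files as "size<TAB>path", sort by size, pair up neighbours of equal size
  find . -type f -printf '%s\t%p\n' | sort -n |
  awk -F'\t' '$1 == prev_size { print prev_file "\t" $2 }
              { prev_size = $1; prev_file = $2 }' |
  while IFS=$'\t' read -r a b; do
      # cmp stops reading as soon as the files diverge
      cmp -s "$a" "$b" && printf 'possible dup: %s == %s\n' "$a" "$b"
  done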
Rdfind is the logical evolution of fdupes. Not only faster but also more clever. This does not mean that fdupes is a bad tool at all, but rdfind can do things that fdupes can't.
Example. For file-X.txt with X = 1 to 10000, you have three copies of each file in:
mydir/file-X.txt
mydir/subdir/file-X.txt and
mydir/subdir/copy/file-X.txt.
Fdupes would delete random files in mydir/, mydir/subdir/ and mydir/subdir/copy/. You would end up with the remaining files scattered all over the directory tree. A mess with three incomplete copies.
Rdfind correctly guesses that what most people want is to remove all the files in two of the directories entirely and keep one copy (files and directory tree) intact. So it wipes the inner subdirs in a predictable way and keeps the outer dir intact. This is a terrific feature, able to disentangle a directory tree that has been cloned and nested inside the original copy without destroying it, like in this case.
Instead of trying to be clever, fclones gives the user a choice. It can select files to remove by their nesting level, creation/update/access time or glob expression (include/exclude). It also allows the user to modify the list of found files before deleting.
jdupes, rdfind, fclones and possibly many others compute checksums of prefixes (and sometimes suffixes) of the files. That reduces the number of files to be fully hashed.
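Roughly, the prefix trick looks like this (a hedged shell sketch, not how any of those tools actually implement it; GNU coreutils assumed, and filenames containing newlines will break it):

  # prefix pass: hash only the first 4 KiB of every file
  find . -type f | while read -r f; do
      printf '%s  %s\n' "$(head -c 4096 "$f" | md5sum | cut -d' ' -f1)" "$f"
  done | sort > /tmp/prefix_hashes.txt

  # full pass: fully hash only files whose 4 KiB prefix hash repeats
  uniq -w 32 -D /tmp/prefix_hashes.txt | sed 's/^[0-9a-f]*  //' | xargs -d '\n' md5sum

Only the second pass reads whole files, which is where the savings come from on large collections.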
Computing incremental checksums in parallel and breaking off once checksums diverge does not work very well on HDDs, because the files can be in distant physical locations, and that would cause a lot of seeks (seeks are terribly slow on rotational drives).
That depends on the distribution of offsets at which a difference is found (if any).
If all equally-sized files on your disk are duplicates, it makes you read both files in their entirety in a scattered fashion. That's (expected to be) slow, indeed.
However, if most of them aren’t, chances are you find a difference in the first block read, which means you have to read only two blocks to decide they’re different.
So, what’s that distribution? Also, where is one most likely to find differences in equally-sized files? First block? Last?
> However, if most of them aren’t, chances are you find a difference in the first block read,
But just to read the first block, you need to open both files and move the heads from one to another at least once. If the files are placed in distant cylinders, this will be a significant amount of time (a few ms).
At the beginning, fclones also processed files in each group matching by size together. But files matching by size do not necessarily live physically close to each other. Later fclones got a great speedup on HDD when I switched to processing the files in their physical order, regardless of their logical grouping. It is a bit more complex to do in terms of programming because in each grouping pass you have to "ungroup" the files to a flat list, sort the list by locations and group again, but it is really worth it (on rotational drives). You can read about this here: https://pkolaczk.github.io/disk-access-ordering/
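A crude shell approximation of that reordering, using inode numbers as a rough stand-in for physical placement (fclones itself uses real physical offsets, as the post explains; /tmp/candidates.txt is a hypothetical list of files to hash):

  # hash candidate files in (roughly) on-disk order instead of group order
  xargs -d '\n' stat -c '%i %n' < /tmp/candidates.txt |
  sort -n | cut -d' ' -f2- |
  while read -r f; do md5sum "$f"; done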
Well yes, that's all true. The approach of alternating reads of blocks is slow only if you need to do it. If there's a diff in the first block (or so) then there's not much to read.
But it's a good example of why there can't be one optimal strategy for finding duplicates. Depending on the data set composition, various implementations turn out to be the fastest. It depends heavily on where most of the files differ (or not).
I think we need a lookup table of marketing speech to real-world performance metrics. Blazingly fast has been showing up a lot lately.
The cynical side of me wants to know what features and safety checks a "blazingly fast" tool has not implemented that the older "glacially slow" tool it is replacing ended up implementing after all the edge conditions were uncovered.
Do you know of any tool that does a good job of finding files that differ only in their metadata or even better can use a perceptual hash to find possible matches? Geeqie's find duplicates seems to do the latter, but afaict you can't run that function from the command line.
Over the years I've used many, many tools intended to solve this problem. In the end, after much frustration, I just use existing tools, glued together in a un*x manner.
find * -type f -exec md5sum '{}' ';' \
  | tee /tmp/index_file.txt \
  | gawk '{print $1}' \
  | sort | uniq -c \
  | gawk '$1 > 1 { print $2 }' \
  > /tmp/duplicates.txt
# hashes that occur more than once are the duplicates; show each group
for m in $( cat /tmp/duplicates.txt )
do
  grep $m /tmp/index_file.txt
  echo ========
done \
  | less
Tweak as necessary. I do have a comparison executable that only compares sizes and sub-portions to save time, but I generally find it's not worth it.
It takes less time to type this than it does to remember what some random other tool is called, or how to use it. I also have saved a variant that identifies similar files, and another that identifies directory structures with lots of shared files, but those are (understandably) more complex (and fragile).
Most programs I tested have very simple basic usage - just the program name and a list of directories. I doubt typing the above would be faster, and figuring it out for the first time definitely wouldn't be. Also, executing that on a million files would take ages, even compared to the slowest proper duplicate finders.
Anyway, thanks for sharing - it is always very exciting to see how far you can go with a few unix utilities and a bit of scripting :)
I haven't used them for quite some time, so I'll have to dig them out. I'll also have to remember their limitations so you can better assess whether they're useful as they stand, or need to be used purely as inspiration.
A shameless plug, but it is a simple (and probably badly written) tool I made many years ago to scratch an itch, and I still use it.
It finds duplicate files and replaces them with hard links, saving you space. Just make sure you provide it with paths in the same filesystem.
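The gist of it is something like this (a simplified sketch rather than duphard's actual code; $orig and $dup stand for two duplicate paths on the same filesystem):

  # only if the contents really match, replace the copy with a hard link;
  # ln will refuse anyway if the two paths are on different filesystems
  cmp -s "$orig" "$dup" && ln -f "$orig" "$dup"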
I originally wrote it to save some space from personal files (videos, photos, etc), but it turned out very useful for tar files, docker images, websites, and more.
For example, I maintain a tar file and a docker image with Kafka connectors which share many jar files. Using duphard I can save hundreds of megabytes, or even more than a gigabyte! For a documentation website with many copies of the same image (let's just say some static generators favor this practice for maintaining multiple versions), I can reduce the website size by 60%+, which then makes ssh copies, docker pulls, etc. way faster, speeding up deployment times.
Unfortunately fdupes does not do this, or at least nothing in its docs points to this functionality.
I say unfortunately because before writing duphard, I tested fdupes and a couple other utilities (duff and duperemove) but none offered the functionality I needed.
This program uses CRC32 to compute hashes. This is a terrible idea - a 32-bit hash is just too short and the probability of collisions is way too high. Around 77,000 files are already enough to get a 50% probability of at least one collision. Even though this is reduced by the additional matching on extension and size, I wouldn't trust it to delete any files.
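That figure is just the birthday bound for a 32-bit hash; assuming uniformly distributed hashes, a quick sanity check:

  # files needed for ~50% chance of at least one 32-bit collision: sqrt(2 * 2^32 * ln 2)
  awk 'BEGIN { printf "%d\n", sqrt(2 * 2^32 * log(2)) }'    # prints ~77163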
Use fclones, fslint, jdupes, rdfind instead, which either use much stronger hashes (128-bit) or even verify files by direct byte-to-byte comparison.
AFAIK CRCs are not the fastest "hashes" you can get. Some non-cryptographic hashes outperform CRCs by a large factor and provide longer checksums and much better statistical properties.
I've been using 'czkawka' since its earliest inception. It seems to do a similar task using file hashes, but it can also do similarity matching on pictures and the like.
https://github.com/qarmin/czkawka
As the author of the tool, thanks a lot for the wonderful input! Many comments are actionable; I'll incorporate them into the code soon.
Now to address a few concerns:
# The tool doesn't delete anything -- As the name suggests, it just finds duplicates. Check it out.
# File uniqueness is determined by file extension + file size + CRC32 of the first 4 KiB, middle 2 KiB and last 2 KiB (a rough sketch of this is below)
# That may not seem like much, but on my portable hard drive with >172K files (a mix of video, audio, pics and source code), I got the same number of collisions as with a SHA-256 of the entire file (by the way, I'm planning to add an option in the tool to do this)
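A rough shell equivalent of that fingerprint (just an illustration, not the tool's actual code; cksum's CRC stands in for the CRC32 implementation used, and the path is only an example):

  f=some/video.mp4                                   # hypothetical example file
  size=$(stat -c %s "$f")
  first=$(head -c 4096 "$f" | cksum | cut -d' ' -f1)
  middle=$(tail -c +$(( size / 2 )) "$f" | head -c 2048 | cksum | cut -d' ' -f1)
  last=$(tail -c 2048 "$f" | cksum | cut -d' ' -f1)
  echo "${f##*.} $size $first $middle $last"         # extension + size + three CRCs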
This sounds interesting and should probably be able to run on a whole system. What if you run it on the files of the OS itself, e.g. the whole C drive or wherever the Linux system files live? Will there be any collisions?
FWIW if people are interested, I wrote https://github.com/karteum/kindfs for the purpose of indexing the hard drive, with the following goals:
* being able to detect not only duplicate files but also duplicate dirs (without returning all their sub-contents as duplicates)
* being able to query multiple times without having to re-scan, and to do other types of queries (i.e. I am computing a hash on all files, not only of those with duplicate sizes. This makes scanning slower but enables other use-cases. N.b. beware that I only hash fixed portions of files for files>3MB, which is enough for my use-case considering that I always triple-check the results and is a reasonable tradeoff for performance, but it might not be OK for everyone !)
* being able to tell whether all files in dir1/ are included in dir2/ (regardless of file/dir structure)
* being able to mount the sqlite index as a FUSE FS (which is convenient for e.g. diff -r or qdirstat...)
Still work-in-progress, yet it works for several of my use-cases
The first one I found, and still use now that it's obvious fslint is EOL, is czkawka [0] (meaning hiccup in Polish). Its speed is an order of magnitude higher than fslint's, and its memory use is 20%-75% of it.
<;)> Satisfied customer, would buy it again. </;)>
On Windows if you download the same file more than once you will have foo.doc, "foo (1).doc", "foo (2).doc" etc. A script that just looked for files with such names, compared them to foo.doc, and deleted them if they are the same would be useful.
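Something along these lines would cover the common case (a hedged sketch for a POSIX-ish shell such as WSL or Git Bash, not a polished Windows solution; filenames containing newlines will break it):

  # for every "name (N).ext", delete it if it is byte-identical to "name.ext"
  find . -type f -name '* ([0-9]*)*' | while read -r dup; do
      orig=$(printf '%s\n' "$dup" | sed 's/ ([0-9][0-9]*)//')
      [ -f "$orig" ] && cmp -s "$orig" "$dup" && rm -v "$dup"
  done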
Does it only find duplicate files or will it also find duplicate directory hierarchies?
Example:
/some/location/one/January/Photos
/some/location/two/January/Photos
I need a tool that would return a match on January directory.
It would be great to be able to filter things. So for example, if I have backups of my dev folder, I want to filter out all the virtual envs (venv below):
/home/HumblyTossed/dev/venv/bin
/home/HumblyTossed/backups/dev/venv/bin
If you want something that scales horizontally (mostly), dcmp from https://github.com/hpc/mpifileutils is an option. It can chunk up files and do the comparison in parallel on multiple servers.
A better implementation could be to perform a full hash (not CRC32 though; maybe even a byte-by-byte comparison) when the fuzzy hashes match, which happens with small probability anyway.
No. CRC32 doesn't care what you throw at it. It is related to the speed of building a "database", for lack of a better word, of the file. Instead of computing CRC32 over the entire file, you just take chunks of it, increasing the speed. However, this approach is definitely flawed, as there are plenty of file types that have identical beginnings and ends, so only the reads/CRC32 of the middle section might actually be useful. And CRC32 has a smaller output space, hence collisions have a higher chance of happening.
A better approach might be, for same-size files, to just Seek(FileSize div 2) and read 32 bytes from there. If those are identical to another file's, start a full file comparison and stop as soon as one byte diverges. If multiple files share these same middle bytes, then maybe compute a full SHA-256 for each file and compare those.
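A rough sketch of that scheme for two same-size files (hypothetical names $a and $b; dd with bs=1 is slow in general, but fine for just 32 bytes):

  size=$(stat -c %s "$a")
  mid_a=$(dd if="$a" bs=1 skip=$(( size / 2 )) count=32 2>/dev/null | od -An -tx1)
  mid_b=$(dd if="$b" bs=1 skip=$(( size / 2 )) count=32 2>/dev/null | od -An -tx1)
  if [ "$mid_a" = "$mid_b" ]; then
      # full comparison; cmp stops at the first differing byte
      cmp -s "$a" "$b" && echo "duplicates"
  fi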
Also, as other commenters pointed out, you might have the same content but different metadata (videos, pictures, etc.), so that needs to be handled as well.
I wrote my own duplicate file finder way back in the days.
I did the obvious trick of binning by size before trying to compute any hashes, and was mildly surprised to find how few out of my ~million files had exactly the same size.
For multiple files with identical size I just did the full file MD5, we only had HDD's back then and we all know how much they like random access.
I wrote one too, over 20 years ago. That .exe still works, even today. Unsurprisingly, I was using CRC32 too. When I look at the code now I cringe at what a mess it is. Oh well, everyone has to start somewhere.