Hacker News | m-manu's comments

My homepage is on GitHub. Didn't know Cloudflare also provides free hosting. Thanks.


The tool doesn't delete anything -- it just reports. If you're uncomfortable with the "fuzzy hash" approach, use the -thorough command-line option.


A better implementation could perform a full hash (not CRC32, though; maybe even a byte-by-byte comparison) whenever the fuzzy hashes match, which is a low-probability event anyway.
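
As a rough sketch of that two-stage check in Python (the full_sha256 and confirm_duplicates names are hypothetical, not the tool's actual code):

    import hashlib

    def full_sha256(path, chunk_size=1 << 20):
        # Hash the whole file in 1 MiB chunks so large files aren't read into memory at once.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def confirm_duplicates(candidate_paths):
        # candidate_paths: files that share the same fuzzy hash.
        # Re-group them by a full SHA-256 so any fuzzy-hash collisions are weeded out.
        groups = {}
        for path in candidate_paths:
            groups.setdefault(full_sha256(path), []).append(path)
        return [g for g in groups.values() if len(g) > 1]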


As the author of the tool, thanks a lot for the wonderful input! Many of the comments are actionable, and I'll incorporate them into the code soon.

Now to address a few concerns:

# The tool doesn't delete anything -- as the name suggests, it just finds duplicates. Check it out.

# File uniqueness is determined by file extension + file size + CRC32 of the first 4 KiB, the middle 2 KiB, and the last 2 KiB (see the sketch after this list).

# That may not sound like much, but on my portable hard drive with >172K files (a mix of video, audio, pictures, and source code) it produced the same number of collisions as hashing each entire file with SHA-256. (By the way, I'm planning to add an option to the tool to do exactly that.)
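
Roughly, the fingerprint looks like this in Python (a simplified sketch, not the actual code; how files smaller than 8 KiB are handled is glossed over here):

    import os
    import zlib

    def fuzzy_fingerprint(path):
        # Fingerprint = extension + size + CRC32 of the first 4 KiB,
        # the middle 2 KiB and the last 2 KiB of the file.
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            head = f.read(4096)
            f.seek(max(size // 2 - 1024, 0))
            middle = f.read(2048)
            f.seek(max(size - 2048, 0))
            tail = f.read(2048)
        ext = os.path.splitext(path)[1].lower()
        # Note: for files under 8 KiB the three regions overlap; the real tool
        # may treat small files differently.
        return (ext, size, zlib.crc32(head), zlib.crc32(middle), zlib.crc32(tail))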


This sounds interesting and should probably be able to run on a whole system. What if you run it on the OS's own files, e.g. the whole C: drive or wherever the Linux system files are? Will there be any collisions?

How does it handle small files?


Yes, you can run it on a whole hard drive. If you're planning to use it on "low entropy" file formats like header files, use the -thorough option.


The changes have been incorporated, FYI.

