Hacker News | m-manu's comments

My homepage is on GitHub. Didn't know Cloudflare also provides free hosting. Thanks.


The tool doesn't delete anything -- it just reports. If you're uncomfortable with the "fuzzy hash" approach, use the -thorough command-line option.


A better implementation could perform a full hash (not CRC32, though; maybe even a byte-by-byte comparison) whenever the fuzzy hashes match, which is a low-probability event anyway.
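
As a rough sketch of that two-stage check in Python (the full_sha256 and confirm_duplicates names are hypothetical, not the tool's actual code):

    import hashlib

    def full_sha256(path, chunk_size=1 << 20):
        # Hash the whole file in 1 MiB chunks so large files aren't read into memory at once.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def confirm_duplicates(candidate_paths):
        # candidate_paths: files that share the same fuzzy hash.
        # Re-group them by a full SHA-256 so any fuzzy-hash collisions are weeded out.
        groups = {}
        for path in candidate_paths:
            groups.setdefault(full_sha256(path), []).append(path)
        return [g for g in groups.values() if len(g) > 1]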


As the author of the tool, thanks a lot for the wonderful input! Many of the comments are actionable, and I'll incorporate them into the code soon.

Now to address a few concerns:

# The tool doesn't delete anything -- as the name suggests, it just finds duplicates. Check it out.

# File uniqueness is determined by file extension + file size + CRC32 of the first 4 KiB, the middle 2 KiB, and the last 2 KiB (see the sketch after this list).

# That may not sound like much, but on my portable hard drive with >172K files (a mix of video, audio, pictures, and source code) it produced the same number of collisions as hashing each entire file with SHA-256. (By the way, I'm planning to add an option to the tool to do exactly that.)
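
Roughly, the fingerprint looks like this in Python (a simplified sketch, not the actual code; how files smaller than 8 KiB are handled is glossed over here):

    import os
    import zlib

    def fuzzy_fingerprint(path):
        # Fingerprint = extension + size + CRC32 of the first 4 KiB,
        # the middle 2 KiB and the last 2 KiB of the file.
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            head = f.read(4096)
            f.seek(max(size // 2 - 1024, 0))
            middle = f.read(2048)
            f.seek(max(size - 2048, 0))
            tail = f.read(2048)
        ext = os.path.splitext(path)[1].lower()
        # Note: for files under 8 KiB the three regions overlap; the real tool
        # may treat small files differently.
        return (ext, size, zlib.crc32(head), zlib.crc32(middle), zlib.crc32(tail))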


This sounds interesting and should probably be able to run on a whole system. What if you run it on the OS's own files, e.g. the whole C: drive or wherever the Linux system files are? Will there be any collisions?

How does it handle small files?


Yes, you can run it on a whole hard drive. If you're planning to use it on "low entropy" file formats like header files, use the -thorough option.


The changes have been incorporated, FYI.

