Great work! Thanks for open sourcing this - it's very educational.
At the moment I'm using it to process a few hundred gigs of song files that have piled up into a big furry hairball of a mess over the years. Having had multiple iPods and MP3 players, not doing very good housekeeping when moving from one to the other, and avoiding things like iTunes where possible, I now have a lot of files that may contain duplicate songs - but the filenames and organization don't necessarily reflect that.
So I'm using dejavu right now to clean this up. I'm assuming you'd be happy to have a "find_duplicates.py" script added; if so, I'll let you know as soon as I have one working. ;)
I'd be curious as well to see how the performance holds up getting into the terabytes, as I haven't tested that. Remember too that there are a lot of parameters for the matching algorithm here (https://github.com/worldveil/dejavu/blob/master/dejavu/finge...) which allow you to trade off accuracy, speed, and storage in different ways. I've tried to document it thoroughly.
Finding duplicates is a great one! Actually, generating a checksum for each audio file (minus the header and ID3 tags) and adding it as a column in the songs table, for all the different filetypes Dejavu supports (mp3, wav, etc.), would probably be the best way to do this.
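A rough sketch of the checksum idea for the MP3 case, assuming the only metadata to skip is a leading ID3v2 tag and a trailing 128-byte ID3v1 tag. The names `strip_id3` and `audio_checksum` are made up for illustration and are not part of Dejavu; other containers (wav, etc.) would need their own header handling.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return the bytes of an MP3 file without ID3v2 (leading) and ID3v1 (trailing) tags."""
    start = 0
    # ID3v2: "ID3", 2 version bytes, 1 flags byte, 4-byte syncsafe size (7 bits per byte).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = (
            (data[6] & 0x7F) << 21
            | (data[7] & 0x7F) << 14
            | (data[8] & 0x7F) << 7
            | (data[9] & 0x7F)
        )
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag adds another 10 bytes
            start += 10
    end = len(data)
    # ID3v1: fixed 128-byte block at the end of the file, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_checksum(data: bytes) -> str:
    """SHA-1 of the audio payload only, so retagged copies hash the same."""
    return hashlib.sha1(strip_id3(data)).hexdigest()
```

Two copies of the same file that differ only in their tags would then get the same checksum, but a re-encode at a different bitrate would not, so this only catches byte-identical audio.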
I say this because so many songs today are built on sampling. Mashups and EDM often sample from other work, so fingerprints and their alignment can be shared across different songs. Something more clever, like computing the percentage of hashes two songs have in common and comparing it to a threshold, might do the trick, though.
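The threshold idea could look something like the sketch below. It assumes each song's fingerprint hashes have already been pulled out into a plain Python set; the function names, the choice to normalise by the smaller set, and the 0.8 default are all placeholders for illustration, not anything Dejavu provides or recommends.

```python
from itertools import combinations


def hash_overlap(a: set, b: set) -> float:
    """Fraction of the smaller song's hashes that also appear in the other song."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def likely_duplicates(fingerprints: dict, threshold: float = 0.8) -> list:
    """Return (song_a, song_b, overlap) for every pair at or above the threshold.

    `fingerprints` maps a song name to the set of its hashes (assumed input shape).
    """
    pairs = []
    for name_a, name_b in combinations(sorted(fingerprints), 2):
        score = hash_overlap(fingerprints[name_a], fingerprints[name_b])
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs
```

A track that merely samples another should share only a small fraction of hashes and stay under the threshold, while a true duplicate should share most of them. Comparing every pair is quadratic in the number of songs, so a large collection would want the overlap counted in the database instead.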
Thanks again!