paikkos at googlemail.com
Wed Jan 13 14:58:59 GMT 2010
2010/1/13 Avi Greenbury <avismailinglistaccount at googlemail.com>:
> You might've missed his point.
> If two files are of different sizes, they cannot be identical. Getting
> the size of a file is substantially cheaper than hashing it.
> So you check all your filesizes, and need only hash those pairs or
> groups that are all the same size.
Sorry, I guess I didn't make myself clear. I need to store the SHA
digest in an SQLite file. I have a few files to handle now, but I will
get a constant dribble from now on. I want to ensure that a file I
process in the future is not one I have already databased.
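To make the idea concrete, here is a minimal sketch in Python rather than a finished tool. The table name seen_files and the helper names sha1_of_file, open_store and record_if_new are made up for illustration, and it assumes the digest column alone is enough to identify a file.

```python
import hashlib
import sqlite3


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, read in binary chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def open_store(db_path=":memory:"):
    """Open (or create) the SQLite store; the schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_files ("
        " digest TEXT PRIMARY KEY,"
        " first_name TEXT NOT NULL)"
    )
    return conn


def record_if_new(conn, path):
    """Store the file's digest; return True if new, False if already stored."""
    digest = sha1_of_file(path)
    try:
        # 'with conn' commits on success and rolls back on error.
        with conn:
            conn.execute(
                "INSERT INTO seen_files (digest, first_name) VALUES (?, ?)",
                (digest, path),
            )
    except sqlite3.IntegrityError:
        # The PRIMARY KEY constraint rejected a digest we already hold.
        return False
    return True
```

Each new arrival would go through record_if_new(conn, path); a False result means a file with the same digest was stored earlier, whatever its name was.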
Incidentally, I get poor results from MD5 compared with SHA, so I can't
rely on MD5 for this.
MD5 (md5_base64) results:
mr_485_htu_AST.pdf  116caa6cc1705db23a36feb11c8c4113  32
MR_2891.pdf         01f73c142dae9f9f403bbab543b6aa6f  32
duplicate.pdf       01f73c142dae9f9f403bbab543b6aa6f  32
MR_2898.pdf         01f73c142dae9f9f403bbab543b6aa6f  32
PR_A02.pdf          5552e6587357f9967dc0bc83153cca63  32
mr_485_htu_hrt.pdf  116caa6cc1705db23a36feb11c8c4113  32
PR_A01.pdf          5552e6587357f9967dc0bc83153cca63  32
SHA (b64digest) results:
mr_485_htu_AST.pdf  PqsBpkKgGxdEHvkoNyou1NV5kuY  27
MR_2891.pdf         bQhWA445KFzXy6ldF/DSoG2xTEY  27
duplicate.pdf       bQhWA445KFzXy6ldF/DSoG2xTEY  27
MR_2898.pdf         ULBRZQB00qZIfIWD7oqdpfVpFtw  27
PR_A02.pdf          6LdF6sWZnyLdWj44inFI6MSaUY4  27
mr_485_htu_hrt.pdf  0VNwG7IiaIneEX3jh3SBUBaXMK0  27
PR_A01.pdf          JS33nJhzTo9YTqRWe01xnOb6bEM  27
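The trailing number on each line is the digest length. As a rough check of what those lengths imply, the Python sketch below (not the code that produced the listings) computes the lengths of the common encodings. It assumes the SHA in question is SHA-1 and that the base64 form carries no "=" padding; the input bytes are arbitrary.

```python
import base64
import hashlib

data = b"example bytes"  # arbitrary input; digest lengths do not depend on it


def b64_unpadded(raw):
    """Base64-encode raw bytes and drop the trailing '=' padding."""
    return base64.b64encode(raw).decode("ascii").rstrip("=")


md5_hex = hashlib.md5(data).hexdigest()            # 16 bytes as hex
md5_b64 = b64_unpadded(hashlib.md5(data).digest())  # 16 bytes as base64
sha1_b64 = b64_unpadded(hashlib.sha1(data).digest())  # 20 bytes as base64

print(len(md5_hex))   # 32
print(len(md5_b64))   # 22
print(len(sha1_b64))  # 27
```

On that reading, a length of 27 fits an unpadded base64 SHA-1, while 32 fits a hex MD5 rather than a base64 one, so the MD5 listing may not be showing what its label says.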
> Thirdly, be aware of what hashing guarantees. It does *not* guarantee
> uniqueness, it just gives you a very low chance that two files with
> the same hash are different. It does guarantee that files with
> different hashes are different, though.
I think that's the best I can hope for. If that 'duplicate.pdf' turned
up again, at least I'd be able to identify it correctly. That's the goal.
I will give fdupes a look too.
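For a one-off batch, the size-first approach Avi describes could be sketched roughly as below. This is Python for illustration only, with made-up function names, and I am not claiming it is how fdupes works. It groups by size, hashes only the groups with more than one member, and then confirms each hash match with a byte-for-byte comparison, since equal hashes alone don't guarantee equal files.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, read in binary chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Step 1: files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size, no need to hash
        # Step 2: hash only the files that share a size.
        by_digest = defaultdict(list)
        for p in same_size:
            by_digest[file_digest(p)].append(p)
        # Step 3: confirm hash matches by comparing bytes.
        # Simplification: each candidate is compared with the first only.
        for candidates in by_digest.values():
            first = candidates[0]
            confirmed = [first] + [
                p for p in candidates[1:]
                if filecmp.cmp(first, p, shallow=False)
            ]
            if len(confirmed) > 1:
                groups.append(sorted(confirmed))
    return groups
```

With a steady trickle of new files and the digests already in a database, the size check buys less, because each arrival has to be hashed anyway before it can be looked up.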