SHA question

Dermot paikkos at
Wed Jan 13 14:58:59 GMT 2010

2010/1/13 Avi Greenbury <avismailinglistaccount at>:

> You might've missed his point.
> If two files are of different sizes, they cannot be identical. Getting
> the size of a file is substantially cheaper than hashing it.
> So you check all your filesizes, and need only hash those pairs or
> groups that are all the same size.
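
A minimal sketch of the size-first approach quoted above, written in Python rather than the Perl modules discussed in this thread. The function names (`sha1_of`, `candidate_duplicates`) are illustrative assumptions, not anything from the original posts.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=65536):
    """Return the hex SHA-1 of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def candidate_duplicates(paths):
    """Group files by size first, then hash only the groups that share a size."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    by_digest = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        for p in same_size:
            by_digest[(size, sha1_of(p))].append(p)

    # Only groups with more than one member are possible duplicates.
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Files whose size is unique are never read at all, which is where the saving comes from.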

Sorry, I guess I didn't make myself clear. I need to store the SHA in an
SQLite file. I have a few files to handle now, but I will get a
constant dribble from now on. I want to try to ensure that I haven't
already databased a file that I'll process in the future.
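
One way this could look, as a rough sketch in Python rather than Perl: the digest is stored as the primary key, so a file that turns up again is recognised by a lookup. The table name `files`, its columns, and the helper names are assumptions made for illustration. The digest is unpadded base64 SHA-1, which comes to 27 characters.

```python
import base64
import hashlib
import sqlite3


def b64_sha1(path, chunk_size=65536):
    """Return the SHA-1 of a file as base64 with the '=' padding stripped."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return base64.b64encode(h.digest()).decode("ascii").rstrip("=")


def open_db(db_path):
    """Open (or create) the database; the schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "digest TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    return conn


def record_if_new(conn, path):
    """Insert the file's digest and return True, or return False if already stored."""
    digest = b64_sha1(path)
    seen = conn.execute(
        "SELECT 1 FROM files WHERE digest = ?", (digest,)
    ).fetchone()
    if seen is not None:
        return False
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO files (digest, name) VALUES (?, ?)", (digest, path)
        )
    return True
```

Calling `record_if_new(conn, path)` for each incoming file would then report whether its content has been seen before, regardless of the file's name.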

Incidentally, I get poor results from MD5 compared with SHA, so I can't
rely on MD5 for this.

MD5 (md5_base64) results:
mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
MR_2891.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
duplicate.pdf        01f73c142dae9f9f403bbab543b6aa6f 32
MR_2898.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
PR_A02.pdf           5552e6587357f9967dc0bc83153cca63 32
mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
PR_A01.pdf           5552e6587357f9967dc0bc83153cca63 32

SHA (b64digest) results:
mr_485_htu_AST.pdf   PqsBpkKgGxdEHvkoNyou1NV5kuY 27
MR_2891.pdf          bQhWA445KFzXy6ldF/DSoG2xTEY 27
duplicate.pdf        bQhWA445KFzXy6ldF/DSoG2xTEY 27
MR_2898.pdf          ULBRZQB00qZIfIWD7oqdpfVpFtw 27
PR_A02.pdf           6LdF6sWZnyLdWj44inFI6MSaUY4 27
mr_485_htu_hrt.pdf   0VNwG7IiaIneEX3jh3SBUBaXMK0 27
PR_A01.pdf           JS33nJhzTo9YTqRWe01xnOb6bEM 27

> Thirdly, be aware of what hashing guarantees. It does *not* guarantee
> uniqueness, it just gives you a very low chance that two files with
> the same hash are different. It does guarantee that files with
> different hashes are different, though.
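
Where certainty matters, a matching hash can be confirmed with a byte-for-byte comparison. A minimal Python sketch, with `truly_identical` as a hypothetical helper name:

```python
import filecmp


def truly_identical(path_a, path_b):
    """Compare two files by content, not just by their stat() signature."""
    # shallow=False makes filecmp read and compare the file contents.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This only needs to run on pairs whose hashes already match, so the extra cost is small.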

I think that's the best I can hope for. If that 'duplicate.pdf' turned
up again, at least I'd be able to correctly identify it. That's the goal.
I will give fdupes a look too.
Thanks all.
