SHA question

Philip Newton philip.newton at gmail.com
Wed Jan 13 15:16:02 GMT 2010


On Wed, Jan 13, 2010 at 15:58, Dermot <paikkos at googlemail.com> wrote:
> 2010/1/13 Avi Greenbury <avismailinglistaccount at googlemail.com>:
>
>> You might've missed his point.
>>
>> If two files are of different sizes, they cannot be identical. Getting
>> the size of a file is substantially cheaper than hashing it.
>>
>> So you check all your filesizes, and need only hash those pairs or
>> groups that are all the same size.
>
> Sorry guess I didn't make myself clear. I need to store the SHA in an
> SQLite file.

I think you're putting the cart before the horse.

Did someone come up to you and say, "Dermot, put the SHA value in a database."?

I would have thought that you *need* to make sure that you detect
duplicate files (for example, to avoid processing "the same" file
twice). Storing the SHA in an SQLite file is a method you would *like*
to use to accomplish this, but may not be the only way nor the best
way.

Along those lines, you may wish to store the filesize in bytes in your
database as well, as a first point of comparison; if the filesize is
unique, then the file must also be unique and you could save yourself
the time spent calculating a digest of the file's contents -- no
1058-byte file can be the same as any 1927-byte file.
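
As a rough sketch of that size-first idea (in Python purely for
illustration; the function names find_duplicates and sha1_of_file are
made up here, and this is not what your script currently does), you
would group files by size and hash only the groups that contain more
than one file:

```python
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Return lists of paths whose contents hash identically."""
    # Cheap first pass: bucket the files by size in bytes.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size means unique content; skip hashing
        # Expensive second pass, only for files that share a size.
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[sha1_of_file(path)].append(path)
        duplicates.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return duplicates
```

A file whose size is unique never gets opened at all, which is where
the saving comes from.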

> Incidentally, I get poor results from MD5 compared with SHA, so I can't
> rely on MD5 for
>
> MD5 (md5_base64) results:
> mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
> MR_2891.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
> duplicate.pdf        01f73c142dae9f9f403bbab543b6aa6f 32
> MR_2898.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
> PR_A02.pdf           5552e6587357f9967dc0bc83153cca63 32
> mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
> PR_A01.pdf           5552e6587357f9967dc0bc83153cca63 32
>
> SHA (b64digest) results:
> mr_485_htu_AST.pdf   PqsBpkKgGxdEHvkoNyou1NV5kuY 27
> MR_2891.pdf          bQhWA445KFzXy6ldF/DSoG2xTEY 27
> duplicate.pdf        bQhWA445KFzXy6ldF/DSoG2xTEY 27
> MR_2898.pdf          ULBRZQB00qZIfIWD7oqdpfVpFtw 27
> PR_A02.pdf           6LdF6sWZnyLdWj44inFI6MSaUY4 27
> mr_485_htu_hrt.pdf   0VNwG7IiaIneEX3jh3SBUBaXMK0 27
> PR_A01.pdf           JS33nJhzTo9YTqRWe01xnOb6bEM 27

That's... odd. md5sum's guarantee of "same if the hashes match" isn't
as strong as SHA's, but I still wouldn't expect two files to md5sum
the same if their SHA sums don't match.

However, those MD5 sums don't look like base-64 to me, so maybe you're
doing something wrong somewhere.
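
To show what I mean about the encoding: an MD5 digest is 16 bytes,
which is 32 characters as hex but only 22 as unpadded base-64, while a
20-byte SHA-1 digest comes to 27 base-64 characters, matching the
length in your SHA column. A small illustration (in Python, with an
arbitrary input string; this is only about the lengths and says
nothing about what your code is actually calling):

```python
import base64
import hashlib

data = b"example"  # arbitrary input, just to have something to hash

md5_raw = hashlib.md5(data).digest()    # 16 raw bytes
sha1_raw = hashlib.sha1(data).digest()  # 20 raw bytes

md5_hex = md5_raw.hex()
# Strip the "=" padding to get the unpadded base-64 form.
md5_b64 = base64.b64encode(md5_raw).decode("ascii").rstrip("=")
sha1_b64 = base64.b64encode(sha1_raw).decode("ascii").rstrip("=")

print(len(md5_hex), len(md5_b64), len(sha1_b64))  # 32 22 27
```

Your MD5 column is 32 characters long and uses only 0-9 and a-f, so it
looks like hex output rather than base-64.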

Cheers,
Philip
-- 
Philip Newton <philip.newton at gmail.com>


