SHA question

Wed Jan 13 16:58:16 GMT 2010

On Wed, Jan 13, 2010 at 07:16, Philip Newton <philip.newton at gmail.com> wrote:
> On Wed, Jan 13, 2010 at 15:58, Dermot <paikkos at googlemail.com> wrote:
>> 2010/1/13 Avi Greenbury <avismailinglistaccount at googlemail.com>:
>>
>>> You might've missed his point.
>>>
>>> If two files are of different sizes, they cannot be identical. Getting
>>> the size of a file is substantially cheaper than hashing it.
>>>
>>> So you check all your filesizes, and need only hash those pairs or
>>> groups that are all the same size.
>>
>> Sorry, I guess I didn't make myself clear. I need to store the SHA in an
>> SQLite file.
>
> I think you're putting the cart before the horse.
>
> Did someone come up to you and say, "Dermot, put the SHA value in a database."?
>
> I would have thought that you *need* to make sure that you detect
> duplicate files (for example, to avoid processing "the same" file
> twice). Storing the SHA in an SQLite file is a method you would *like*
> to use to accomplish this, but may not be the only way nor the best
> way.
>
> Along those lines, you may wish to store the filesize in bytes in your
> database as well, as a first point of comparison; if the filesize is
> unique, then the file must also be unique and you could save yourself
> the time spent calculating a digest of the file's contents -- no
> 1058-byte file can be the same as any 1927-byte file.
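
For illustration, here is a minimal sketch of the size-first approach described above. It is in Python rather than Perl purely for brevity, and the names `sha1_of` and `find_duplicates` are made up for this example, not taken from anyone's actual script. It assumes all candidate files are available at once and that SHA-1 is the digest of choice.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return groups of paths that share both a size and a SHA-1 digest."""
    # Cheap pass: bucket every file by its size in bytes.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        # Expensive pass: hash only files whose size collides.
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[sha1_of(path)].append(path)
        duplicates.extend(g for g in by_digest.values() if len(g) > 1)
    return duplicates
```

Files with a unique size are never read at all, which is where the saving comes from.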

If you're storing the collision data (size, hash, whatever) to protect
against future collisions, the only way this scheme of skipping
expensive ops like hashing will work (AFAICS) is if you have some
fiddlier code that lazily hashes an old file when a newer file comes
along with the same size.
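
A rough sketch of what that lazy hashing might look like, in Python with the standard `sqlite3` module purely for illustration. The table layout and the names `open_store` and `add_file` are assumptions made up for this example. It also assumes the older files are still on disk and unchanged when a size match finally turns up, which is exactly the fiddly part.

```python
import hashlib
import os
import sqlite3


def sha1_of(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_store(db_path=":memory:"):
    """Open (or create) the hypothetical store: one row per known file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY, size INTEGER NOT NULL, sha1 TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS files_size ON files (size)")
    return conn


def add_file(conn, path):
    """Record a file and return the stored paths that duplicate it."""
    size = os.path.getsize(path)
    rows = conn.execute(
        "SELECT path, sha1 FROM files WHERE size = ? AND path != ?",
        (size, path),
    ).fetchall()
    upsert = "INSERT OR REPLACE INTO files (path, size, sha1) VALUES (?, ?, ?)"
    if not rows:
        # No stored file has this size: store it unhashed (sha1 is NULL).
        conn.execute(upsert, (path, size, None))
        conn.commit()
        return []

    new_digest = sha1_of(path)
    duplicates = []
    for old_path, old_digest in rows:
        if old_digest is None:
            # Lazily hash the older file now that its size has a match.
            old_digest = sha1_of(old_path)
            conn.execute(
                "UPDATE files SET sha1 = ? WHERE path = ?",
                (old_digest, old_path),
            )
        if old_digest == new_digest:
            duplicates.append(old_path)
    conn.execute(upsert, (path, size, new_digest))
    conn.commit()
    return duplicates
```

If the older files can't be relied on to stick around, the simpler option is to hash everything on arrival and use the size only as an index.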

>> Incidentally, I get poor results from MD5 compared with SHA, so I can't
>> rely on MD5 for
>>
>> MD5 (md5_base64) results:
>> mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
>> MR_2891.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
>> duplicate.pdf        01f73c142dae9f9f403bbab543b6aa6f 32
>> MR_2898.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
>> PR_A02.pdf           5552e6587357f9967dc0bc83153cca63 32
>> mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
>> PR_A01.pdf           5552e6587357f9967dc0bc83153cca63 32
>>
>> SHA (b64digest) results:
>> mr_485_htu_AST.pdf   PqsBpkKgGxdEHvkoNyou1NV5kuY 27
>> MR_2891.pdf          bQhWA445KFzXy6ldF/DSoG2xTEY 27
>> duplicate.pdf        bQhWA445KFzXy6ldF/DSoG2xTEY 27
>> MR_2898.pdf          ULBRZQB00qZIfIWD7oqdpfVpFtw 27
>> PR_A02.pdf           6LdF6sWZnyLdWj44inFI6MSaUY4 27
>> mr_485_htu_hrt.pdf   0VNwG7IiaIneEX3jh3SBUBaXMK0 27
>> PR_A01.pdf           JS33nJhzTo9YTqRWe01xnOb6bEM 27
>
> That's... odd. md5sum's guarantee of "same if the hashes match" isn't
> as strong as SHA's, but I still wouldn't expect two files to md5sum
> the same if their SHA sums don't match.
>
> However, those MD5 sums don't look like base-64 to me, so maybe you're
> doing something wrong somewhere.
>
> Cheers,
> Philip
> --
> Philip Newton <philip.newton at gmail.com>
>
>
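
On the base-64 point: the encoding can be sanity-checked from the digest lengths alone. An MD5 digest is 16 bytes, so it is 32 characters as hex and 22 as unpadded base-64; a SHA-1 digest is 20 bytes, so 40 as hex and 27 as unpadded base-64. The 32-character, all-hex MD5 values in the listing are therefore consistent with hex output rather than base-64, though that says nothing about why they collide. A small Python sketch to confirm the arithmetic (the helper names `unpadded_b64` and `digest_lengths` are made up for this example):

```python
import base64
import hashlib


def unpadded_b64(raw):
    """Base-64 encode raw bytes and drop the trailing '=' padding."""
    return base64.b64encode(raw).decode("ascii").rstrip("=")


def digest_lengths(data):
    """Return the encoded lengths of the MD5 and SHA-1 digests of data."""
    md5 = hashlib.md5(data)
    sha1 = hashlib.sha1(data)
    return {
        "md5_hex": len(md5.hexdigest()),
        "md5_b64": len(unpadded_b64(md5.digest())),
        "sha1_hex": len(sha1.hexdigest()),
        "sha1_b64": len(unpadded_b64(sha1.digest())),
    }


# The lengths depend only on the algorithm, not on the input.
print(digest_lengths(b"any input at all"))
# {'md5_hex': 32, 'md5_b64': 22, 'sha1_hex': 40, 'sha1_b64': 27}
```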