philip.newton at gmail.com
Wed Jan 13 15:16:02 GMT 2010
On Wed, Jan 13, 2010 at 15:58, Dermot <paikkos at googlemail.com> wrote:
> 2010/1/13 Avi Greenbury <avismailinglistaccount at googlemail.com>:
>> You might've missed his point.
>> If two files are of different sizes, they cannot be identical. Getting
>> the size of a file is substantially cheaper than hashing it.
>> So you check all your filesizes, and need only hash those pairs or
>> groups that are all the same size.
> Sorry guess I didn't make myself clear. I need to store the SHA in an
> SQLite file.
I think you're putting the cart before the horse.
Did someone come up to you and say, "Dermot, put the SHA value in a database."?
I would have thought that you *need* to make sure that you detect
duplicate files (for example, to avoid processing "the same" file
twice). Storing the SHA in an SQLite file is a method you would *like*
to use to accomplish this, but may not be the only way nor the best.
Along those lines, you may wish to store the filesize in bytes in your
database as well, as a first point of comparison; if the filesize is
unique, then the file must also be unique and you could save yourself
the time spent calculating a digest of the file's contents -- no
1058-byte file can be the same as any 1927-byte file.
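
As a rough sketch of that size-first idea (in Python rather than Perl, purely for illustration): group the files by size, then hash only the groups with more than one member. The names sha1_of and find_duplicates are made up for this example, and SHA-1 is an assumption; any digest would do.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group files by size, then hash only sizes shared by two or more files."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue  # unique size: cannot be a duplicate, so skip the hash
        for path in same_size:
            by_digest[(size, sha1_of(path))].append(path)

    # Keep only groups where at least two files share size and digest.
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Calling find_duplicates on a list of paths returns a list of groups, each group being the files that share both size and digest. Files with a unique size are never opened at all, which is where the saving comes from.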
> Incidentally, I get poor results from MD5 compared with SHA, so I can't
> rely on MD5 for
> MD5 (md5_base64) results:
> mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32
> MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
> duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
> MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
> PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32
> mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32
> PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32
> SHA (b64digest) results:
> mr_485_htu_AST.pdf PqsBpkKgGxdEHvkoNyou1NV5kuY 27
> MR_2891.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
> duplicate.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
> MR_2898.pdf ULBRZQB00qZIfIWD7oqdpfVpFtw 27
> PR_A02.pdf 6LdF6sWZnyLdWj44inFI6MSaUY4 27
> mr_485_htu_hrt.pdf 0VNwG7IiaIneEX3jh3SBUBaXMK0 27
> PR_A01.pdf JS33nJhzTo9YTqRWe01xnOb6bEM 27
That's... odd. md5sum's guarantee of "same if the hashes match" isn't
as strong as SHA's, but I still wouldn't expect two files to md5sum
the same if their SHA sums don't match.
However, those MD5 sums don't look like base-64 to me, so maybe you're
doing something wrong somewhere.
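
To illustrate what I mean about the encoding, here is a small sketch (Python again, with made-up input data and a hypothetical helper looks_like_hex). An MD5 digest is 16 bytes, so it comes to 32 characters in hex but only 22 in unpadded base-64; a SHA-1 digest is 20 bytes, which is 27 characters in unpadded base-64. That matches the 27 in your SHA listing, and suggests the 32-character MD5 values are hex.

```python
import base64
import hashlib
import string

data = b"example file contents"  # stand-in data, not one of the PDFs

md5_raw = hashlib.md5(data).digest()    # 16 raw bytes
sha1_raw = hashlib.sha1(data).digest()  # 20 raw bytes

md5_hex = md5_raw.hex()
# Strip the "=" padding to mimic an unpadded base-64 digest.
md5_b64 = base64.b64encode(md5_raw).decode("ascii").rstrip("=")
sha1_b64 = base64.b64encode(sha1_raw).decode("ascii").rstrip("=")


def looks_like_hex(text):
    """True if every character is a hexadecimal digit."""
    return bool(text) and all(ch in string.hexdigits for ch in text)


print(len(md5_hex), len(md5_b64), len(sha1_b64))  # 32 22 27
print(looks_like_hex("01f73c142dae9f9f403bbab543b6aa6f"))  # True
```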
Philip Newton <philip.newton at gmail.com>