paulm at paulm.com
Wed Jan 13 16:58:16 GMT 2010
On Wed, Jan 13, 2010 at 07:16, Philip Newton <philip.newton at gmail.com> wrote:
> On Wed, Jan 13, 2010 at 15:58, Dermot <paikkos at googlemail.com> wrote:
>> 2010/1/13 Avi Greenbury <avismailinglistaccount at googlemail.com>:
>>> You might've missed his point.
>>> If two files are of different sizes, they cannot be identical. Getting
>>> the size of a file is substantially cheaper than hashing it.
>>> So you check all your filesizes, and need only hash those pairs or
>>> groups that are all the same size.
>> Sorry guess I didn't make myself clear. I need to store the SHA in an
>> SQLite file.
> I think you're putting the cart before the horse.
> Did someone come up to you and say, "Dermot, put the SHA value in a database."?
> I would have thought that you *need* to make sure that you detect
> duplicate files (for example, to avoid processing "the same" file
> twice). Storing the SHA in an SQLite file is a method you would *like*
> to use to accomplish this, but may not be the only way nor the best.
> Along those lines, you may wish to store the filesize in bytes in your
> database as well, as a first point of comparison; if the filesize is
> unique, then the file must also be unique and you could save yourself
> the time spent calculating a digest of the file's contents -- no
> 1058-byte file can be the same as any 1927-byte file.
If you're storing the collision data (size, hash, whatever) to protect
against future collisions, the only way this scheme of avoiding more
expensive ops like hashing will work (AFAICS) is with some fiddlier
code that lazily hashes an old file when a newer file comes along
whose size matches an existing one.
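
To make the "fiddlier code" concrete, here is a rough sketch of what I
mean, in Python rather than Perl purely for illustration. The class
name LazyDedupIndex, the helper sha1_of and the in-memory dict are all
made up for this example; a real version would presumably keep the
same columns (path, size, nullable digest) in the SQLite file.

```python
import hashlib
import os


def sha1_of(path, chunk_size=1 << 20):
    """Hex SHA-1 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


class LazyDedupIndex:
    """Hypothetical index: hashes a file only once its size collides."""

    def __init__(self):
        # size in bytes -> list of [path, digest or None]
        self.by_size = {}

    def add(self, path):
        """Return the path of an earlier identical file, or None if new."""
        size = os.path.getsize(path)
        entries = self.by_size.setdefault(size, [])
        if not entries:
            # First file of this size: store it without hashing.
            entries.append([path, None])
            return None
        digest = sha1_of(path)
        for entry in entries:
            if entry[1] is None:
                # Lazily hash the older file now that its size has collided.
                entry[1] = sha1_of(entry[0])
            if entry[1] == digest:
                return entry[0]
        entries.append([path, digest])
        return None
```

The catch is that the old file still has to be readable when the
collision turns up. If files can vanish after processing, you have to
hash everything up front anyway.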
>> Incidentally, I get poor results from MD5 compared with SHA, so I can't
>> rely on MD5 for
>> MD5 (md5_base64) results:
>> mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32
>> MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
>> duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
>> MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
>> PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32
>> mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32
>> PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32
>> SHA (b64digest) results:
>> mr_485_htu_AST.pdf PqsBpkKgGxdEHvkoNyou1NV5kuY 27
>> MR_2891.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
>> duplicate.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
>> MR_2898.pdf ULBRZQB00qZIfIWD7oqdpfVpFtw 27
>> PR_A02.pdf 6LdF6sWZnyLdWj44inFI6MSaUY4 27
>> mr_485_htu_hrt.pdf 0VNwG7IiaIneEX3jh3SBUBaXMK0 27
>> PR_A01.pdf JS33nJhzTo9YTqRWe01xnOb6bEM 27
> That's... odd. md5sum's guarantee of "same if the hashes match" isn't
> as strong as SHA's, but I still wouldn't expect two files to md5sum
> the same if their SHA sums don't match.
> However, those MD5 sums don't look like base-64 to me, so maybe you're
> doing something wrong somewhere.
> Philip Newton <philip.newton at gmail.com>
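
A quick way to check the encoding suspicion is to compare string
lengths. An MD5 digest is 16 bytes and a SHA-1 digest is 20 bytes, so
the lengths below follow directly. This is a sketch in Python for
illustration only; b64_unpadded is a made-up helper that strips the
"=" padding, which as far as I know is what Perl's md5_base64 and
b64digest return.

```python
import base64
import hashlib


def b64_unpadded(raw):
    """Base64-encode raw bytes and strip the trailing '=' padding."""
    return base64.b64encode(raw).decode("ascii").rstrip("=")


data = b"example file contents"  # arbitrary sample input
md5 = hashlib.md5(data)
sha1 = hashlib.sha1(data)

lengths = {
    "md5 hex": len(md5.hexdigest()),                 # 16 bytes -> 32 chars
    "md5 base64": len(b64_unpadded(md5.digest())),   # 16 bytes -> 22 chars
    "sha1 hex": len(sha1.hexdigest()),               # 20 bytes -> 40 chars
    "sha1 base64": len(b64_unpadded(sha1.digest())), # 20 bytes -> 27 chars
}

for name, length in lengths.items():
    print(f"{name}: {length}")
```

So the 27-character SHA values above are consistent with unpadded
base64 SHA-1, while the 32-character MD5 values, which contain only
0-9 and a-f, look like hex output rather than base64.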