SHA question

Wed Jan 13 17:48:25 GMT 2010

2010/1/13 Paul Makepeace <paulm at paulm.com>:
> On Wed, Jan 13, 2010 at 07:16, Philip Newton <philip.newton at gmail.com> wrote:
>> On Wed, Jan 13, 2010 at 15:58, Dermot <paikkos at googlemail.com> wrote:
>>> 2010/1/13 Avi Greenbury <avismailinglistaccount at googlemail.com>:
>>
>> I think you're putting the cart before the horse.
>>
>> Did someone come up to you and say, "Dermot, put the SHA value in a database."?

>> I would have thought that you *need* to make sure that you detect
>> duplicate files (for example, to avoid processing "the same" file
>> twice). Storing the SHA in an SQLite file is a method you would *like*
>> to use to accomplish this, but may not be the only way nor the best
>> way.

Yet more background. *sigh* The process runs as follows:

1) A source submits some digital files.
2) Extract EXIF from digital files that may contain the name of the PDF file.
3) Find said PDF on file system.
4) DB - Have I seen this PDF before?
    Yes: Assign existing ID to the new row we're creating for its
parent record (the digital file).
    No: Assign PDF an ID, assign ID to parent record, rename,
post/upload to remote server.

The same PDF can come from a number of sources, so PDFs are not
unique to a source, and the same PDF may appear in more than one
(parent) record. The PDF exists on a remote server after that so,
you're right, I don't want to process the same file twice.
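
A rough sketch of the "have I seen this PDF before?" step, in Python
purely for illustration (the real code in this thread is Perl and is
not shown). The table name `pdf`, its columns, the helper names and
the choice of SHA-256 are all assumptions, not anything from the
actual schema.

```python
import hashlib
import sqlite3


def file_sha(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_db(db_path=":memory:"):
    """Open the database and create the (hypothetical) pdf table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf ("
        "id INTEGER PRIMARY KEY, sha TEXT NOT NULL UNIQUE)"
    )
    return conn


def find_or_create_pdf(conn, path):
    """Return (pdf_id, is_new) for the PDF at path, keyed on its digest."""
    sha = file_sha(path)
    row = conn.execute("SELECT id FROM pdf WHERE sha = ?", (sha,)).fetchone()
    if row is not None:
        return row[0], False  # seen before: reuse the existing ID
    cur = conn.execute("INSERT INTO pdf (sha) VALUES (?)", (sha,))
    conn.commit()
    return cur.lastrowid, True  # new: caller would rename and upload
```

When `is_new` comes back false, the caller would skip the rename and
upload and just attach the existing ID to the parent record.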

>> Along those lines, you may wish to store the filesize in bytes in your
>> database as well, as a first point of comparison; if the filesize is
>> unique, then the file must also be unique and you could save yourself
>> the time spent calculating a digest of the file's contents -- no
>> 1058-byte file can be the same as any 1927-byte file.

If I go with byte size and do ('PDF')->search({ file_size => 1058 })
and get 3 results, I then have to back-track, take the SHA and do
the search again. With SHA, it might be expensive but it's always
unique[1], so I can simply do ('PDF')->find_or_new(\%hash) and get
the ID back. I don't think you're suggesting that I rely on the file
size as a unique identifier, and I can see how a search with no results
might short-circuit some stuff. But I will need that SHA when I get
files of the same size, so I may as well store it from the beginning.
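
To make the trade-off concrete, here is a small Python sketch of the
size-first lookup (illustrative only; the in-memory `index` dict and
the function names are made up, and SHA-256 is an assumption). A size
that has never been seen answers "not a duplicate" without hashing,
but every stored file still gets its digest recorded up front, which
is the "may as well store it from the beginning" part.

```python
import hashlib
import os


def sha_of(path):
    """Return the hex SHA-256 digest of the file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remember(path, index):
    """Record a file; index maps size in bytes -> set of hex digests."""
    index.setdefault(os.path.getsize(path), set()).add(sha_of(path))


def seen_before(path, index):
    """True if a file with identical contents was already recorded."""
    candidates = index.get(os.path.getsize(path))
    if not candidates:
        # No stored file has this size, so it cannot be a duplicate
        # and the lookup needs no digest at all.
        return False
    # Same size as something stored: only the digest can settle it.
    return sha_of(path) in candidates
```

The saving only applies to the lookup; `remember()` still hashes every
new file so that later same-size arrivals have something to compare
against.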

> If you're storing the collision data (size, hash, whatever) to protect
> against future collisions the only way this scheme of avoiding more
> expensive ops like hashing will work (AFAICS) is if you have some
> fiddlier code to lazily hash an old file when a newer future file
> comes along that matches an existing file size.
>
>>> Incidentally I get poor results from the MD5 compared with SHA so I can't
>>> rely on MD5 for
>>
>> That's... odd. md5sum's guarantee of "same if the hashes match" isn't
>> as strong as SHA's, but I still wouldn't expect two files to md5sum
>> the same if their SHA sums don't match.
>>
>> However, those MD5 sums don't look like base-64 to me, so maybe you're
>> doing something wrong somewhere.

Yes, I'd better 'fess up here. I had a bug :P I was using the
hex_b4base() in a not too clever way. I should have been using
addfile().
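
For anyone comparing digests by eye, a short Python sketch (not the
Perl code from this thread, and not a reconstruction of the bug) that
digests a file's contents and returns the same digest in both hex and
base-64 form. The function name and the default algorithm are
arbitrary choices for the example.

```python
import base64
import hashlib


def digests(path, algorithm="md5"):
    """Digest the file's contents; return hex and base-64 encodings."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        # Feed the file's bytes to the digest, not its name.
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    raw = h.digest()
    return {
        "hex": raw.hex(),
        # Trailing "=" padding is stripped here so the value is the
        # bare base-64 text.
        "base64": base64.b64encode(raw).decode("ascii").rstrip("="),
    }
```

An MD5 digest is 16 bytes, so the hex form is 32 characters and the
unpadded base-64 form is 22, which makes it easy to tell at a glance
which encoding a stored value is in.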
Dp

[1] Give or take a collision chance of about 1x10^-64