SHA question

Thu Jan 14 14:16:22 GMT 2010

On Wed, Jan 13, 2010 at 3:16 PM, Philip Newton <philip.newton at gmail.com> wrote:

> Along those lines, you may wish to store the filesize in bytes in your
> database as well, as a first point of comparison; if the filesize is
> unique, then the file must also be unique and you could save yourself
> the time spent calculating a digest of the file's contents -- no
> 1058-byte file can be the same as any 1927-byte file.

This only works if you've still got all the PDFs on disk, because as
soon as you find a suspected duplicate you'll have to hash both
files' contents to tell whether it really is one.  If you've sent them
on to a better place and deleted them, however, then you're out of luck.
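
For what it's worth, the size-first comparison could be sketched
roughly like this, assuming both files are still on disk.  The helper
names file_digest and same_file are made up for illustration; they
aren't from any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: base64 MD5 digest of a file's raw bytes.
sub file_digest {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->b64digest;
    close $fh;
    return $digest;
}

# Hypothetical helper: compare sizes first and only hash when they match.
sub same_file {
    my ($left, $right) = @_;
    my $left_size  = (stat $left)[7];
    my $right_size = (stat $right)[7];
    die "Cannot stat $left or $right\n"
        unless defined $left_size && defined $right_size;

    # Files of different sizes can never be identical, so skip the hashing.
    return 0 if $left_size != $right_size;

    return file_digest($left) eq file_digest($right) ? 1 : 0;
}

if (@ARGV == 2) {
    print same_file(@ARGV) ? "same contents\n" : "different contents\n";
}
```

Matching digests only make a duplicate very likely; a byte-by-byte
comparison would be needed if you wanted to be certain.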

I'd just use Digest::MD5 to calculate a digest of each file.  It's cheap
compared to SHA, you don't care about the exact cryptographic security
of the hash, and it will work even if you don't have the original to
compare against.

#!/usr/bin/perl

use Modern::Perl;
use autodie;
use Digest::MD5;

my $filename = shift;
open my $fh, "<:raw", $filename;
my $md5 = Digest::MD5->new;
$md5->addfile($fh);
say "The file's md5 is: " . $md5->b64digest;

Don't forget the "<:raw" (you're comparing bytes, not characters, and
you don't want newline translation on platforms that do it).
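
To show how the stored digest replaces the original file, here is a
minimal sketch that checks each file against digests it has already
seen.  The %seen hash is only a stand-in for whatever database table
you use, and check_file is a hypothetical name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# digest => first filename seen with it; stands in for a database table.
my %seen;

# Hypothetical helper: returns the earlier filename if this file's digest
# has been seen before, otherwise records the digest and returns undef.
sub check_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->b64digest;
    close $fh;

    return $seen{$digest} if exists $seen{$digest};
    $seen{$digest} = $path;
    return;
}

for my $path (@ARGV) {
    my $earlier = check_file($path);
    if (defined $earlier) {
        print "$path looks like a duplicate of $earlier\n";
    }
    else {
        print "$path is new\n";
    }
}
```

Only the digest has to be kept, so the earlier file can be long gone by
the time a later copy turns up.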

Once you've got a toy version up and running and you can get a "feel"
for how fast it is on your system, you can optimise if you don't like
the performance.
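
If you'd like numbers rather than a feel, something along these lines
could compare MD5 against SHA on one of your own files.  It's only a
sketch: digest_with is a hypothetical helper, and the two-second run
time per candidate is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Digest::MD5;
use Digest::SHA;

# Hypothetical helper: feed a file's raw bytes to the given digest object.
sub digest_with {
    my ($digester, $path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $digest = $digester->addfile($fh)->b64digest;
    close $fh;
    return $digest;
}

if (@ARGV) {
    my $filename = $ARGV[0];

    # A negative count asks Benchmark to run each candidate for at
    # least that many CPU seconds.
    cmpthese(-2, {
        md5    => sub { digest_with(Digest::MD5->new,      $filename) },
        sha1   => sub { digest_with(Digest::SHA->new(1),   $filename) },
        sha256 => sub { digest_with(Digest::SHA->new(256), $filename) },
    });
}
```

After the first pass the file will probably be served from the
operating system's cache, so this mostly measures hashing cost rather
than disk speed.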

Mark.