SHA question

Wed Jan 13 13:12:28 GMT 2010

2010/1/13 Roger Burton West <roger at firedrake.org>:
> On Wed, Jan 13, 2010 at 12:44:47PM +0000, Dermot wrote:
>
>>I have a lot of PDFs that I need to catalogue and I want to ensure
>>the uniqueness of each PDF.  At LWP, Jonathan Rockway mentioned
>>something similar with SHA1 and binary files.  Am I right in thinking
>>that the code below is only taking the SHA on the name of the file and
>>if I want to ensure uniqueness of the content I need to do something
>>similar but as a file blob?
>
> Yes.
>
> You may want to be slightly cleverer about it - taking a SHAsum is
> computationally expensive, and it's only worth doing if the files have
> the same size.

Unfortunately the size varies quite a bit. There are a few 11MB PDFs,
but the majority are under 1MB. This application isn't for public
consumption, so I don't have to worry about speed. However, there are
other services on the server, and I wouldn't want to blindly slurp a
50MB PDF.
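
A minimal sketch of the size-then-digest idea, assuming the core
Digest::SHA module is available. The name find_duplicates is
hypothetical, not anything from the thread. addfile() reads the file in
chunks, so even a 50MB PDF is never slurped whole, and files whose size
is unique are never hashed at all.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: given a list of paths, return array refs,
# each holding the paths of files whose content is identical.
sub find_duplicates {
    my @files = @_;

    # Group by size first; a file with a unique size cannot have a duplicate.
    my %by_size;
    for my $file (@files) {
        my $size = (stat $file)[7];
        next unless defined $size;    # skip paths that cannot be stat'ed
        push @{ $by_size{$size} }, $file;
    }

    # Only digest files that share their size with at least one other file.
    my %by_digest;
    for my $group (grep { @$_ > 1 } values %by_size) {
        for my $file (@$group) {
            my $sha = Digest::SHA->new(1);    # 1 selects SHA-1
            $sha->addfile($file, 'b');        # 'b' = binary mode, chunked read
            push @{ $by_digest{ $sha->hexdigest } }, $file;
        }
    }

    return grep { @$_ > 1 } values %by_digest;
}
```

Called as find_duplicates(glob "*.pdf"), it would return one array ref
per set of identical files. Since the sizes here vary a lot, most files
would probably be skipped by the size check alone.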

> If you don't require a pure-Perl solution, bear in mind that all this
> has been done for you in the "fdupes" program, already in Debian or at
> http://netdial.caribe.net/~adrian2/programs/ .

I am using it in a Perl class, but if I could shell out to fdupes with
system() or backticks, that might be preferable. I'll try building the
sources and see what happens. Failing that, I'll have to fall back to
slurping and SHA or MD5.
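
A rough sketch of calling fdupes from Perl, assuming it is installed and
on the PATH. system() only returns the exit status, so capturing the
list of duplicates needs backticks or a pipe open. The sub names are
hypothetical. The parser relies on fdupes' default output, which prints
one path per line with a blank line between sets, and it would be
confused by filenames that contain newlines.

```perl
use strict;
use warnings;

# Parse fdupes' default output: one path per line,
# with duplicate sets separated by blank lines.
sub parse_fdupes_output {
    my ($text) = @_;
    my @sets;
    for my $block (split /\n{2,}/, $text) {
        my @paths = grep { length } split /\n/, $block;
        push @sets, \@paths if @paths > 1;
    }
    return @sets;
}

# Hypothetical wrapper: run "fdupes -r" on a directory and
# return the duplicate sets as array refs.
sub fdupes_sets {
    my ($dir) = @_;

    # List-form pipe open bypasses the shell, so odd filenames are safe.
    open my $fh, '-|', 'fdupes', '-r', $dir
        or die "Cannot run fdupes: $!";
    my $output = do { local $/; <$fh> };
    close $fh or die "fdupes exited with an error";

    return parse_fdupes_output(defined $output ? $output : '');
}
```

Keeping the parsing separate from the call means the class can be
tested without fdupes installed.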

Thanx,
Dp.