SHA question

Dermot paikkos at googlemail.com
Wed Jan 13 13:12:28 GMT 2010


2010/1/13 Roger Burton West <roger at firedrake.org>:
> On Wed, Jan 13, 2010 at 12:44:47PM +0000, Dermot wrote:
>
>>I have a lot of PDFs that I need to catalogue and I want to ensure
>>the uniqueness of each PDF.  At LWP, Jonathan Rockway mentioned
>>something similar with SHA1 and binary files.  Am I right in thinking
>>that the code below is only taking the SHA on the name of the file and
>>if I want to ensure uniqueness of the content I need to do something
>>similar but as a file blob?
>
> Yes.
>
> You may want to be slightly cleverer about it - taking a SHAsum is
> computationally expensive, and it's only worth doing if the files have
> the same size.

Unfortunately the size varies quite a bit. There are a few 11MB PDFs
but the majority are under 1MB. This application isn't for public
consumption so I don't have to worry about speed. However, there are
other services on the server and I wouldn't want to blindly slurp a
50MB PDF, I guess.
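
Hashing the content doesn't have to mean slurping, though. A minimal
sketch, assuming Digest::SHA is installed; file_sha1 is a made-up
helper name, not something from the original code. addfile() reads
the file in blocks, so memory use should stay small even for a large
PDF:

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: SHA-1 of a file's content (not its name),
# computed without loading the whole file into memory.
sub file_sha1 {
    my ($path) = @_;
    my $sha = Digest::SHA->new(1);    # 1 selects SHA-1
    $sha->addfile($path, 'b');        # 'b' = binary mode, read block by block
    return $sha->hexdigest;
}

# Print "digest  filename" for each file named on the command line.
print file_sha1($_), "  $_\n" for @ARGV;
```

Two PDFs with the same bytes get the same digest whatever they are
called, which is the uniqueness check I'm after.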

> If you don't require a pure-Perl solution, bear in mind that all this
> has been done for you in the "fdupes" program, already in Debian or at
> http://netdial.caribe.net/~adrian2/programs/ .

I am using it in a Perl class but if I could system("fdupes") that
might be preferable. I'll try building the sources and see what
happens. Failing that I'll have to fall back to slurping and SHA or
MD5.
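
If it does come to the pure-Perl fallback, Roger's point about sizes
could be sketched roughly like this: bucket the files by size first
and only compute a digest for files that share a size with at least
one other file. This assumes Digest::SHA is available;
find_duplicates is a hypothetical name, and this is only an outline,
not a replacement for fdupes:

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: return a list of array refs, each holding the
# paths of files whose content is identical.
sub find_duplicates {
    my @paths = @_;

    # Pass 1: bucket by size, which costs one stat() per file.
    my %by_size;
    for my $path (grep { -f $_ } @paths) {
        my $size = (stat $path)[7];
        push @{ $by_size{$size} }, $path;
    }

    # Pass 2: hash only the files that share a size with another file.
    my %by_digest;
    for my $size (keys %by_size) {
        my @same_size = @{ $by_size{$size} };
        next if @same_size < 2;       # unique size, so unique content
        for my $path (@same_size) {
            my $digest = Digest::SHA->new(1)->addfile($path, 'b')->hexdigest;
            push @{ $by_digest{"$size:$digest"} }, $path;
        }
    }

    return grep { @$_ > 1 } values %by_digest;
}

# Report each group of duplicates among the files given as arguments.
print join(' ', sort @$_), "\n" for find_duplicates(@ARGV);
```

Files with a unique size are never read at all, so most of the
catalogue would only cost a stat().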

Thanx,
Dp.
