paikkos at googlemail.com
Wed Jan 13 13:12:28 GMT 2010
2010/1/13 Roger Burton West <roger at firedrake.org>:
> On Wed, Jan 13, 2010 at 12:44:47PM +0000, Dermot wrote:
>>I have lots of PDFs that I need to catalogue and I want to ensure
>>the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned
>>something similar with SHA1 and binary files. Am I right in thinking
>>that the code below is only taking the SHA on the name of the file and
>>if I want to ensure uniqueness of the content I need to do something
>>similar but as a file blob?
> You may want to be slightly cleverer about it - taking a SHAsum is
> computationally expensive, and it's only worth doing if the files have
> the same size.
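To illustrate the size-first approach Roger describes, here is a sketch (not code from the thread): group candidate files by byte size with `-s`, and only run SHA-1 on files that share a size with at least one other file. `Digest::SHA`'s `addfile` streams the file, so nothing is slurped into memory. The function name and return shape are my own invention.

```perl
use strict;
use warnings;
use Digest::SHA;

# Return a hashref: SHA-1 hex digest => arrayref of files with that content.
# Files whose size is unique are never hashed at all, since a unique size
# guarantees unique content.
sub find_duplicates {
    my @files = grep { -f } @_;
    my (%by_size, %by_digest);
    push @{ $by_size{ -s $_ } }, $_ for @files;
    for my $group (values %by_size) {
        next if @$group < 2;                  # unique size => skip hashing
        for my $file (@$group) {
            my $sha = Digest::SHA->new(1);    # SHA-1
            $sha->addfile($file, 'b');        # 'b' = binary mode, streamed
            push @{ $by_digest{ $sha->hexdigest } }, $file;
        }
    }
    # Keep only digests shared by two or more files.
    return { map  { $_ => $by_digest{$_} }
             grep { @{ $by_digest{$_} } > 1 } keys %by_digest };
}
```

For a directory of mostly sub-1MB PDFs this hashes only the size collisions, which is usually a small fraction of the files.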
Unfortunately the size varies quite a bit. There are a few 11MB PDFs,
but the majority are under 1MB. This application isn't for public
consumption, so I don't have to worry about speed. However, there are
other services on the server, and I wouldn't want to blindly slurp a
50MB PDF, I guess.
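Slurping isn't actually necessary for hashing: a file of any size can be digested in fixed-size chunks so a 50MB PDF never sits in memory at once. A minimal sketch (the function name and chunk size are my choices, not from the thread):

```perl
use strict;
use warnings;
use Digest::SHA;

# Compute the SHA-1 of a file by reading it in 64KB chunks,
# keeping memory use constant regardless of file size.
sub sha1_of_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my $sha = Digest::SHA->new(1);
    while (read($fh, my $buf, 64 * 1024)) {
        $sha->add($buf);    # feed each chunk into the running digest
    }
    close $fh;
    return $sha->hexdigest;
}
```

In practice `Digest::SHA`'s own `addfile` method does the same streaming internally, so the explicit loop is only needed if you want control over the read size.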
> If you don't require a pure-Perl solution, bear in mind that all this
> has been done for you in the "fdupes" program, already in Debian or at
> http://netdial.caribe.net/~adrian2/programs/ .
I am using it in a Perl class, but if I could system(`fdupes`) that
might be preferable. I'll try building the sources and see what
happens. Failing that, I'll have to fall back to slurping and SHA or