steve at tightrope.demon.co.uk
Tue Apr 15 14:20:22 BST 2008
On Tue, Apr 15, 2008 at 01:46:43PM +0100, Nicholas Clark typed:
> Oh how I love pimp scum. Dear Google, once you've unfutzed your redirection
> ( http://use.perl.org/~nicholas/journal/36154 ), please could you create
> jobs.google.com, and apply your "similar pages" logic to de-dupe all the
> repeat job postings.
I have written python code (for a change) to do this.
The way it worked was to screen-scrape job postings, work through all
combinations of pairs, append each pair together and compress it; the
compression ratio was used as the metric of similarity between the two adverts.
Then I was able to group similar adverts to de-dupe them. Rather crude,
but it seemed to work.
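The trick described above is essentially the normalised compression distance. A minimal sketch of the idea in Python (this is not the original code; zlib, the distance formula, the threshold value and the greedy grouping are my assumptions):

```python
import itertools
import zlib

def compressed_size(text: str) -> int:
    """Length of the zlib-compressed bytes of the text."""
    return len(zlib.compress(text.encode("utf-8")))

def distance(a: str, b: str) -> float:
    """Normalised compression distance: near 0 for near-duplicates,
    near 1 for unrelated texts."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def pairwise_distances(adverts: list[str]) -> dict[tuple[int, int], float]:
    """Distance for every combination of pairs of adverts."""
    return {(i, j): distance(adverts[i], adverts[j])
            for i, j in itertools.combinations(range(len(adverts)), 2)}

def dedupe(adverts: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Greedily group adverts whose distance to a group's first
    member falls below the threshold; each group is one de-duped ad."""
    groups: list[list[int]] = []
    for i in range(len(adverts)):
        for group in groups:
            if distance(adverts[group[0]], adverts[i]) < threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

A repost with only small edits compresses almost as well appended to the original as the original does alone, so its distance is low; an unrelated advert adds nearly its full compressed size, so its distance is close to 1.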
It was interesting to see the diffs between adverts (e.g. where the pimp
had deleted information or changed first line support to second line support).
Steve Mynott <steve at tightrope.demon.co.uk>
More information about the london.pm mailing list