Blog Spam (Was: Telecommuting)

Tue Dec 13 17:43:20 GMT 2011

On Tuesday, December 13, 2011 at 09:34:53 AM, Paul Makepeace wrote:
> On Tue, Dec 13, 2011 at 14:19, Ash Berlin <ash_cpan at firemirror.com> wrote:
> 
> > > One my blog got hit with (if 300 counts as a hit) was a series of
> > > short comments like that but with exact one misspelling consisting of
> > > a letter transposition. No link associated with it. Quite weird -
> > > wasn't sure what it was trying to achieve, maybe poisoning
> > > bayes/cluster filters with broken (but unusual) words so that they'd
> > > be "learnt" over time as signals for ham.
> > >
> > "Never attribute to malice that which can be explained through
> > incompetency."
> >

Many years ago when the Bayesian approach to spam detection made its way onto
the scene, I argued with my coworkers that a spammer would do well to attempt
to poison+ the filters even at the cost of bandwidth.

The idea was, if you found a domain using a learning filter, you would send
legitimate-looking email (possibly using your own adaptive message
chooser/builder) with the aim of having the user classify your messages as
spam rather than have their inbox flooded with bogus messages.  The filter
then learns ordinary words as spam signals, at which point it becomes useless
because of its high false-positive rate.
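
To make the poisoning idea concrete, here is a rough sketch with a toy
word-level naive Bayes filter. Everything in it (the NaiveBayesFilter class,
the demo function, the sample messages and counts) is made up for
illustration; it is not how any particular real filter works, just the
general mechanism under those assumptions.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy word-level naive Bayes filter with add-one smoothing (hypothetical)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # The user's "this is spam" / "this is ham" click ends up here.
        self.words[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_score(self, text):
        """Log-odds of spam versus ham; a positive value means spam."""
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        spam_total = sum(self.words["spam"].values())
        ham_total = sum(self.words["ham"].values())
        # Smoothed class prior from the number of trained messages.
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in text.lower().split():
            p_spam = (self.words["spam"][word] + 1) / (spam_total + vocab)
            p_ham = (self.words["ham"][word] + 1) / (ham_total + vocab)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_score(text) > 0


def demo():
    f = NaiveBayesFilter()
    for _ in range(5):
        f.train("meeting agenda for the project review", "ham")
        f.train("cheap pills buy now", "spam")

    legit = "project review meeting agenda"
    before = f.is_spam(legit)

    # Poisoning: ordinary-looking mail that the annoyed user marks as spam.
    for _ in range(20):
        f.train("agenda for the project review meeting", "spam")

    after = f.is_spam(legit)
    return before, after
```

Calling demo() returns the verdict on the same legitimate message before and
after the poisoning run; in this toy setup it flips from ham to spam, which
is the false positive the argument relies on.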

> 
> Yeah, it wasn't the same data each time, it was a lot of different phrases,
> some of which were quite idiomatic. But: exactly one transposition
> misspelling (in a different word each time)--looked too deliberate to be a
> mistake/incompetency
> 

My take on the blog spam thing is that it would be more effective to generate
seemingly legitimate responses to the original blog post, using something
like the postmodern generator* fed with the links the post contained and
other training data gathered from the Internet (a wiki, for example). At some
point in the generated discussion there would be a link to a generated blog
containing similar information. The generated discussion would continue with
references to the generated blog, which might in turn contain links to other
generated blogs.
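
For anyone who hasn't seen how a grammar-driven generator works, here is a
minimal sketch. As I understand it, the postmodern generator is built on
recursive grammars; what follows is a much simpler stand-in, not that
program. The GRAMMAR table, expand, and make_comment are all hypothetical
names, and the topics list stands in for keywords pulled from the original
post.

```python
import random

# Hypothetical toy grammar: each key is a nonterminal, each value a list of
# alternative expansions. Anything that is not a key is emitted verbatim.
GRAMMAR = {
    "COMMENT": [["OPENER", "CLAIM"]],
    "OPENER": [["Interesting", "post."], ["Good", "point."]],
    "CLAIM": [["The", "part", "about", "TOPIC", "VERDICT"]],
    "VERDICT": [
        ["deserves", "more", "attention."],
        ["matches", "what", "I", "have", "seen."],
    ],
}


def expand(symbol, grammar, rng):
    """Recursively expand a symbol into a flat list of words."""
    if symbol not in grammar:
        return [symbol]  # terminal: plain text
    words = []
    for part in rng.choice(grammar[symbol]):
        words.extend(expand(part, grammar, rng))
    return words


def make_comment(topics, seed=None):
    """Build one comment, filling the TOPIC slot from the given keywords."""
    rng = random.Random(seed)
    grammar = dict(GRAMMAR, TOPIC=[[t] for t in topics])
    return " ".join(expand("COMMENT", grammar, rng))
```

Something like make_comment(["Bayesian filters"], seed=1) gives one short
comment that mentions the topic; a real generator would need a far larger
grammar and better source material to pass for a human.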

The generated blogs themselves would be replete with the target link(s). 

The postmodern self-generating Internet---the blog spam wars.

-r

+ "poisson", or fish, the filters, as I wanted to call it at the time.
* http://www.csse.monash.edu.au/publications/1996/tr-cs96-264.ps.gz