Perl Vacancy

Marvin Humphrey marvin at
Wed Aug 16 16:16:29 BST 2006

On Aug 16, 2006, at 3:50 AM, Nicholas Clark wrote:

> 3: The job adverts are often rather free-form - it's possible to search on
>    keywords (eg "perl") but that brings up every job that mentions that
>    skill, even if it's a "nice to have".

I really wish that had ranked search and search-keyword  
highlighting rather than just substring search.  I sometimes search  
there for "search" :) and then I have to trawl through all the  
documents which contain "research".
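To make the "search" vs. "research" problem concrete, here is a minimal Python sketch contrasting substring matching with whole-word matching. The function names and the two sample ads are made up for illustration; a real engine would also do stemming, ranking and so on.

```python
import re

def substring_match(query, text):
    """Naive substring search: 'search' also hits 'research'."""
    return query.lower() in text.lower()

def token_match(query, text):
    """Whole-word search: split the text into word tokens, then compare."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return query.lower() in tokens

# Hypothetical sample ads, not taken from any real job board.
ads = [
    "Perl developer wanted for search engine work",
    "Research assistant, some Perl a plus",
]

substring_hits = [ad for ad in ads if substring_match("search", ad)]
token_hits = [ad for ad in ads if token_match("search", ad)]
```

With these two ads, the substring version returns both, while the token version returns only the first.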

I'm tempted to ping Ask about it.  But then, I'm also tempted to  
implement decent search engines for Perl Monks, the mailing  
list archives, etc., and I only have so much time.  :)

> Basically, I'd like to be able to quickly filter 200 job ads down to the 10
> worth reading. Search engines such as Google let me do this for the
> entire Internet, including finding "similar" pages and not showing them all
> - how come this sort of technology isn't there for job adverts?

While Google is a black box so you never know, the "Similar pages"  
function appears to be based on LSA or Latent Semantic Analysis, also  
known as LSI or Latent Semantic Indexing.  Here's an excellent  
introduction: <>.

LSA is basically a clever way of approximating the results of a pure  
vector-space search engine (see < 
engine.html>) for large corpuses. Pure vector-space search engines  
map all documents into an N-dimensional space, where N is the number  
of unique terms in the corpus.  Similar documents appear near each  
other in this space.
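As a rough illustration of the vector-space idea, the following Python sketch maps a toy corpus into term space and compares documents by cosine similarity. The TF-IDF weighting (with +1 smoothing), the whitespace tokenizer and the sample documents are all assumptions chosen for brevity. This is not how Google, LSA, or KinoSearch actually do it.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: plain whitespace tokenization is good enough for a demo.
    return text.lower().split()

def build_vectors(docs):
    """Map each document to a sparse TF-IDF vector keyed by term."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

def cosine(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical three-document corpus.
docs = [
    "perl developer job in london",
    "senior perl programmer job london",
    "java architect role in new york",
]
vecs = build_vectors(docs)
sim_perl_jobs = cosine(vecs[0], vecs[1])
sim_perl_java = cosine(vecs[0], vecs[2])
```

Here the two Perl ads share three terms and end up much closer to each other than either is to the Java ad, which is the "similar documents appear near each other" property in miniature.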

The results are akin to what you would get if you could enter the  
entire contents of a document into a search box.  It's possible to  
hack a poor-man's version of this into KinoSearch, by porting the  
MoreLikeThisQuery from Java Lucene's contrib section, which sucks the  
rarest terms out of a document and searches on those.
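The "suck out the rarest terms" step can be sketched in a few lines of Python. To be clear, this is not the Lucene MoreLikeThis code or any KinoSearch API. The function name rarest_terms, its parameters and the toy corpus are hypothetical, and the real thing applies more heuristics (minimum frequencies, word length limits, stop words).

```python
import math
from collections import Counter

def rarest_terms(doc_tokens, doc_freq, n_docs, max_terms=3):
    """Return the document's terms with the highest IDF (rarest in the corpus)."""
    scored = []
    for term in set(doc_tokens):
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term not indexed; nothing to search on
        idf = math.log(n_docs / df)
        scored.append((idf, term))
    # Highest IDF first; break ties alphabetically so the result is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]

# Hypothetical pre-tokenized corpus of job ads.
corpus = [
    "perl job in london mod_perl".split(),
    "perl job in manchester".split(),
    "java job in london".split(),
    "perl contract in leeds".split(),
]
doc_freq = Counter()
for tokens in corpus:
    doc_freq.update(set(tokens))

# Terms to feed back into an ordinary keyword search for "more like this".
query_terms = rarest_terms(corpus[0], doc_freq, len(corpus), max_terms=2)
```

For the first ad this picks "mod_perl" and "london", since "perl", "job" and "in" are too common in this corpus to discriminate.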

That kind of search can produce some pretty bizarre results, though.   
Proper names are extremely discriminating, high-value search terms
(high IDF, or Inverse Document Frequency), so if you happen to get a
couple of documents which both contain "Nicholas" and "Clark", those  
documents will appear quite near each other in vector space, and may  
overwhelm your initial search for "perl job".
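A small Python sketch of that effect, under simplifying assumptions: IDF is computed as log(N/df), every term counts once per document, and "nearness" is just the sum of squared IDF over shared terms (an unnormalized dot product). The five documents are invented for the example.

```python
import math
from collections import Counter

def idf_table(docs):
    """IDF per term: log(number of docs / number of docs containing the term)."""
    n_docs = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d.lower().split()))
    return {t: math.log(n_docs / df[t]) for t in df}

def overlap_score(a, b, idf):
    """Unnormalized similarity: squared IDF summed over the terms both docs share."""
    shared = set(a.lower().split()) & set(b.lower().split())
    return sum(idf[t] ** 2 for t in shared)

# Hypothetical corpus: four job ads and one unrelated talk announcement.
docs = [
    "perl job posted by nicholas clark",
    "perl job in london",
    "perl job remote",
    "perl job contract",
    "talk by nicholas clark on the perl core",
]
idf = idf_table(docs)

score_vs_talk = overlap_score(docs[0], docs[4], idf)  # shares the proper name
score_vs_job = overlap_score(docs[0], docs[1], idf)   # shares "perl job"
```

In this corpus "perl" appears everywhere and gets an IDF of zero, while "nicholas" and "clark" appear in only two documents, so the talk announcement scores as more similar to the first ad than another Perl job ad does.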

Marvin Humphrey
Rectangular Research
