marvin at rectangular.com
Wed Aug 16 16:16:29 BST 2006
On Aug 16, 2006, at 3:50 AM, Nicholas Clark wrote:
> 3: The job adverts are often rather free-form - it's possible to
> search on
> keywords (eg "perl") but that brings up every job that mentions
> skill, even if it's a "nice to have".
I really wish that jobs.perl.org had ranked search and search-keyword
highlighting rather than just substring search. I sometimes search
there for "search" :) and then I have to trawl through all the
documents which contain "research".
I'm tempted to ping Ask about it. But then, I'm also tempted to
implement decent search engines for Perl Monks, the perl.org mailing
list archives, etc., and I only have so much time. :)
> Basically, I'd like to be able to quickly filter 200 job ads down
> to the 10
> worth reading. Search engines such as Google let me do this for the
> entire Internet, including finding "similar" pages and not showing
> them all
> - how come this sort of technology isn't there for job adverts?
While Google is a black box so you never know, the "Similar pages"
function appears to be based on LSA or Latent Semantic Analysis, also
known as LSI or Latent Semantic Indexing. Here's an excellent
LSA is basically a clever way of approximating the results of a pure
vector-space search engine (see
<http://www.perl.com/pub/a/2003/02/19/engine.html>) for large corpuses.
Pure vector-space search engines
map all documents into an N-dimensional space, where N is the number
of unique terms in the corpus. Similar documents appear near each
other in this space.
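
To make the vector-space idea concrete, here is a minimal sketch in Python. It is not LSA and not how any particular engine does it: each document becomes a sparse vector of raw term counts, and "nearness" is the cosine of the angle between two vectors. The three ads and the names term_vector, cosine and most_similar are made up for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    # One dimension per unique term; the value is the raw term count.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine of the angle between two sparse term vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Illustrative corpus (assumed data, not real job ads).
docs = {
    "ad1": "perl developer wanted for search engine work",
    "ad2": "senior perl developer for search engine team",
    "ad3": "java architect for banking research project",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

def most_similar(name):
    # Nearest other document in term space.
    others = [(cosine(vectors[name], vectors[o]), o)
              for o in vectors if o != name]
    return max(others)[1]

print(most_similar("ad1"))
```

Because the text is split into whole words, "research" in ad3 does not match "search" in ad1, which is the behaviour the substring search lacks.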
The results are akin to what you would get if you could enter the
entire contents of a document into a search box. It's possible to
hack a poor-man's version of this into KinoSearch, by porting the
MoreLikeThisQuery from Java Lucene's contrib section, which sucks the
rarest terms out of a document and searches on those.
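
A rough sketch of that "rarest terms" trick, in Python. This is not the Lucene MoreLikeThisQuery code and not a KinoSearch API; it only shows the idea under simple assumptions: count how many documents each term appears in, pull the least common terms out of one document, and use them as the query. The corpus and the helpers rarest_terms and search are hypothetical.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def document_frequency(corpus):
    # Number of documents each term appears in.
    df = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))
    return df

def rarest_terms(doc, corpus, k=3):
    # Hypothetical helper: the k terms of `doc` found in the fewest
    # documents, with ties broken alphabetically.
    df = document_frequency(corpus)
    terms = set(tokenize(doc))
    return sorted(terms, key=lambda t: (df[t], t))[:k]

def search(query_terms, corpus):
    # Rank documents by how many of the query terms they contain.
    wanted = set(query_terms)
    scored = []
    for i, text in enumerate(corpus):
        hits = len(wanted & set(tokenize(text)))
        if hits:
            scored.append((hits, i))
    return [i for hits, i in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Illustrative corpus (assumed data).
corpus = [
    "perl job in london for a perl developer",
    "perl job in leeds building a search engine",
    "java job in london for a developer",
    "perl developer wanted for search engine indexing in london",
]

query = rarest_terms(corpus[1], corpus)
print(query, search(query, corpus))
```

The source document always comes back first, so a real "more like this" feature would drop it from the results.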
That kind of search can produce some pretty bizarre results, though.
Proper names are highly discriminating, high-value search terms
(high IDF, or Inverse Document Frequency), so if you happen to get a
couple of documents which both contain "Nicholas" and "Clark", those
documents will appear quite near each other in vector space, and may
overwhelm your initial search for "perl job".
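
Here is a small Python sketch of that effect. It assumes the textbook IDF formula log(N / df) and a deliberately crude score (the sum of IDF weights of the shared terms), which is a simplification and not what KinoSearch or Lucene actually compute. The five documents are invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def idf_table(corpus):
    # IDF = log(N / df): terms found in few documents get a high weight.
    n = len(corpus)
    df = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))
    return {t: math.log(n / d) for t, d in df.items()}

def score(query, text, idf):
    # Crude score: sum of IDF weights of the query terms present in `text`.
    present = set(tokenize(text))
    return sum(idf.get(t, 0.0) for t in set(tokenize(query)) if t in present)

# Illustrative corpus (assumed data).
corpus = [
    "perl job in london",
    "perl job for nicholas clark in leeds",
    "perl job in leeds",
    "cooking tips from nicholas clark",
    "perl job in bristol",
]

idf = idf_table(corpus)
query = corpus[1]  # "more like this" on the second ad

for text in corpus:
    print(round(score(query, text, idf), 3), text)
```

With this data "nicholas" and "clark" each weigh about four times as much as "perl" or "job", so the cooking document scores higher against the query than the plain "perl job in leeds" ad does.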