Latent Semantic Indexing can improve your WordPress search results
Latent Semantic Indexing (LSI) can dramatically improve the quality of WordPress search results. Rather than just looking for any one of a set of keywords in the body of your posts, LSI builds a low-rank approximation of the relationship between your blog posts and the words you use. Because the reduced term space has far fewer dimensions than the original term-document matrix, words with related meaning (e.g., “Microsoft” and “Bill Gates”) become associated, and a search for one term also returns posts that are closely related to the other.
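To make the low-rank idea concrete, here is a minimal sketch in Python with NumPy. It is not the code behind the results in this post: the toy corpus, the choice of k = 2, and the helper name lsi_scores are all assumptions made for illustration. It builds a small term-document matrix, keeps the top k singular vectors, folds a query into that reduced space, and ranks documents by cosine similarity.

```python
import numpy as np

# Hypothetical toy corpus (not the blog's real posts).
terms = ["microsoft", "gates", "windows", "rap", "music", "lyrics"]
docs = [
    "microsoft windows",
    "gates windows",
    "microsoft gates windows",
    "rap music",
    "rap music lyrics",
]

# Term-document matrix of raw counts: one row per term, one column per document.
A = np.array([[doc.split().count(t) for doc in docs] for t in terms], dtype=float)


def lsi_scores(A, query_vec, k):
    """Cosine similarity between a query and every document in a rank-k space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    V_k = Vt[:k, :].T                     # one row of latent coordinates per document
    q_hat = (query_vec @ U_k) / s_k       # fold the query into the latent space
    norms = np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat)
    return (V_k @ q_hat) / norms


# Query for the single term "gates".
query = np.array([1.0 if t == "gates" else 0.0 for t in terms])

keyword_hits = query @ A                  # plain keyword matching: query-term count per doc
scores = lsi_scores(A, query, k=2)        # k=2 is an assumed rank for this toy example

for doc, hits, score in zip(docs, keyword_hits, scores):
    print(f"{doc!r}: keyword hits={int(hits)}, LSI score={score:.2f}")
```

The first document never mentions "gates", so plain keyword matching misses it, but it shares "windows" with the documents that do, and the rank-2 space places it alongside them. The rap documents land on the other latent axis and score near zero. On a real blog you would pick a much larger k and probably weight the counts (TF-IDF, for instance) before taking the SVD.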
Some examples:
Take a look at this query for “writing good code”. Naturally, you would like WordPress to return articles about coding practices, or at least about computer programming! However, the first three matches are Ludacris lyrics, ethical blogging, and finally something useful: Microsoft interview tales. Now, take a look at what I get back with LSI: Google Desktop Search, Heavyweight Categories plugin, and Things I want to do for WordPress. These, to me anyway, seem a little more relevant. And, if you do try looking for “rap music” with the LSI technique, one of your results is Pot Smokers = Psychotic. Now how relevant is that?
If you need more proof, “pop culture” gives me Paris Hilton, and “sex” returns The “really big” boys get it wrong.
Some downsides:
To do LSI, you have to create a term-document matrix, which will be really big. Mine is 12,525 x 726 and takes up 40 MB of space in full (dense) form. Of course, it’s a very sparse matrix, so you can store it in a sparse structure and save most of the space. However, you still have to compute the SVD of that huge matrix, and then do a number of expensive matrix multiplications and linear solves. In other words, LSI is a little slow for a web application. Queries on my Pentium 4 here at home take as long as a minute to run; imagine the wait on a loaded server!
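The storage numbers can be checked with a little arithmetic. The sketch below is a rough estimate, not a measurement: the 4-byte value size, the 1% fill rate, and the helper names are assumptions (the post only gives the matrix dimensions and the roughly 40 MB dense size). It also shows one simple way to hold a term-document matrix sparsely, as a dictionary keyed by (term index, document index).

```python
from collections import Counter

ROWS, COLS = 12_525, 726   # terms x documents, as quoted in the post


def dense_bytes(rows, cols, bytes_per_value=4):
    """Size of the full matrix, assuming 4-byte (single precision) values."""
    return rows * cols * bytes_per_value


def coo_bytes(nonzeros, bytes_per_value=4, bytes_per_index=4):
    """Size in coordinate form: one value plus a (row, col) index pair per entry."""
    return nonzeros * (bytes_per_value + 2 * bytes_per_index)


def build_sparse(docs):
    """Return (vocabulary, {(term_index, doc_index): count}) for a list of texts."""
    vocab = {}
    entries = {}
    for j, text in enumerate(docs):
        for term, count in Counter(text.lower().split()).items():
            i = vocab.setdefault(term, len(vocab))   # assign indices in order seen
            entries[(i, j)] = count                  # only nonzero counts are stored
    return vocab, entries


# ASSUMPTION: 1% of the entries are nonzero. The real fill rate is not given.
assumed_density = 0.01
nonzeros = int(ROWS * COLS * assumed_density)

print(f"dense:  {dense_bytes(ROWS, COLS) / 1e6:.1f} MB")
print(f"sparse: {coo_bytes(nonzeros) / 1e6:.1f} MB at {assumed_density:.0%} density")

# Tiny hypothetical corpus to show the sparse structure itself.
vocab, entries = build_sparse(["rap music", "rap rap lyrics"])
print(vocab, entries)
```

With 4-byte values the dense matrix works out to about 36 MB, which is in line with the 40 MB figure. The sparse figure depends entirely on the assumed density, so treat it as an order-of-magnitude guide. Sparse storage only fixes the memory problem, though; the SVD and the per-query multiplications still have to be paid for.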
Still, the results are astounding, and the WP devs should definitely code up a hack!
This entry was posted on Wednesday, April 27th, 2005 at 8:32 am and is tagged with google desktop search, pot smokers, ludacris lyrics, application queries, paris hilton and sex, document matrix, semantic value, rap music, term space, sparse matrix, google, multiplications, bill gates, paris hilton, lsi, big boys, latent semantic indexing, ludacris, svd, computer programming.
