Elliott C. Back: Internet & Technology

We have a GoogleBOT

Posted in Art, Google by Elliott Back on April 28th, 2005.

The Google Blog announces a wall mural of their Googlebot at one of the datacenters:

I AM GOOGLE-BOT!!!!!!

I suppose they want to humanize their lil’ crawler.

Privacy in the marketplace

Posted in Law, Politics by Elliott Back on April 28th, 2005.

The privacy pundits over at BoingBoing are hot and bothered because a tanning salon requires fingerprint identification to authenticate its customers. In the post, the original author writes:

WAYNE: “Hi, do you require a thumbprint scan to get a tan there?”
TANNING BIMBO: “Yes, sir, we do.”

[...]

I think the Arkansas chapter of the ACLU and the Arkansas state attorney general’s office need to be contacted

I think the answer to this is that you don’t need to use that tanning salon. If you dislike their “invasion” of your biometric privacy, you’ll have to go somewhere else.

Latent Semantic Indexing can improve your WordPress search results

Posted in Code, Google, How to Blog, SEO, Search by Elliott Back on April 27th, 2005.

Latent Semantic Indexing (LSI) can improve the quality of WordPress search results dramatically. Rather than just look for any one of a set of keywords in the body of your posts, LSI creates a low-rank approximation of the term-document matrix, which captures the relationship between your blog posts and the words you use. Since the reduced space has far fewer dimensions than the original term-document matrix, words with related semantic value (e.g., “Microsoft” and “Bill Gates”) become associated, and searches for one term will also return results that are closely related to it.
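To make the idea concrete, here is a minimal sketch of LSI in Python with NumPy. It is not the code behind the results in this post: the five toy "posts", the choice of k = 2, and the helper name `lsi_search` are all assumptions made up for illustration. It builds a small term-document matrix, keeps the top k singular triplets, and scores posts against a query by cosine similarity in the reduced space.

```python
import numpy as np

# Hypothetical toy "posts"; these stand in for real blog content.
docs = [
    "microsoft windows software",
    "bill gates microsoft founder",
    "bill gates philanthropy",
    "rap music lyrics",
    "rap lyrics album",
]

# Vocabulary and term-document count matrix (rows = terms, columns = posts).
vocab = sorted({word for doc in docs for word in doc.split()})
row = {word: i for i, word in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[row[word], j] += 1.0

# Rank-k truncated SVD: A is approximated by U_k * diag(s_k) * Vt_k.
k = 2  # number of latent "concepts" kept; an arbitrary choice for this toy
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]


def lsi_search(query):
    """Return (post index, cosine similarity) pairs, best match first."""
    q = np.zeros(len(vocab))
    for word in query.lower().split():
        if word in row:
            q[row[word]] += 1.0
    # Fold the query into the concept space: q_hat = diag(s_k)^-1 * U_k^T * q
    q_hat = (U_k.T @ q) / s_k
    post_vecs = Vt_k.T  # one k-dimensional row per post
    norms = np.linalg.norm(post_vecs, axis=1) * np.linalg.norm(q_hat)
    scores = post_vecs @ q_hat / np.maximum(norms, 1e-12)
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]
```

In this toy corpus, `lsi_search("windows")` ranks all three Microsoft and Bill Gates posts ahead of the two rap posts, even though only the first post contains the word "windows". A plain keyword search would have returned that one post and nothing else.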

Some examples:

Take a look at this query for “writing good code”. Naturally, you would like WordPress to return articles about coding practices, or at least about computer programming! However, the first three matches are Ludacris lyrics, ethical blogging, and finally something useful: Microsoft interview tales. Now, take a look at what I get back with LSI: Google Desktop Search, Heavyweight Categories plugin, and Things I want to do for WordPress. These, to me anyway, seem a little more relevant. And, if you do try looking for “rap music” with the LSI technique, one of your results is Pot Smokers = Psychotic. Now how relevant is that?

If you need more proof, “pop culture” gives me Paris Hilton, and sex returns The “really big” boys get it wrong.

Some downsides:

To do LSI, you have to create a term-document matrix, which will be really big. Mine is 12,525 x 726, and takes up 40 MB of space in full form. Of course, it’s a very sparse matrix, so you can store it in a sparse structure and save most of the space. However, you still have to compute the SVD of that huge matrix, and do a number of painful matrix multiplications and solves. In other words, LSI is a little slow for a web application. Queries on my P4 here at home take as long as a minute to run; imagine the wait on a loaded server!
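The storage arithmetic can be sketched in a few lines of Python. Only the matrix dimensions come from the numbers above. The 4 bytes per entry, the coordinate (row, column, value) layout, and the 1% density are assumptions for illustration, since the actual number of nonzero entries isn't stated here.

```python
ROWS, COLS = 12_525, 726  # terms x posts, as quoted above


def dense_bytes(rows, cols, bytes_per_entry=4):
    """Size of a full matrix that stores every entry, zeros included."""
    return rows * cols * bytes_per_entry


def coo_bytes(nonzeros, bytes_per_value=4, bytes_per_index=4):
    """Size of a coordinate-format sparse matrix.

    Each nonzero entry costs one value plus a row index and a column index.
    """
    return nonzeros * (bytes_per_value + 2 * bytes_per_index)


# Assumed density: 1% of entries are nonzero (hypothetical, not measured).
assumed_nonzeros = ROWS * COLS // 100

print(dense_bytes(ROWS, COLS) / 1e6)    # dense size in MB
print(coo_bytes(assumed_nonzeros) / 1e6)  # sparse size in MB under the assumption
```

With 4-byte entries the dense figure comes out to about 36 MB, which is in the same ballpark as the 40 MB quoted above. Under the assumed 1% density the coordinate layout needs only about 1 MB. The sparse layout fixes the storage cost, not the cost of the SVD itself.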

Still, the results are astounding, and the WP devs should definitely code up a hack!
