Elliott C. Back: Technology FTW!

DoS vs. Index Retrieval

Posted in Computers & Technology, Search by Elliott Back on November 18th, 2005.

The incident at Technorati, where a determined data miner is using a dictionary of keywords to peek into the Technorati index, is not really an attack. And it’s only a denial of service by coincidence, because you guys can’t handle the volume! I have to ask: what’s wrong with having someone query your index in breadth and depth? What’s wrong with automating it to pull useful data? Nothing. It’s a public service. Of course, Technorati can decide that it doesn’t want to serve IPs x, y, and z, but as long as they allow you in, you should be fine searching for anything you want.

Something about their attitude towards their clients here bothers me. Shouldn’t they be more open? Why are these data-miners being called hackers?

This entry was posted on Friday, November 18th, 2005 at 9:30 am and is tagged with data miners, using a dictionary, index retrieval, data miner, denial of service, technorati, breadth, coincidence, peek, hackers, attitude.

Viewing 2 Comments

    Elliott,

    You make some interesting observations, and I wanted to respond.

    First of all, my bad for using the word "hacker". The word has various connotations, both positive and negative, none of which is really appropriate for this situation. After all, some of my best friends are hackers. Thanks for pointing this out. I updated my post accordingly.

    On the question of people trying to mine our databases, it's really a question of our Terms of Service. Our service is intended for non-commercial use by individuals. Like most public search engines, our service is advertising-sponsored. When someone uses our public-facing service in some other way, they are violating our Terms of Service. I'm quite sure that any other public search engine would view this issue very similarly.

    As for being open, we really try to be. We have a full-featured API that hundreds of developers have used to build some very useful applications. We have given snapshots of our data to academic and corporate researchers, in order to produce greater value from it. Use of our API and data is free for non-commercial use. We are open to commercial relationships as well, but those need to be negotiated upfront. Simply put, if someone wants our data, they should just ask us.

    On the question of our ability to handle query volume, let me say that we work hard every day, not to mention spending a lot of money every month, to ensure we have adequate capacity to deliver good performance to our users.

    In this case, it was not the query volume that gave us trouble. Rather, these data-mining programs have the effect of subverting our caches. Like most high-performance, high-volume services, we build caches based on expected service usage. These programs fall well outside expected usage patterns, and that's what gave us trouble.

    I hope this helps you understand our position a little better. Thanks for taking the time to write about us. As always, we really appreciate the input.

    Adam Hertz
    Vice President of Engineering
    Technorati, Inc.

    Thanks for the response, Adam. It certainly clarifies a lot of what you were thinking at the time you made that post, and hearing about the cache subversion is interesting. I wonder what kind of usage pattern they must have been using to cause cache subversion. Probably very high-speed mirroring. It's one thing to get slowly crawled; it's another to get mirrored.
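
The cache-subversion point in the comments can be illustrated with a toy simulation. This is only a sketch under loud assumptions: nothing here describes Technorati's real caching, and the LRU policy, the vocabulary size, the cache size, the 1/rank popularity curve, and the function names are all made up for illustration. It compares the hit rate ordinary users see on their own, and the hit rate they see when a miner walks the whole dictionary in order between their queries.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Fixed-size least-recently-used cache (an assumed policy, for illustration)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, key):
        """Return True on a hit; on a miss, insert the key and evict the oldest entry if full."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return True
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used entry
        return False


def skewed_queries(vocab_size, count, rng):
    """Hypothetical 'expected' traffic: the term of rank r is queried with weight 1/r."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    return rng.choices(range(vocab_size), weights=weights, k=count)


def user_hit_rate(user_queries, vocab_size, capacity, miner_queries_per_user_query):
    """Hit rate seen by users while a miner scans the dictionary in order alongside them."""
    cache = LRUCache(capacity)
    hits = 0
    scan_position = 0
    for term in user_queries:
        hits += cache.lookup(term)
        for _ in range(miner_queries_per_user_query):
            # The miner's lookups rarely hit, and each miss pushes out a cached entry.
            cache.lookup(scan_position % vocab_size)
            scan_position += 1
    return hits / len(user_queries)


VOCAB_SIZE, CACHE_SIZE, USER_QUERIES = 10_000, 1_000, 50_000  # made-up sizes
rng = random.Random(2005)
queries = skewed_queries(VOCAB_SIZE, USER_QUERIES, rng)

baseline = user_hit_rate(queries, VOCAB_SIZE, CACHE_SIZE, miner_queries_per_user_query=0)
with_miner = user_hit_rate(queries, VOCAB_SIZE, CACHE_SIZE, miner_queries_per_user_query=10)

print(f"user hit rate without miner: {baseline:.2f}")
print(f"user hit rate with miner:    {with_miner:.2f}")
```

In this model the miner does more than add load. Its in-order scan keeps evicting the popular terms the cache was sized for, so a larger share of ordinary user queries misses the cache and falls through to the backend.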
 
