If you’ve been reading any tech news today, you probably heard that Robert Scoble was banned from Facebook for hacking it with an automated scraper to get his Facebook friends into Plaxo. Later today, Facebook reinstated his account after warning him to “refrain from running these types of scripts again.”
What was Scoble after? Your names, email addresses, and birthdays. Information that he is allowed to access inside Facebook, but which many of his 5,000 so-called friends might not want hauled outside and stored with another company. Buzzmachine is right when they label him an identity thief in What he says:
I want Facebook to protect my email address. I don’t want Scoble downloading it and giving it over to Plaxo, a brand and company I will never, never trust and would never choose to do business with or hand data to on my own. So much of the reaction to this little incident gets it backwards; there has been much talk about how we should be able to get our data out of Facebook and that’s fine but we also need to protect our data from others making use of it without our permission and that’s what this is about in the end.
There’s a reason that I have set my privacy to avoid these things–in addition to defriending everyone I don’t actually know and trust. I don’t want people knowing where I live (as I’ve received death threats, prank calls, and various harassments that are more trouble to sort out than to just avoid). I don’t want them knowing my email, phone number, or birthday. And I certainly would get pissed off to see someone harvesting them en masse. As I wrote in Cornell violates mass student privacy, “Taken one-by-one, this kind of directory information is completely useless and publicly available. But when taken in aggregate form, the contact information is a secret.”
So, in mass-downloading his Facebook friends’ information, Scoble violated the Terms of Service, the implicit trust relationships he had with his Facebook friends, their privacy, and their identities. Now he claims that the information will be removed after their tests are finished, but at this point it’s too late. The cat (our identities) is out of the bag.
p.s., Techcrunch agrees as well…
In a three-part rant about peer-to-peer technologies (1, 2, 3), Mark Cuban demands that peer-to-peer technologies “die a quick death” in order to “speed up [his own] internet connection.” He suggests that “Google Video is a far better solution for audio and video distribution than any P2P solution” and that cable companies “charge for upstream bandwidth usage.”
Guess what–I already get charged for all the bandwidth I use, either up or down. When Verizon strings a fiberoptic cable to my home, I’m getting a certain amount of fixed capacity into the greater internet at large. If I want to trade a little upstream capacity for greater downstream capacity, that’s my call! Have you ever noticed that downloading over http is typically slow because there are 100s of clients and 1 host? If I download the same information over bittorrent, I can sustain 12Mbs because everyone is a server–including me. Distributed protocols, such as the ones powering Amazon Dynamo or bittorrent, are more efficient, cost effective, and fault tolerant than single-server models.
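To put rough numbers on the one-host-versus-swarm argument, here is a toy model in Python. It is only a sketch: the capacities are made-up illustrative figures, every peer is assumed to upload at the same rate, and download caps and protocol overhead are ignored.

```python
def http_rate(server_up_mbps, clients):
    """Per-client rate when one host splits its upstream evenly."""
    return server_up_mbps / clients


def swarm_rate(seed_up_mbps, peer_up_mbps, peers):
    """Per-peer rate when every peer also contributes upstream capacity."""
    total_upstream = seed_up_mbps + peer_up_mbps * peers
    return total_upstream / peers


# Hypothetical figures: a 100 Mbps host, 200 downloaders, 1 Mbps upstream each.
print(http_rate(100, 200))      # one host serving everyone
print(swarm_rate(100, 1, 200))  # same host seeding a swarm
```

In this model the swarm gets faster per peer as soon as the peers contribute anything at all, which is the whole point of trading a little upstream for downstream.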
Reactions around the blogosphere indicate that Mark Cuban’s thoughts on P2P are nonsensical rubbish. Mashable calls him “a guy who does not understand how P2P works, and yet he wants it shut down.” Ars Technica notes that “if users who are currently saturating their connections with BitTorrent start saturating their connections with Google Video content, the end result is more or less the same.” And a slashdotter comments, “Just imagine how fast the internet would be if there were no content to view. After P2Ps gone, get rid of all these freeloading websites, emails, etc. and it will be blisteringly fast.”
My guess is that billionaire Mark Cuban has a slow, shared cable internet connection at home, the modern equivalent of a party line. This might lead him to confuse his own slow internet connection with a greater systemic problem. What he should be complaining about is why Verizon hasn’t strung fiber in his area yet.
Today I had the pleasure of a random guy in Mexico recursively downloading as much of my site as he could, which sent my CPU load to 2.0, a level that Dreamhost would find acceptable but which I personally freak out about. The r-dns and IP of this guy are:
He started at 04/Nov/2007:12:04:36 and ended (by iptables ban) at 04/Nov/2007:20:17:03. In those roughly 8 hours and 12 minutes, he made over 250,000 requests. That’s an extra 8.5 requests per second from a single IP, which is clearly unacceptable behavior:
[root@fc624389 ~]# cat access_log | grep 126.96.36.199 | wc -l
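The requests-per-second figure is easy to double-check. This is a minimal Python sketch that uses the two timestamps quoted above and the rounded 250,000 count; the function name is my own, and it assumes an English locale for parsing the month name.

```python
from datetime import datetime

# Timestamp layout used in the Apache access log (time zone left out).
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S"


def requests_per_second(first_seen, last_seen, total_requests):
    """Average request rate between two log timestamps."""
    start = datetime.strptime(first_seen, LOG_TIME_FORMAT)
    end = datetime.strptime(last_seen, LOG_TIME_FORMAT)
    elapsed_seconds = (end - start).total_seconds()
    return total_requests / elapsed_seconds


rate = requests_per_second(
    "04/Nov/2007:12:04:36", "04/Nov/2007:20:17:03", 250_000
)
print(f"{rate:.1f} requests per second")
```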
If you don’t believe me, the next biggest offender over the last 24 hours made only 4,400 requests:
[root@fc624389 ~]# cat access_log | cut -d' ' -f1 | sort -n | uniq -c | sort -nr | more
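The same per-IP tally can be sketched in Python if you would rather not chain shell tools. The function name and the sample lines are hypothetical (the addresses come from the reserved documentation ranges), and it assumes the client address is the first space-separated field on each line, as in the log line shown in this post.

```python
from collections import Counter


def top_clients(log_lines, limit=10):
    """Count requests per client address, busiest first."""
    counts = Counter()
    for line in log_lines:
        client = line.split(" ", 1)[0].strip()
        if client:  # skip blank lines
            counts[client] += 1
    return counts.most_common(limit)


# Made-up sample lines standing in for a real access log.
sample = [
    '192.0.2.1 - - [04/Nov/2007:12:04:38 -0500] "GET / HTTP/1.1" 200 467',
    '192.0.2.1 - - [04/Nov/2007:12:04:39 -0500] "GET /a HTTP/1.1" 200 120',
    '198.51.100.7 - - [04/Nov/2007:12:05:00 -0500] "GET / HTTP/1.1" 200 467',
]
for client, hits in top_clients(sample):
    print(hits, client)
```

To run it against a real log, pass an open file instead of the sample list, since iterating over a file yields its lines.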
The user agent of this guy doesn’t tell *me* anything about him, but maybe one of you readers has an idea?
188.8.131.52 - - [04/Nov/2007:12:04:38 -0500] "GET /wp-content/themes/greenmarinee/images/links_bullet.gif HTTP/1.1" 200 467 "http://celebrity-photos.elliottback.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Media Center PC 3.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"
Another thing that bugs me is he requested each URL about 7 times. WTF? Do you really need to spider my site as fast as you can seven times?
[root@fc624389 ~]# cat access_log | grep 184.108.40.206 | cut -d' ' -f11 | sort | uniq | wc -l
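One way to estimate how many times each URL was fetched is to divide a client's total requests by the number of distinct paths it asked for. The Python sketch below does that. Note the assumption: it keys on the requested path (the seventh space-separated field in a log line like the one above), which is my choice for this illustration and not necessarily the same field the shell pipeline counts. The function name and sample data are hypothetical.

```python
def repeat_factor(log_lines, client_ip):
    """Average number of requests per distinct path for one client."""
    total = 0
    paths = set()
    for line in log_lines:
        fields = line.split(" ")
        # Need at least: ip, -, -, [date, tz], "METHOD, path
        if len(fields) < 7 or fields[0] != client_ip:
            continue
        total += 1
        paths.add(fields[6])
    if not paths:
        return 0.0
    return total / len(paths)


# Made-up sample: one client fetching two paths, three requests in total.
sample_log = [
    '192.0.2.1 - - [04/Nov/2007:12:04:38 -0500] "GET /a.gif HTTP/1.1" 200 467',
    '192.0.2.1 - - [04/Nov/2007:12:04:39 -0500] "GET /a.gif HTTP/1.1" 200 467',
    '192.0.2.1 - - [04/Nov/2007:12:04:40 -0500] "GET /b.gif HTTP/1.1" 200 300',
    '198.51.100.7 - - [04/Nov/2007:12:05:00 -0500] "GET /a.gif HTTP/1.1" 200 467',
]
print(repeat_factor(sample_log, "192.0.2.1"))
```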
I am thinking of either writing a very evil script to confuse non-google/msn/live/ask/yahoo bots by writing an infinite number of invisible links into my websites, or installing some kind of mod_throttle into my Apache. It looks like mod_limitipconn might help here, too.
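For a feel of what per-IP throttling means, here is a minimal sliding-window sketch in Python. To be clear, this is not how mod_throttle or mod_limitipconn work internally, and the class name and limits are made up for illustration. It only shows the basic idea of refusing a client once it exceeds a request budget within a time window.

```python
from collections import defaultdict, deque


class PerClientThrottle:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # client -> recent request times

    def allow(self, client_ip, now):
        recent = self._recent[client_ip]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # over budget: reject without recording
        recent.append(now)
        return True


# Hypothetical limit: 3 requests per second per client.
throttle = PerClientThrottle(max_requests=3, window_seconds=1.0)
decisions = [throttle.allow("192.0.2.1", t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)]
print(decisions)
```

At 8.5 requests per second, a crawler like the one above would be turned away most of the time under a budget like this, while an ordinary visitor would never notice it.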