Elliott C. Back: Technology FTW!

Inside Elite P2P Filesharing Networks

Posted in Computers & Technology, Copyright, Cornell University, Law, Scandal, bit torrent, bittorrent by Elliott Back on September 1st, 2006.

An Introduction

You’ve heard that private file sharing networks exist, but you’ve probably never had a chance to explore one from the inside. These networks of software, music, television, and movie pirates are often run on the internal network infrastructure of private educational institutions. Because a university network has a fixed set of IP addresses, college pirates can run DC++ and write simple scripts that only allow users from the internal IP pool, or even just the residential dormitory pool. This prevents unwanted interference (RIAA, MPAA, police) by simply making the network invisible to the outside world. Also, most university networks are lightly saturated high-speed Ethernet, giving student pirates the bandwidth to share large files.
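The kind of IP pool check described above can be sketched in a few lines. This is only an illustration, not the script any hub actually used: the CIDR ranges and the `is_internal` helper are hypothetical placeholders, and Python stands in for whatever scripting language a hub operator might pick.

```python
import ipaddress

# Hypothetical allowlist: placeholder private ranges, not a real campus pool.
ALLOWED_POOLS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_internal(address: str) -> bool:
    """Return True if the address falls inside any allowed pool."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        # Malformed addresses are treated as outsiders.
        return False
    return any(ip in pool for pool in ALLOWED_POOLS)


if __name__ == "__main__":
    for candidate in ("10.1.2.3", "8.8.8.8", "not-an-ip"):
        print(candidate, is_internal(candidate))
```

A hub would run a check like this on each connecting client and drop anyone outside the pool, which is what keeps the network invisible from off campus.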

While I attended Cornell University, students there ran a large DC++ hub to share files. The hub had anywhere between 1000 and 2000 users and provided access to terabytes of shared files. Before I left the University to work, I transferred a complete set of users’ file lists to my home computer for later analysis. With 1215 XML file lists from DC++, I wrote a few Perl scripts to calculate metrics on the 600 MB data set.
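For readers who want to try something similar, here is a minimal sketch of summarizing one file list. It is written in Python rather than the Perl I actually used, and it rests on two assumptions: that each list is XML with nested `Directory` elements containing `File` elements carrying `Name`, `Size`, and `TTH` attributes, and that two files count as the same when their `TTH` hash matches (falling back to name plus size). The `summarize_file_list` helper and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def summarize_file_list(xml_text: str) -> dict:
    """Count total and unique files and bytes in one file list."""
    root = ET.fromstring(xml_text)
    total_files = 0
    total_bytes = 0
    unique = {}  # identity key -> size in bytes
    for node in root.iter("File"):
        size = int(node.get("Size", "0"))
        # Assumption: the TTH hash identifies a file; otherwise use name + size.
        key = node.get("TTH") or (node.get("Name", "").lower(), size)
        total_files += 1
        total_bytes += size
        unique.setdefault(key, size)
    return {
        "total_files": total_files,
        "total_bytes": total_bytes,
        "unique_files": len(unique),
        "unique_bytes": sum(unique.values()),
    }


# Made-up sample list with one duplicated file.
SAMPLE = """<FileListing Version="1" Base="/">
  <Directory Name="Music">
    <File Name="a.mp3" Size="100" TTH="AAA"/>
    <File Name="copy of a.mp3" Size="100" TTH="AAA"/>
    <Directory Name="Sub">
      <File Name="b.avi" Size="500" TTH="BBB"/>
    </Directory>
  </Directory>
</FileListing>"""

if __name__ == "__main__":
    print(summarize_file_list(SAMPLE))
```

To compute totals across every user, the same dictionary of unique keys would be shared between all of the lists instead of being rebuilt for each one.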

Interestingly, the DC++ hub appears to still be around at its old redirect address thchub.no-ip.com:3307. Apparently a student r253141224 is hosting the service on his dorm computer 128.253.141.224.

Data From 20,000 Feet

From the file lists I have, there were 2,456,462 unique files, 5,424,446 total files, 19.07 unique terabytes, and 75.55 total terabytes. Here’s a histogram and data listing of the most popular file types:

[Figure: histogram of the most popular file types]

mp3	1857432
jpg	828815
m4a	312173
png	264820
gif	224034
avi	203304
dll	133889
wma	116851
htm	82130
zip	79114
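A tally like the one above takes only a few lines. This is a rough Python sketch, not the original Perl; the `count_extensions` helper is hypothetical, and it assumes extensions are compared case-insensitively and that names without an extension are skipped.

```python
import os
from collections import Counter


def count_extensions(file_names):
    """Tally lower-cased file extensions, ignoring names that have none."""
    counts = Counter()
    for name in file_names:
        ext = os.path.splitext(name)[1].lower().lstrip(".")
        if ext:
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    # Made-up file names for illustration.
    names = ["a.MP3", "b.mp3", "c.jpg", "README"]
    for ext, count in count_extensions(names).most_common(10):
        print(f"{ext}\t{count}")
```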

The file types follow a classic long-tail distribution. The data set also lets us ask more interesting questions. For example, among avi movie files, what were the most popular file names? Here’s the top 20:

crash.avi	90
pulp fiction.avi	76
garden state.avi	74
office space.avi	74
good will hunting.avi	72
wedding crashers.avi	67
sin city.avi	66
lost - 2x05 - ...and found.avi	65
super troopers.avi	63
zoolander.avi	60
robin hood - men in tights.avi	59
lost - 2x09 - what kate did.avi	58
eternal sunshine of the spotless mind.avi	57
lost - 2x04 - everybody hates hugo.avi	57
memento.avi	57
american beauty.avi	55
batman begins.avi	55
mean girls.avi	55
lost - 2x07 - the other 48 days.avi	54
old school.avi	54
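A ranking like this one can be produced by filtering on the extension and counting names. The sketch below is in Python, not the original Perl, and it assumes names are lower-cased before counting so that differently capitalized copies of the same title are grouped together; the `top_names` helper is hypothetical.

```python
from collections import Counter


def top_names(file_names, extension, n=20):
    """Return the n most common lower-cased names ending in .extension."""
    suffix = "." + extension.lower()
    counts = Counter(
        name.lower() for name in file_names if name.lower().endswith(suffix)
    )
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up file names for illustration.
    names = ["Crash.avi", "crash.AVI", "Memento.avi", "song.mp3"]
    for name, count in top_names(names, "avi", n=2):
        print(f"{name}\t{count}")
```

Exact name matching is a blunt tool: the same movie under a slightly different name counts as a different file, so the true popularity of each title is probably higher than these numbers suggest.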

We can take advantage of common patterns in the data to try to find other patterns, but I’ll save that for another day, and another post in what will undoubtedly become a series.

 
