Wikipedia Statistics From WikiCharts
If you want to know what’s hot on Wikipedia, check out WikiCharts. The new tool shows the most-viewed articles on the English Wikipedia, but it is still in testing and may report wrong results. Here’s what it claims are this month’s top 20 articles:
1831000 3% 4.2065% 1. Main Page
62250 14% 0.1430% 2. Wikipedia
52500 15% 0.1206% 3. United States
49750 16% 0.1143% 4. JonBenet Ramsey
39500 17% 0.0907% 5. List of big-bust models and performers
36500 18% 0.0839% 6. Hurricane Katrina
36250 19% 0.0833% 7. Irukandji jellyfish
35250 18% 0.0810% 8. Pluto
33250 19% 0.0764% 9. Wiki
30750 20% 0.0706% 10. Jeff Hardy
30500 20% 0.0701% 11. List of sex positions
30000 19% 0.0689% 12. World Wrestling Entertainment roster
29750 20% 0.0683% 13. List of female porn stars
28750 20% 0.0660% 14. Wii
27000 21% 0.0620% 15. Pokemon
25750 21% 0.0592% 16. Pornography
24500 22% 0.0563% 17. Neighbours
24250 22% 0.0557% 18. Celebrity sex tape
22500 24% 0.0517% 19. Volkswagen Type 2
22500 24% 0.0517% 20. Priyanka Chopra
My only question–what are “Irukandji jellyfish” and why are they so popular? Can you EAT them?!
The Sketchier Side of This Domain
A commenter on Scoble’s post complained about four of my sites, calling them spammy and auto-generated. Ignoring some off-topic commentary about me as a person, Matt wrote:
I wouldn’t blame Scoble that much, Elliot’s [sic] homepage links to a number of really spammy looking things:
vioxx.elliottback.com/ (that’s the worst)
msn-icons.elliottback.com/main.php
credit-card-information.elliottback.com/
celebrity-photos.elliottback.com/
He seems to be trying to automate the creation of a ton of content pages, take advantage of WP’s natural search engine advantage, and then use the trust from his domain (from the software he writes) to cash in via the really obnoxious adsense everywhere. Google seems to have indexed almost a million pages on his site. Probably not the type of content that Google wants their ads next to, though.
To address these concerns, I’m going to sort all the subdomains and properties I own into four categories: Respectable Blogs, Niche Blogs, Online Tools, and Automatic Experiments. You might be surprised by the breakdown: most of the websites that I am toying with are not automated in any way.
Respectable Blogs
These are hand-written, original blogs on a variety of topics. While you might not find gossip and celebrity photos interesting, our editors do their best to populate them with interesting commentary and posts:
- Asia Blog
- Books Blog
- Business School Blog
- Eric Back’s Blog
- Gadgets Blog
- Elliott Back dot COM
- Video Games
Niche Blogs
These are blogs which I write content for, not because I love them, but to earn revenue. They cover topics I find interesting enough to create original content and share ideas for, but they are not my way of expressing myself. In other words, these blogs are just business.
- COMS 482 Notes
- Cornell University Blog
- Gold’s Gym Sucks
- The Gossip Rag Blog
- Hot Celebrity Photos
- University Tours
Online Tools
Every now and then, I get a crazy idea. I want to try something out–like a new platform for photo sharing via Gallery 2 (the MSN Icons site) or how to parse credit card information (the CC site). So, I build a site, plaster it with ads when I’m done, and see what comes of it. These are just fun projects for me, toys to play with. No one visits them, and I hardly make revenue off them.
Automatic Experiments
I have three ongoing experiments in automation. The first is WP-Autoblog, which I am using to syndicate posts from search engines on Vioxx, essentially turning my site into a meta-search engine on that topic. I’m using attributed excerpts to avoid any legal or ethical issues. The second project is Eye My Spam, a blog that goes straight from email to blog post without any filtering. Since no one uses the email address for communication with me, that blog essentially posts spam from my inbox straight to the web, which is useful for archival and public information sharing purposes. Now you can google a piece of email and confirm that yes, it is indeed spam. The third project is the unreleased Infinite Tree project, which I’m still working on. Basically, it’s just an aggregator based around keywords.
Conclusion
I hope this helps you readers sort out exactly what I do: harmless dabbling, some serious blogging, and some for-profit stuff. I’m not interested in blog automation research that hurts anyone. My policy on internet techniques is not to be jealous of someone else’s software or business model, provided it falls within the law, but rather to be open to changes in the way people view the web. Is a syndicator dangerous? Yes. Does it provide a paradigm shift when used correctly? Yes. That’s why I wrote WP-Autoblog: to give people control over content and sourcing. It can be abused, but it can also be used to create useful directories of links, or to create a meta-blog of blogs you manage.
AOL Gate: Search Query Data Scandal
TechCrunch notes that AOL has released a file containing 20,000,000 queries from “anonymized” users. This is a problem because anything those users typed into AOL search (social security numbers, names, drug deals, etc.) can be cross-correlated to expose their identities. Imagine a politician ego-searching and then browsing Asian pornography. The scandal would be just beginning.

AOL smartly took down the download link, but once released on the web, it will always be on the web. To that end, we’re hosting the data here on our bandwidth-limited downloads platform: AOL-data.tgz. If you get in, you should get a decently fast speed.
According to Adam D’Angelo, the reason AOL published the data was for recognition in the search-engine research arena:
This was not a leak – it was intentional. In their desperation to gain recognition from the research community, AOL decided they would compromise their integrity to provide a data set that might become often-cited in research papers: “Please reference the following publication when using this collection: G. Pass, A. Chowdhury, C. Torgeson, ‘A Picture of Search’ The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.” is the message before the download.
Here’s a breakdown of the core facts:
- 20,000,000 queries from 650,000 users in 2GB uncompressed tab-delimited files
- Uncensored queries for three months of AOL search service, spring 2006
- Essentially public domain
- Contains dangerous private information
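To make the cross-correlation risk concrete, here is a rough Python sketch that groups queries by user ID. It assumes each line holds tab-separated fields in the order anonid, query, querytime, itemrank, clickurl, and that any header row starts with a non-numeric field. The function name and the sample rows are made up for illustration and are not taken from the real file.

```python
import io
from collections import defaultdict


def queries_by_user(lines):
    """Group queries by anonid, keeping the order in which they appear."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank lines, short lines, and any header row
        # (assumed to start with a non-numeric field).
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        grouped[int(fields[0])].append(fields[1])
    return dict(grouped)


# Made-up sample rows in the assumed layout, not real data.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "101\tpizza delivery springfield\t2006-03-01 10:00:00\t\t\n"
    "202\tcheap flights\t2006-03-01 10:05:00\t1\thttp://example.com\n"
    "101\tjane doe springfield\t2006-03-02 09:30:00\t\t\n"
)

if __name__ == "__main__":
    for anonid, queries in queries_by_user(io.StringIO(SAMPLE)).items():
        print(anonid, queries)
```

Once every query from one ID sits side by side, a name search next to a hometown search is often all it takes to guess who the “anonymous” user is.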
Update
The data is rife with all kinds of personally identifiable data. For example, a quick grep for credit-card patterns produces the following:
grep -i -e "[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}" *.txt
- 9006-0512-xxxx-xxx
- 1550-0905-xxxx-xxxx
Looking for Social Security Numbers (SSN) turns up this HUGE amount of data:
grep -i -e "\b[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}\b" *.txt
- kristy nicole vega hammond la. social secruity number 437-67-xxxx birth date 03 08 xx drivers license number la. 00765xxxx address 41178 rene dr. hammond la.
- pamela button 079-60-xxxx
- thomas j finney socsec 370-40-xxxx
- 419-94-xxxx thomas black
- 458-87-xxxx seguro social
- social security number 545-29-xxxx
- ssn 436-47-xxxx
I’ve censored the personal information, but there are about 200 entries of social security numbers in the test data. Searching for things that look like email addresses ([a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.) turns up another 60 or so.
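If you would rather run all three checks in one pass, here is a minimal Python sketch. The card and SSN patterns are straight translations of the grep expressions above. The email pattern uses + instead of * so that a bare "@." does not count, which is my own tweak. The function names and the redaction format are hypothetical, and the sample strings in the test are obviously fake.

```python
import re

# Patterns translated from the grep expressions in this post.
PATTERNS = {
    "card": re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}"),
    "ssn": re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b"),
    # Assumption: '+' instead of '*' so an empty local part does not match.
    "email": re.compile(r"[a-zA-Z0-9_\-]+@[a-zA-Z0-9_\-]+\."),
}


def find_sensitive(query):
    """Return the names of every pattern that matches the query."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(query)]


def redact(query):
    """Mask the trailing digit groups of card numbers and SSNs with 'x'."""
    query = re.sub(
        r"([0-9]{4}-[0-9]{4})-[0-9]{4}-[0-9]{4}", r"\1-xxxx-xxxx", query
    )
    query = re.sub(r"\b([0-9]{3}-[0-9]{2})-[0-9]{4}\b", r"\1-xxxx", query)
    return query


if __name__ == "__main__":
    for sample in ["ssn 123-45-6789", "card 1234-5678-9012-3456"]:
        print(find_sensitive(sample), redact(sample))
```

These patterns only catch numbers typed with dashes, so treat the counts as a lower bound.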
Update 2:
If you want to get this data into a more usable form, say MySQL, try this (note that we’re not going to bother storing duplicate queries, but you might want to):
mysql> CREATE TABLE aoldata (anonid int unsigned not null, query varchar(255), querytime datetime, itemrank int unsigned, clickurl varchar(255), PRIMARY KEY(anonid, query));
Then you just need to import it, as appropriate:
LOAD DATA LOCAL INFILE 'user-ct-test-collection-01.txt'
INTO TABLE aoldata
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(anonid, query, querytime, itemrank, clickurl);
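If you want to see what the (anonid, query) primary key throws away before you import, here is a small Python sketch that does the same de-duplication on the raw lines. It assumes the five tab-separated columns from the import above, skips any row whose first field is not numeric (such as a header), and keeps the first row for each key. The function name and sample lines are made up, and the exact string comparison here may not match MySQL’s collation rules.

```python
def dedupe_rows(lines):
    """Yield 5-field tuples, keeping the first row per (anonid, query)."""
    seen = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank lines, short lines, and any non-numeric header row.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # Pad missing trailing columns so every row has five fields.
        fields = (fields + [""] * 5)[:5]
        key = (fields[0], fields[1])
        if key in seen:
            continue
        seen.add(key)
        yield tuple(fields)


# Made-up sample lines in the assumed layout, not real data.
SAMPLE_LINES = [
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n",
    "101\tpizza delivery\t2006-03-01 10:00:00\t\t\n",
    "101\tpizza delivery\t2006-03-01 10:02:00\t1\thttp://example.com\n",
    "202\tpizza delivery\t2006-03-01 11:00:00\n",
]

if __name__ == "__main__":
    for row in dedupe_rows(SAMPLE_LINES):
        print(row)
```

Note that this drops repeat searches and clicks by the same user, so skip it if you care about how often someone ran a query.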
Other Blogs
Paul notes that the AOL data is really Google data, since AOL search is rebranded Google. Zoli has the post that started it all.