Elliott C. Back: Internet & Technology

Wikipedia Statistics From WikiCharts

Posted in Computers & Technology, SEO, Web 2.0 by Elliott Back on August 30th, 2006.

If you want to know what’s hot on Wikipedia, check out WikiCharts. The new tool shows the most-viewed articles on the English Wikipedia, but it is still in testing and may report wrong results. Here’s what it claims are this month’s top 20 articles:

1831000  3% 	4.2065% 	1. Main Page
62250  14% 	0.1430% 	2. Wikipedia
52500  15% 	0.1206% 	3. United States
49750  16% 	0.1143% 	4. JonBenet Ramsey
39500  17% 	0.0907% 	5. List of big-bust models and performers
36500  18% 	0.0839% 	6. Hurricane Katrina
36250  19% 	0.0833% 	7. Irukandji jellyfish
35250  18% 	0.0810% 	8. Pluto
33250  19% 	0.0764% 	9. Wiki
30750  20% 	0.0706% 	10. Jeff Hardy
30500  20% 	0.0701% 	11. List of sex positions
30000  19% 	0.0689% 	12. World Wrestling Entertainment roster
29750  20% 	0.0683% 	13. List of female porn stars
28750  20% 	0.0660% 	14. Wii
27000  21% 	0.0620% 	15. Pokemon
25750  21% 	0.0592% 	16. Pornography
24500  22% 	0.0563% 	17. Neighbours
24250  22% 	0.0557% 	18. Celebrity sex tape
22500  24% 	0.0517% 	19. Volkswagen Type 2
22500  24% 	0.0517% 	20. Priyanka Chopra
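
As a rough sanity check on these numbers: if the third column is each page's share of all sampled views (that reading is an assumption; WikiCharts may define it differently), then dividing a page's view count by its share should give about the same total for every row. A minimal Python sketch using four rows from the table:

```python
# Rows copied from the table above: (title, views, share in percent).
# Assumption: the percentage is the page's share of all sampled views.
ROWS = [
    ("Main Page", 1831000, 4.2065),
    ("Wikipedia", 62250, 0.1430),
    ("United States", 52500, 0.1206),
    ("Pluto", 35250, 0.0810),
]


def implied_total(views: int, share_percent: float) -> float:
    """Total view count implied by one row: views / (share / 100)."""
    return views / (share_percent / 100.0)


if __name__ == "__main__":
    for title, views, share in ROWS:
        total_millions = implied_total(views, share) / 1e6
        print(f"{title}: about {total_millions:.1f} million views in total")
```

If the assumption holds, every row should land at roughly 43.5 million views, which suggests the figures are at least internally consistent.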

My only question–what are “Irukandji jellyfish” and why are they so popular? Can you EAT them?!

The Sketchier Side of This Domain

Posted in Adsense, Blogging, Computers & Technology, My Blog, SEO, Search, Spam by Elliott Back on August 28th, 2006.

A commenter on Scoble’s post complained about several of my sites, calling them spammy and auto-generated. Setting aside some off-topic commentary on me as a person, Matt wrote:

I wouldn’t blame Scoble that much, Elliot’s [sic] homepage links to a number of really spammy looking things:

vioxx.elliottback.com/ (that’s the worst)

msn-icons.elliottback.com/main.php

credit-card-information.elliottback.com/

celebrity-photos.elliottback.com/

universities.elliottback.com/

He seems to be trying to automate the creation of a ton of content pages, take advantage of WP’s natural search engine advantage, and then use the trust from his domain (from the software he writes) to cash in via the really obnoxious adsense everywhere. Google seems to have indexed almost a million pages on his site. Probably not the type of content that Google wants their ads next to, though.

To address these concerns, I am about to give a breakdown of all the subdomains and properties I own into four categories: Respectable Blogs, Niche Blogs, Online Tools, and Automatic Experiments. You might be surprised by the breakdown–most of the websites that I am toying with are not automated in any way.

Respectable Blogs

These are hand-written, original blogs on a variety of topics. While you might not find gossip and celebrity photos interesting, our editors do their best to fill these blogs with worthwhile commentary and posts:

Niche Blogs

These are blogs which I write content for, not because I love them, but to earn revenue. They cover topics I find interesting enough to create original content and share ideas for, but they are not my way of expressing myself. In other words, these blogs are just business.

Online Tools

Every now and then, I get a crazy idea. I want to try something out–like a new platform for photo sharing via Gallery 2 (the MSN Icons site) or how to parse credit card information (the CC site). So, I build a site, plaster it with ads when I’m done, and see what comes of it. These are just fun projects for me, toys to play with. No one visits them, and I make hardly any revenue from them.

Automatic Experiments

I have three ongoing experiments into automation. The first is WP-Autoblog, which I am using to syndicate posts from search engines on Vioxx, essentially turning my site into a meta-search engine on that topic. I’m using attributed excerpts to avoid any legal or ethical issues. The second project is Eye My Spam, a blog that goes straight from email to blog post without any filtering. Since no one uses the email address for communication with me, that blog essentially posts spam from my inbox straight to the web, useful for archival and public information sharing purposes. Now you can google a piece of email and see yes, it is indeed spam. The third project is the unreleased Infinite Tree project, which I’m still working on. Basically, it’s just an aggregator based around keywords.

Conclusion

I hope this helps you readers sort out exactly what I do–harmless dabbling, some serious blogging, and some for-profit stuff. I’m not interested in blog automation research that hurts anyone. My policy on internet techniques is not to be jealous of someone else’s software or business model, provided it falls within the law, but rather to be open to changes in the way people view the web. Is a syndicator dangerous? Yes. Does it provide a paradigm shift when used correctly? Yes. That’s why I wrote WP-Autoblog–to give people control over content and sourcing. It can be abused, but it can also be used to create useful directories of links, or create a meta-blog of blogs you manage.

AOL Gate: Search Query Data Scandal

Posted in AOL, Blogging, Google, Law, SEO, Search, Spam by Elliott Back on August 7th, 2006.

TechCrunch notes that AOL has released a file containing 20,000,000 queries from “anonymized” users. This is a problem because anything those users typed into AOL search–social security numbers, names, drug deals, etc.–can be cross-correlated to expose their identities. Imagine a politician ego-searching and then browsing Asian pornography. The scandal would just be beginning.


AOL smartly took down the download link, but once released on the web, it will always be on the web. To that end, we’re hosting the data here on our bandwidth-limited downloads platform: AOL-data.tgz. If you get in, you should get a decently fast speed.

According to Adam D’Angelo, the reason AOL published the data was for recognition in the search-engine research arena:

This was not a leak – it was intentional. In their desperation to gain recognition from the research community, AOL decided they would compromise their integrity to provide a data set that might become often-cited in research papers: “Please reference the following publication when using this collection: G. Pass, A. Chowdhury, C. Torgeson, ‘A Picture of Search’ The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.” is the message before the download.

Here’s a breakdown of the core facts:

  • 20,000,000 queries from 650,000 users in 2GB uncompressed tab-delimited files
  • Uncensored queries for three months of AOL search service, spring 2006
  • Essentially public domain
  • Contains dangerous private information

Update

The data is rife with all kinds of personally identifiable data. For example, a quick grep for credit-card patterns produces the following:

grep -i -e "[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}" *.txt

  • 9006-0512-xxxx-xxx
  • 1550-0905-xxxx-xxxx
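
Not every 16-digit pattern is a card number. One possible follow-up, sketched here in Python as an illustration rather than anything that was run on the AOL data, is the Luhn checksum, which valid card numbers satisfy. The function name is made up for this example:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    # 4111-1111-1111-1111 is a widely used test number, not a real card.
    for candidate in ("4111-1111-1111-1111", "1234-5678-9012-3456"):
        print(candidate, luhn_valid(candidate))
```

A match that fails the checksum is probably a date range, a phone number, or some other string of digits rather than a card.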

Looking for Social Security Numbers (SSN) turns up this HUGE amount of data:

grep -i -e "\b[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}\b" *.txt

  • kristy nicole vega hammond la. social secruity number 437-67-xxxx birth date 03 08 xx drivers license number la. 00765xxxx address 41178 rene dr. hammond la.
  • pamela button 079-60-xxxx
  • thomas j finney socsec 370-40-xxxx
  • 419-94-xxxx thomas black
  • 458-87-xxxx seguro social
  • social security number 545-29-xxxx
  • ssn 436-47-xxxx

I’ve censored the personal information, but there are about 200 entries of social security numbers in the test data. Searching for things that look like email addresses ([a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.) turns up another 60 or so.
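
For anyone who would rather script the counting than eyeball grep output, here is a minimal Python sketch of the same two searches. It is an approximation, not the exact commands used above: the email pattern uses "+" instead of "*" so that a bare "@." does not count as a match, and the function name and sample lines are invented for illustration.

```python
import re

# Python versions of the SSN and email patterns.
# Assumption: "+" replaces "*" in the email pattern to avoid empty matches.
SSN_RE = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")
EMAIL_RE = re.compile(r"[a-zA-Z0-9_\-]+@[a-zA-Z0-9_\-]+\.")


def count_matches(lines):
    """Return (ssn_lines, email_lines): how many lines match each pattern."""
    ssn_lines = 0
    email_lines = 0
    for line in lines:
        if SSN_RE.search(line):
            ssn_lines += 1
        if EMAIL_RE.search(line):
            email_lines += 1
    return ssn_lines, email_lines


if __name__ == "__main__":
    # Made-up sample lines, not taken from the AOL files.
    sample = [
        "ssn 000-00-0000",
        "write to someone@example.com",
        "cheap flights to denver",
    ]
    print(count_matches(sample))
```

To run it over the real files, pass an open file object instead of the sample list, since iterating over a file yields its lines.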

Update 2:

If you want to get this data into a more usable form, say MySQL, try this (note that we’re not going to bother storing duplicate queries, but you might want to):

mysql> CREATE TABLE aoldata (anonid int unsigned not null, query varchar(255), querytime datetime, itemrank int unsigned, clickurl varchar(255), PRIMARY KEY(anonid, query));

Then you just need to import it, as appropriate:

LOAD DATA LOCAL INFILE 'user-ct-test-collection-01.txt'
INTO TABLE aoldata
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(anonid, query, querytime, itemrank, clickurl);
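
If you don't have MySQL handy, roughly the same idea can be sketched in plain Python: read the tab-delimited lines and keep one row per (anonid, query) pair, mirroring the primary key above. This is only an approximation of what the database does (it compares queries case-sensitively and never truncates them), the function name is hypothetical, and it assumes the files may begin with a header row, which it skips.

```python
import csv

# Column names taken from the CREATE TABLE statement above.
COLUMNS = ("anonid", "query", "querytime", "itemrank", "clickurl")


def load_unique(lines):
    """Parse tab-delimited lines, keeping the first row per (anonid, query)."""
    seen = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip blank lines and anything whose first field is not a number,
        # such as a header row (assumption: the files may start with one).
        if not fields or not fields[0].isdigit():
            continue
        # Pad short rows in case the trailing fields are missing.
        fields = (fields + [""] * len(COLUMNS))[: len(COLUMNS)]
        row = dict(zip(COLUMNS, fields))
        seen.setdefault((row["anonid"], row["query"]), row)
    return list(seen.values())


if __name__ == "__main__":
    # Made-up rows for illustration only.
    demo = [
        "1\tfoo\t2006-03-01 00:00:00\t1\thttp://example.com",
        "1\tfoo\t2006-03-02 00:00:00",
        "2\tbar\t2006-03-03 00:00:00",
    ]
    for row in load_unique(demo):
        print(row)
```

For the real data you would call load_unique on an open file, keeping in mind that the whole result sits in memory.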

Other Blogs

Paul notes that the AOL data is really Google data, since AOL search is rebranded Google. Zoli has the post that started it all.
