AOL Gate: Search Query Data Scandal
TechCrunch notes that AOL has released a file containing 20,000,000 queries from “anonymized” users. This is a problem because anything those users typed into AOL search (social security numbers, names, drug deals, etc.) can be cross-correlated to expose their identities. Imagine a politician ego-searching, then browsing Asian pornography. The scandal would just be beginning.

AOL smartly took down the download link, but once something is released on the web, it stays on the web. To that end, we’re hosting the data here on our bandwidth-limited downloads platform: AOL-data.tgz. If you get in, you should get a decently fast speed.
According to Adam D’Angelo, the reason AOL published the data was for recognition in the search-engine research arena:
This was not a leak - it was intentional. In their desperation to gain recognition from the research community, AOL decided they would compromise their integrity to provide a data set that might become often-cited in research papers: “Please reference the following publication when using this collection: G. Pass, A. Chowdhury, C. Torgeson, ‘A Picture of Search’ The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.” is the message before the download.
Here’s a breakdown of the core facts:
- 20,000,000 queries from 650,000 users in 2GB uncompressed tab-delimited files
- Uncensored queries for three months of AOL search service, spring 2006
- Essentially public domain
- Contains dangerous private information
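Part of what makes the private information dangerous is that every query is tied to a per-user ID, so one identifying query can expose everything else that user searched for. Here is a rough Python sketch of that grouping. It assumes the user ID is the first tab-separated column and the query text is the second (the column order used in the import statement later in this post); the function name `queries_by_user` is made up for illustration.

```python
from collections import defaultdict


def queries_by_user(lines):
    """Group query strings by the anonymized ID in the first tab-separated column."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and anything whose first column is not a numeric ID
        # (for example a header row, if the file has one).
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        grouped[fields[0]].append(fields[1])
    return dict(grouped)
```

Reading one user's list top to bottom is all the "cross-correlation" it takes: a name in one query, a street in another, and the ID stops being anonymous.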
Update
The data is rife with all kinds of personally identifiable data. For example, a quick grep for credit-card patterns produces the following:
grep -i -e '[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}' *.txt
- 9006-0512-xxxx-xxxx
- 1550-0905-xxxx-xxxx
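If you would rather not have full numbers scroll past on your terminal, the same search can be done with masking built in. This is only a sketch in Python, not what was run to produce the results above: it uses the same four-groups-of-four-digits pattern, and the helper names `mask_card` and `find_masked_cards` are invented for this example.

```python
import re

# Same shape as the grep pattern: four groups of four digits joined by hyphens.
CARD_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}")


def mask_card(match):
    """Keep the first two groups and replace the last two with x's."""
    groups = match.group(0).split("-")
    return "-".join(groups[:2] + ["xxxx", "xxxx"])


def find_masked_cards(lines):
    """Yield each line that contains a card-like number, with the number masked."""
    for line in lines:
        if CARD_PATTERN.search(line):
            yield CARD_PATTERN.sub(mask_card, line.rstrip("\n"))
```

Like the grep, this only checks the shape of the number. It makes no attempt to validate that a match is a real card number.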
Looking for Social Security Numbers (SSN) turns up this HUGE amount of data:
grep -i -e '\b[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}\b' *.txt
- kristy nicole vega hammond la. social secruity number 437-67-xxxx birth date 03 08 xx drivers license number la. 00765xxxx address 41178 rene dr. hammond la.
- pamela button 079-60-xxxx
- thomas j finney socsec 370-40-xxxx
- 419-94-xxxx thomas black
- 458-87-xxxx seguro social
- social security number 545-29-xxxx
- ssn 436-47-xxxx
I’ve censored the personal information, but there are about 200 entries of social security numbers in the test data. Searching for things that look like email addresses ([a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.) turns up another 60 or so.
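To get counts like these without printing the matching lines at all, something along these lines should work. It is a sketch that reuses the two patterns quoted above in Python's `re` syntax; the function name `count_matching_lines` is an assumption for illustration, and the counts it gives on the real files have not been checked against the figures above.

```python
import re

# The SSN and email patterns from the text, translated to Python regex syntax.
SSN_PATTERN = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.")


def count_matching_lines(lines):
    """Count how many lines contain an SSN-like or an email-like string."""
    counts = {"ssn": 0, "email": 0}
    for line in lines:
        if SSN_PATTERN.search(line):
            counts["ssn"] += 1
        if EMAIL_PATTERN.search(line):
            counts["email"] += 1
    return counts
```

Note that the email pattern is deliberately loose: because both character classes use `*`, it matches anything with an `@` followed eventually by a dot, so expect some false positives.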
Update 2:
If you want to get this data into a more usable form, say MySQL, try this (note that we’re not going to bother storing duplicate queries, but you might want to):
mysql> CREATE TABLE aoldata (anonid int unsigned not null, query varchar(255), querytime datetime, itemrank int unsigned, clickurl varchar(255), PRIMARY KEY(anonid, query));
Then you just need to import it, as appropriate:
LOAD DATA LOCAL INFILE 'user-ct-test-collection-01.txt'
INTO TABLE aoldata
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(anonid, query, querytime, itemrank, clickurl);
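If you don't have MySQL handy, roughly the same import can be sketched in plain Python. This is a stand-in for the statements above, not an equivalent: it reads the tab-delimited rows and keeps only the first row seen for each (anonid, query) pair, which is my reading of what the composite primary key does during the import. It assumes the five-column layout from the `LOAD DATA` statement and that any header row has a non-numeric first column; `COLUMNS` and `load_unique_rows` are names made up for this example.

```python
import csv
import io

# Column order taken from the LOAD DATA statement above.
COLUMNS = ("anonid", "query", "querytime", "itemrank", "clickurl")


def load_unique_rows(handle):
    """Read tab-delimited rows, keeping the first row per (anonid, query)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = {}
    for fields in reader:
        # Skip blank lines and any row whose first column is not a numeric ID
        # (assumed to be a header row).
        if not fields or not fields[0].isdigit():
            continue
        # Pad short rows so a missing itemrank/clickurl becomes an empty string.
        fields = (fields + [""] * len(COLUMNS))[:len(COLUMNS)]
        record = dict(zip(COLUMNS, fields))
        key = (int(record["anonid"]), record["query"])
        rows.setdefault(key, record)
    return list(rows.values())


# Tiny in-memory example; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    "1\tfoo\t2006-03-01 10:00:00\t\t\n"
    "1\tfoo\t2006-03-01 10:05:00\t1\thttp://example.com\n"
)
print(len(load_unique_rows(sample)))
```

All values stay as strings here. Converting `querytime` to a datetime or `itemrank` to an integer is left out to keep the sketch short.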
Other Blogs
Paul notes that the AOL data is really Google data, since AOL search is rebranded Google. Zoli has the post that started it all.
This entry was posted on Monday, August 7th, 2006 at 3:50 am.
