AOL Gate: Search Query Data Scandal
TechCrunch notes that AOL has released a file containing 20,000,000 queries from “anonymized” users. The problem is that anything those users typed into AOL search (social security numbers, names, drug deals, etc.) can be cross-correlated to expose their identities. Imagine a politician ego-searching and then browsing Asian pornography: the scandal would just be beginning.

AOL smartly took down the download link, but once something is released on the web, it stays on the web. To that end, we’re hosting the data here on our bandwidth-limited downloads platform: AOL-data.tgz. If you get in, you should see a decently fast download speed.
According to Adam D’Angelo, AOL published the data to gain recognition in the search-engine research arena:
This was not a leak - it was intentional. In their desperation to gain recognition from the research community, AOL decided they would compromise their integrity to provide a data set that might become often-cited in research papers: “Please reference the following publication when using this collection: G. Pass, A. Chowdhury, C. Torgeson, ‘A Picture of Search’ The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.” is the message before the download.
Here’s a breakdown of the core facts:
- 20,000,000 queries from 650,000 users in 2GB uncompressed tab-delimited files
- Uncensored queries for three months of AOL search service, spring 2006
- Essentially public domain
- Contains dangerous private information
Update:
The data is rife with all kinds of personally identifiable information. For example, a quick grep for credit-card patterns produces the following:
grep -i -e "[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}" *.txt
- 9006-0512-xxxx-xxxx
- 1550-0905-xxxx-xxxx
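A digit pattern alone does not prove that a hit is a real card number. Payment card numbers end in a Luhn check digit, so a rough filter can throw out matches that fail it. Here is a minimal Python sketch of that check; the function name `luhn_valid` is made up for illustration, and the sample values are a commonly used test number and a deliberately broken variant, not anything taken from the AOL data.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    # Keep only the digits so separators such as "-" are ignored.
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111-1111-1111-1111"))  # test number, passes the check
    print(luhn_valid("4111-1111-1111-1112"))  # last digit changed, fails
```

Passing the checksum only means a number is well-formed, not that it belongs to a live account.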
Looking for Social Security numbers (SSNs) turns up a HUGE amount of data:
grep -i -e "\b[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}\b" *.txt
- kristy nicole vega hammond la. social secruity number 437-67-xxxx birth date 03 08 xx drivers license number la. 00765xxxx address 41178 rene dr. hammond la.
- pamela button 079-60-xxxx
- thomas j finney socsec 370-40-xxxx
- 419-94-xxxx thomas black
- 458-87-xxxx seguro social
- social security number 545-29-xxxx
- ssn 436-47-xxxx
I’ve censored the personal information, but there are about 200 entries containing social security numbers in the test data. Searching for things that look like email addresses ([a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.) turns up another 60 or so.
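If you would rather script these checks than run grep three times, something like the following Python sketch could work. It reuses the three patterns from this post; the names `PATTERNS`, `classify`, and `mask_ssn` are hypothetical, and the sample query is invented rather than taken from the data set.

```python
import re

# The same three patterns used with grep above, in Python regex syntax.
PATTERNS = {
    "card": re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}"),
    "ssn": re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b"),
    "email": re.compile(r"[a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\."),
}


def classify(query: str) -> list[str]:
    """Return the names of every pattern that matches the query text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(query)]


def mask_ssn(query: str) -> str:
    """Replace the last four digits of anything SSN-shaped with xxxx."""
    return re.sub(r"\b([0-9]{3}-[0-9]{2})-[0-9]{4}\b", r"\1-xxxx", query)


if __name__ == "__main__":
    sample = "ssn 123-45-6789"  # invented value, not from the data set
    print(classify(sample))
    print(mask_ssn(sample))
```

Like the grep versions, these patterns only flag text that is shaped like a card number, SSN, or email address, so expect false positives.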
Update 2:
If you want to get this data into a more usable form, say MySQL, try this (note that we’re not going to bother storing duplicate queries, but you might want to):
CREATE TABLE aoldata (anonid int unsigned not null, query varchar(255), querytime datetime, itemrank int unsigned, clickurl varchar(255), PRIMARY KEY(anonid, query));
Then you just need to import it, as appropriate:
LOAD DATA LOCAL INFILE 'user-ct-test-collection-01.txt'
IGNORE INTO TABLE aoldata
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(anonid, query, querytime, itemrank, clickurl);
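To see what skipping duplicates on an (anonid, query) key does to the row count, here is a small Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL, with SQLite's `INSERT OR IGNORE` playing roughly the role that ignoring duplicate keys plays during the MySQL load; the rows are invented and the column types are simplified, so treat it as an illustration rather than the actual import.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption made for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE aoldata (
        anonid    INTEGER NOT NULL,
        query     TEXT NOT NULL,
        querytime TEXT,
        itemrank  INTEGER,
        clickurl  TEXT,
        PRIMARY KEY (anonid, query)
    )
    """
)

# Invented rows; the first two share the same (anonid, query) pair.
rows = [
    (1, "example query", "2006-03-01 10:00:00", None, None),
    (1, "example query", "2006-03-01 10:05:00", 1, "http://example.com"),
    (2, "example query", "2006-03-02 09:30:00", None, None),
]

# INSERT OR IGNORE silently skips rows that would violate the primary key.
conn.executemany("INSERT OR IGNORE INTO aoldata VALUES (?, ?, ?, ?, ?)", rows)

stored = conn.execute("SELECT COUNT(*) FROM aoldata").fetchone()[0]
print(stored)
```

Only the first row for each (anonid, query) pair survives, so the later click on the same query is dropped. If you care about repeat searches or clicks, drop the primary key and load everything.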
Other Blogs
Paul notes that the AOL data is really Google data, since AOL search is rebranded Google. Zoli has the post that started it all.
This entry was posted on Monday, August 7th, 2006 at 3:50 am.
29 Responses to 'AOL Gate: Search Query Data Scandal'

on August 7th, 2006 at 5:16 am
Come Monday morning AOL is going to find themselves in one hell of a predicament. =)
on August 7th, 2006 at 8:57 am
[…] Elliott Back shows other non-academic uses of the database. The database contains credit card numbers, security numbers, e-mails and other information that should’ve been filtered before the data was released to public. […]
on August 7th, 2006 at 10:30 am
AOL Just Did the Unthinkable - Boycott AOL?…
(Updated)Thank you, Google for resisting the DOJ’s effort to obtain user search data. You put up a good fight to protect our privacy, and you won. Too bad it was all in vain.
AOL, in blatant violation of its users privacy just released the log of 3 m…
on August 7th, 2006 at 11:32 am
Yes, AOL just broke the trust of its members. We are planning to set up a public searchable database for “victims”.
on August 7th, 2006 at 11:44 am
Your create table statement won’t work because it expects the primary keys to be unique.
I recommend this instead:
CREATE TABLE aoldata (anonid int unsigned not null, query varchar(255), querytime datetime, itemrank int unsigned, clickurl varchar(255));
And after you load the data run:
CREATE INDEX aoldata_index on aoldata (anonid,query);
on August 7th, 2006 at 1:25 pm
It will work just fine if you choose the IGNORE option on your LOAD statement, which I’ve done.
on August 7th, 2006 at 3:42 pm
This is not good, my dear friend!
on August 7th, 2006 at 4:14 pm
[…] These are more than 20 million search queries from the previously mentioned 650,000 AOL users. They were anonymized, but with a little skill one can dig out social security and credit card numbers (see the blog of Eliott Back). That blog also has plenty of information on importing the data for further use. […]
on August 7th, 2006 at 5:21 pm
[…] This has been seen as a huge mistake on AOL part and raises some serious privacy concerns. The AOL user IDs in the data had been replaced by random numbers, but there was still quite a bit of private data in the search queries such as Social Security Numbers and credit card numbers. See Elliot Back’s post about the privacy issues. […]
on August 7th, 2006 at 7:08 pm
[…] Elliott Back shows other non-academic uses of the database. The database contains credit card numbers, security numbers, e-mails and other information that should’ve been filtered before the data was released to public.” […]
on August 7th, 2006 at 10:15 pm
[…] Elliot Back’s piece […]
on August 7th, 2006 at 10:17 pm
[…] Elliot Back’s piece […]
on August 8th, 2006 at 4:44 pm
We imported the data into a database, enabled fulltext searching and are making it available for everyone to use and check out. Here is the URL: simplifiedsec.com/KeywordDigger.html
Let me know if you have any suggestions for making it more relevant to you.
on August 9th, 2006 at 5:29 am
Updates on the AOL Scandal…
AOL released private search queries of 500,000 AOL users to the public. Analysts and journalists alike are busy analysing the data for various reasons. I was able to identify three different motivations: 1. How big is the privacy breach, …
on August 9th, 2006 at 6:18 am
[…] Zoli’s Blog revealed the disaster on Aug 6, the day it occurred. Elliot Back did some quick egreps to find approximately 200 social security numbers. Researchers identified at least one anonymous AOL user. Since AOL is now a re-branding of Google, search marketers are using the data to refine AdSense ventures (ring tone, ring tones, ring tones for cell phone, ring tones for cell phone, ring tones garth brooks…). […]
on August 9th, 2006 at 9:45 am
[…] Elliot Black shows that a huge number of social security numbers were included in the AOL data. Some more examples of the search keywords and phrases that could cause privacy problems can be found here. More bloggers covering the topic can be found here and here. AOL’s accidental unleashing of hundreds of thousands of AOL customers’ private searches has already resulted in the discovery of at least one specific person. The New York Times explains how 62-year-old Thelma Arnold’s search keywords and phrases were revealed to all. No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men” to “dog that urinates on everything.” […]
on August 9th, 2006 at 11:53 am
I’ve done the credit card and SSN searches, too. Credit card numbers have a checksum. It is trivially easy to find out which ones have a chance of being real (there are about 10 different ones). It’s another story with SSNs though.
on August 10th, 2006 at 7:25 am
Thelma Arnold’s personal A-O-Hell (and caveat emp…
Run and hide, it’s IT Blogwatch, in which AOL users find their privacy compromised. Not to mention cheap retro gaming consoles (buyer beware)……
on August 12th, 2006 at 1:55 am
[…] Totalling about 500MB in compressed format, the database can be easily found on the Internet, whether via direct download or through BitTorrent. Likewise, instructions on how best to get the data into usable form and create a search form to parse the data are quickly becoming available. […]
on August 12th, 2006 at 11:55 am
A site where you can search the data is here:
www.datablunder.com/logitems/query/
on August 13th, 2006 at 11:45 pm
[…] AOL Search Data Tools List Posted in Search, Law, AOL by Elliott Back on August 13th, 2006. [Del.icio.us] If you don’t know about AOL Gate, you’ve been gone a long time. Well, the good news is that a number of searchable AOL Data databases have been released, each with its own set of unique features. This post attempts to categorize them all! […]
on August 14th, 2006 at 1:09 pm
The issue isn’t AOL’s stupidity - the issue is the Government’s ability to spy on each and every one of us while we mill around like lambs waiting for freedom’s slaughter.
They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.
Ben Franklin
(and if someone wants to hit me with the UK terrorist crackdown, do ask yourself first: how many terrorists has the TSA captured in five years? Oh, and by the way, who checks the confiscated shampoo bottles to see if they are indeed liquid explosives?)
on August 18th, 2006 at 2:34 pm
Well, if nothing else, this data is a great way to stress test a MySQL server. I loaded all the data (not just unique entries) and then created an index on it, which took a little over 6 hours on my dual P3 750 with 512 MB of RAM (Debian 3.1). Oh well, it’s fun playing with it.
on August 20th, 2006 at 4:25 pm
A *quick* site where you can search the AOL Logs for yourself, is here:
www.frogspy.com
on August 22nd, 2006 at 5:53 pm
Thank you for making the search data available for download.
on August 28th, 2006 at 3:35 am
[…] In the meantime, execs were busy scrambling, backpedalling or both, while attempting to assure the public that the data had been deidentified. And just how deidentified was the data? On Wednesday, August 9, 2006, New York Times journalists Michael Barbaro and Tom Zeller Jr. introduced us to one of the many so-called deidentified individuals in their article, “A Face Is Exposed for AOL Searcher No. 4417749.” One gentleman, Elliot Bäck, found everything from credit cards to social security numbers among the so-called deidentified records. […]
on September 6th, 2006 at 12:32 am
And I have the whole database, as do many thousands of other people now! :O
Not that I’ll do anything with it apart from browse through it, but lots of damage will be done by some!
on September 7th, 2006 at 11:56 pm
For anybody who is interested in HOW TO apply the AOL-Data files to SEO for legitimate purposes, check out www.aol-data.com. Also offers discussions on the ethics, and a whole lot of other links and info on SEO in general.
on October 4th, 2006 at 3:42 pm
[…] But this did catch my eye…hopefully not another AOL Gate! Personal information Activity within SearchMash. Some features of SearchMash may enable you to interact with search results beyond simply clicking through a result or navigating to another page of results. We may record the use of these features in a non-personally identifiable manner to evaluate their usefulness. Posted by Full 1511 | […]