SecurityFocus 2007-12-04
In a dramatic demonstration of the privacy dangers of databases that collect consumer habits, two researchers from the University of Texas at Austin have shown that a handful of movie ratings can identify a person as easily as a Social Security number.
The researchers -- graduate student Arvind Narayanan and professor Vitaly Shmatikov, both from the Department of Computer Sciences at the University of Texas at Austin -- claim to have identified two people out of the nearly half million anonymized users whose movie ratings were released by online rental company Netflix last year. The company published the large database as part of its $1 million Netflix Prize, a challenge to the world's researchers to improve the rental firm's movie-recommendation engine.
"Releasing the data and just removing the names does nothing for privacy," Shmatikov told SecurityFocus. "If you know their name and a few records, then you can identify that person in the other (private) database."
While Netflix's dataset did not include names, instead using an anonymous identifier for each user, the collection of movie ratings -- combined with a public database of ratings -- is enough to identify the people, the researchers argued in a paper published soon after Netflix released the data, but which only recently came to light. Narayanan and Shmatikov demonstrated the danger by using public reviews published by a "few dozen" people in the Internet Movie Database (IMDb) to identify movie ratings of two of the users in Netflix's data.
Exposing movie ratings that the reviewer thought were private could expose significant details about the person. For example, the researchers found that one of the people had strong -- ostensibly private -- opinions about some liberal and gay-themed films and also had ratings for some religious films.
More generally, the research demonstrated that information that a person believes to be benign could be used to identify them in other private databases. In privacy and intelligence circles, the result has been understood for decades, but the University of Texas paper visually demonstrates the dangers, said Bruce Schneier, founder and chief technology officer of managed security provider BT Counterpane.
"Even as early as decades ago, the U.S. government would classify aggregates of information, (because) you can take unclassified data and put them together to get something that is not unclassified," Schneier said.
Last year, America Online's chief technology officer resigned after a massive dataset of 20 million searches performed by 658,000 people was published for use in research. The data was believed to be anonymized, but it revealed sensitive details of the searchers' private lives, including Social Security numbers, credit-card numbers, addresses, and, in one case, apparently a searcher's intent to kill their wife.
Privacy worries have heightened in the past few years following a number of data breaches that have leaked sensitive information on millions of people. In November, the head of HM Revenue & Customs, the United Kingdom's tax agency, resigned after two data discs containing sensitive, yet unencrypted, personal details of 25 million U.K. citizens were lost in the mail. In January, retail giant TJX Companies announced that data thieves had stolen the credit- and debit-card details of what is currently estimated to be more than 94 million consumers.
In the latest potential privacy breach, the UT Austin researchers found that they could create an algorithm to use a person's public movie ratings -- in this case, from the Internet Movie Database (IMDb) -- to find if they had rated movies included in the Netflix data set. The researchers found that only two to eight movies needed to be common between the two to detect whether the person was included in the Netflix database. In fact, with eight movies and review dates that have as much as a 14-day error, 99 percent of the records could be uniquely identified in the Netflix data set, the researchers stated in their paper.
"Very little auxiliary information is needed to de-anonymize an average subscriber record from the Netflix Prize dataset," they wrote.
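The matching idea described above can be illustrated with a toy example. The following is a minimal sketch, not the researchers' actual algorithm: the data, the identifiers, and the function names `count_matches` and `candidates` are all hypothetical, and the rule used here (exact star rating plus a date within a 14-day window) is a simplification assumed for illustration.

```python
from datetime import date

# Hypothetical toy data: anonymized id -> {movie: (stars, rating date)}
ANONYMIZED = {
    "u1001": {"A": (5, date(2005, 3, 1)), "B": (2, date(2005, 3, 9)), "C": (4, date(2005, 4, 2))},
    "u1002": {"A": (5, date(2005, 6, 1)), "B": (2, date(2005, 6, 3)), "D": (1, date(2005, 7, 7))},
    "u1003": {"C": (4, date(2005, 4, 20)), "D": (3, date(2005, 1, 5))},
}


def count_matches(aux, record, max_days=14):
    """Count auxiliary ratings that agree with a record on movie, stars,
    and date (within max_days in either direction)."""
    hits = 0
    for movie, (stars, when) in aux.items():
        if movie in record:
            rec_stars, rec_when = record[movie]
            if rec_stars == stars and abs((rec_when - when).days) <= max_days:
                hits += 1
    return hits


def candidates(aux, dataset, max_days=14):
    """Return the ids of records consistent with every auxiliary rating."""
    return [
        uid for uid, record in dataset.items()
        if count_matches(aux, record, max_days) == len(aux)
    ]


# Two public reviews by a known person (hypothetical), dates slightly off.
public_reviews = {"A": (5, date(2005, 3, 4)), "B": (2, date(2005, 3, 15))}
matches = candidates(public_reviews, ANONYMIZED)
print(matches)
```

In this toy data set, two public reviews are already enough to leave a single candidate, even though another record contains the same two movies with the same star ratings; the approximate dates rule it out.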
The two people found using the researchers' algorithm were "exceptionally strong matches," the researchers stated in their paper. For the two IMDb users, the second-best candidates were 28 and 15 standard deviations away from the best match, respectively. The researchers only used a few dozen records from IMDb because they did not want to violate the site's terms of service.
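One way to read the "standard deviations away" figure is as the gap between the best and second-best candidate scores, measured in units of the spread of all candidate scores. The sketch below works under that assumption; the function names, the score values, the choice of population standard deviation, and the threshold are all hypothetical and are not taken from the paper.

```python
from statistics import pstdev


def eccentricity(scores):
    """Gap between the best and second-best score, in standard deviations
    of all candidate scores (population standard deviation assumed)."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two candidate scores")
    sigma = pstdev(ranked)
    if sigma == 0:
        return 0.0  # all candidates look alike; no standout
    return (ranked[0] - ranked[1]) / sigma


def best_match(scores, threshold):
    """Return the top-scoring id only if it stands out by `threshold`
    standard deviations; otherwise report no match."""
    if eccentricity(scores) < threshold:
        return None
    return max(scores, key=scores.get)


# Hypothetical similarity scores for five anonymized records.
scores = {"u1": 8.0, "u2": 1.0, "u3": 0.0, "u4": 1.0, "u5": 0.0}
print(best_match(scores, threshold=2.0))
```

Requiring a large gap before declaring a match is what keeps the approach from reporting a false identification when the person is simply not in the data set.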
The Netflix Prize training data consists of approximately 100 million ratings of nearly 18,000 movies from more than 480,000 people, each of whom was given a random ID to keep their identity a secret. The data accounts for about one-eighth of Netflix's data collected from October 1998 to December 2005.
The researchers stressed that the results apply not only to movies but to any data set that includes a small number of items per person from a larger selection of goods. A person's visits to Internet sites, their shopping cart at Amazon.com, or their musical preferences could be used to distinguish them from other consumers.
"When you leave a grocery store, the stuff that is in your cart is completely unlike the shopping cart of any other person in the world," said UT Austin's Narayanan.
Moreover, the results could link together two identities maintained by the same person. While many online criminals use the strategy to compartmentalize their criminal activity, separating online identities has been suggested by privacy experts as a way to keep, for example, a person's work and personal data distinct from each other. Yet, if a person performs any common activity under both identities -- reviewing movies, shopping online, or even visiting the same set of sites -- the individual risks having those identities, whether work and home or public and private, linked.
Whether the release of a large data set can be considered a breach of privacy is still unanswered. For one, the researchers did not define a breach of privacy as the ability to identify a person and their sensitive data -- essentially the legal definition -- but as the leaking of any information about an individual's non-public data, a more rigorous mathematical definition.
Moreover, Netflix's data by itself does not give any information about its users, said BT Counterpane's Schneier.
"Netflix alone did nothing wrong," he said. "It's only when you combine their data with the IMDb data that there is a breach."
Netflix did not respond to requests for comment on the issue.
UPDATE: The article's description was edited to clarify that only the ratings in the Netflix Prize dataset can currently be matched using publicly available movie reviews.