SecurityFocus 2007-12-04
In a dramatic demonstration of the privacy dangers of databases that collect consumer habits, two researchers from the University of Texas at Austin have shown that a handful of movie ratings can identify a person as easily as a Social Security number.
The researchers -- graduate student Arvind Narayanan and professor Vitaly Shmatikov, both from the Department of Computer Sciences at the University of Texas at Austin -- claim to have identified two people out of the nearly half a million anonymized users whose movie ratings were released by online rental company Netflix last year. The company published the large database as part of its $1 million Netflix Prize, a challenge to the world's researchers to improve the rental firm's movie-recommendation engine.
"Releasing the data and just removing the names does nothing for privacy," Shmatikov told SecurityFocus. "If you know their name and a few records, then you can identify that person in the other (private) database."
While Netflix's dataset did not include names, instead using an anonymous identifier for each user, the collection of movie ratings -- combined with a public database of ratings -- is enough to identify the people, the researchers argued in a paper published soon after Netflix released the data, but which only recently came to light. Narayanan and Shmatikov demonstrated the danger by using public reviews published by a "few dozen" people in the Internet Movie Database (IMDb) to identify movie ratings of two of the users in Netflix's data.
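The linking step described above can be illustrated with a minimal sketch. This is not the researchers' actual algorithm, which the article does not describe; it is a simplified illustration under stated assumptions. The function names, the one-star tolerance, the score thresholds, and the sample data are all hypothetical. The idea is to compare one person's public ratings against every anonymized record and accept a match only when a single record stands out clearly from the rest.

```python
from typing import Dict, Optional

Ratings = Dict[str, int]  # movie title -> star rating (1 to 5)


def similarity(public: Ratings, candidate: Ratings, tolerance: int = 1) -> float:
    """Fraction of the public ratings that the candidate record matches.

    A rating counts as a match when the candidate rated the same movie
    within `tolerance` stars (a hypothetical allowance for the same person
    rating slightly differently on two sites).
    """
    if not public:
        return 0.0
    hits = sum(
        1
        for movie, stars in public.items()
        if movie in candidate and abs(candidate[movie] - stars) <= tolerance
    )
    return hits / len(public)


def link(
    public: Ratings,
    anonymized: Dict[str, Ratings],
    min_score: float = 0.8,  # hypothetical threshold
    min_gap: float = 0.3,    # hypothetical lead required over the runner-up
) -> Optional[str]:
    """Return the anonymous ID that best matches `public`, or None.

    A match is reported only if the best score is high and clearly ahead
    of the second-best score, to avoid naming the wrong record.
    """
    scores = sorted(
        ((similarity(public, record), user_id) for user_id, record in anonymized.items()),
        reverse=True,
    )
    if not scores:
        return None
    best_score, best_id = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    if best_score >= min_score and best_score - runner_up >= min_gap:
        return best_id
    return None


# Made-up sample data: an "anonymized" release and one person's public reviews.
anonymized = {
    "user_1001": {"Movie A": 5, "Movie B": 1, "Movie C": 4, "Movie D": 2},
    "user_1002": {"Movie A": 2, "Movie B": 5, "Movie E": 3},
    "user_1003": {"Movie C": 4, "Movie F": 5},
}
public_reviews = {"Movie A": 5, "Movie B": 2, "Movie C": 4}

matched_id = link(public_reviews, anonymized)
```

Once a record is linked, every other rating in it (here, the hypothetical "Movie D") becomes attributable to the named reviewer, which is the exposure the researchers warn about.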
Revealing movie ratings that the reviewer thought were private could expose significant details about the person. For example, the researchers found that one of the people had strong -- ostensibly private -- opinions about some liberal and gay-themed films and also had ratings for some religious films.
More generally, the research demonstrated that information that a person believes to be benign could be used to identify them in other private databases. In privacy and intelligence circles, the result has been understood for decades, but the University of Texas paper visually demonstrates the dangers, said Bruce Schneier, founder and chief technology officer of managed security provider BT Counterpane.
"Even as early as decades ago, the U.S. government would classify aggregates of information, (because) you can take unclassified data and put them together to get something that is not unclassified," Schneier said.
Last year, America Online's chief technology officer resigned after a massive dataset of 20 million searches performed by 658,000 people was published for use in research. The data was believed to be anonymized, but revealed sensitive details of the searchers' private lives, including Social Security numbers, credit-card numbers, addresses, and, in one case, apparently a searcher's intent to kill their wife.
Privacy worries have heightened in the past few years following a number of data breaches that have leaked sensitive information on millions of people. In November, the head of HM Revenue & Customs, the United Kingdom's tax agency, resigned after two data discs containing sensitive, yet unencrypted, personal details of 25 million U.K. citizens were lost in the mail. In January, retail giant TJX Companies announced that data thieves had stolen the credit- and debit-card details of what is currently estimated to be more than 94 million consumers.