SecurityFocus 2007-12-04
In the latest potential privacy breach, the UT Austin researchers found that they could create an algorithm that uses a person's public movie ratings -- in this case, from the Internet Movie Database (IMDb) -- to determine whether that person had rated movies included in the Netflix data set. The researchers found that only two to eight movies needed to be common between the two sources to detect whether the person was included in the Netflix database. In fact, with eight movies and review dates that are off by as much as 14 days, 99 percent of the records could be uniquely identified in the Netflix data set, the researchers stated in their paper.
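The matching idea can be illustrated with a minimal Python sketch. This is not the researchers' algorithm, which the article does not spell out. It assumes each record maps a movie title to a (rating, date) pair, counts how many of a person's public ratings agree with an anonymized record within a 14-day window, and keeps the records that agree often enough. The function names, the toy data and the exact-rating requirement are all hypothetical.

```python
from datetime import date

def count_matches(aux, record, rating_tol=0, day_tol=14):
    """Count public (movie, rating, date) items that agree with one record."""
    hits = 0
    for movie, (rating, when) in aux.items():
        if movie not in record:
            continue
        rec_rating, rec_when = record[movie]
        same_rating = abs(rec_rating - rating) <= rating_tol
        close_date = abs((rec_when - when).days) <= day_tol
        if same_rating and close_date:
            hits += 1
    return hits

def candidates(aux, dataset, min_matches):
    """Return the IDs of records that agree with at least min_matches items."""
    return [uid for uid, record in dataset.items()
            if count_matches(aux, record) >= min_matches]

# Toy "anonymized" data set: random ID -> {movie: (rating, date)}
dataset = {
    1001: {"A": (5, date(2005, 3, 1)), "B": (2, date(2005, 3, 9)),
           "C": (4, date(2005, 4, 2))},
    1002: {"A": (5, date(2005, 3, 3)), "D": (3, date(2005, 5, 1))},
    1003: {"B": (2, date(2004, 1, 1)), "C": (4, date(2005, 4, 20))},
}

# Public ratings for one person, with dates that are a few days off
aux = {"A": (5, date(2005, 3, 5)), "B": (2, date(2005, 3, 15)),
       "C": (4, date(2005, 4, 10))}

print(candidates(aux, dataset, min_matches=3))
```

In this toy example only record 1001 agrees on all three movies, so three public ratings are enough to single it out.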
"Very little auxiliary information is needed to de-anonymize an average subscriber record from the Netflix Prize dataset," they wrote.
The two people found using the researchers' algorithm were "exceptionally strong matches," the researchers stated in their paper. The second-best candidates for matching the two IMDb users selected by the researchers were 28 and 15 standard deviations away. The researchers used only a few dozen records from IMDb because they did not want to violate the site's terms of service.
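One way to read a figure such as "28 standard deviations away" is as the gap between the best and second-best match scores, measured in units of the spread of all candidate scores. The sketch below computes that ratio in Python. The article does not give the exact statistic used in the paper, so this definition, the function name and the toy scores are all assumptions.

```python
import statistics

def match_gap(scores):
    """Gap between the top two scores, in standard deviations of all scores.

    Expects at least two scores. Returns 0.0 when every score is identical.
    """
    ordered = sorted(scores, reverse=True)
    best, second = ordered[0], ordered[1]
    sigma = statistics.pstdev(scores)  # population standard deviation
    if sigma == 0:
        return 0.0
    return (best - second) / sigma

# Toy match scores for eight candidate records: one clear winner
scores = [8, 1, 1, 1, 1, 0, 0, 0]
print(match_gap(scores))

# Two candidates tied for first place: no gap, so no confident match
print(match_gap([3, 3, 0, 0]))
```

A large gap means the best candidate stands far apart from every alternative. A gap near zero means several records fit about equally well and no match should be claimed.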
The Netflix Prize training data consists of approximately 100 million ratings of nearly 18,000 movies from more than 480,000 people, each of whom was given a random ID to keep their identity a secret. The data accounts for about one-eighth of Netflix's data collected from October 1998 to December 2005.
The researchers stressed that the results apply not only to movies but to any data set that includes a small number of items per person from a larger selection of goods. A person's visits to Internet sites, their shopping cart at Amazon.com, or their musical preferences could be used to distinguish them from other consumers.
"When you leave a grocery store, the stuff that is in your cart is completely unlike the shopping cart of any other person in the world," said UT Austin's Narayanan.
Moreover, the results could link together two identities maintained by the same person. While many online criminals use separate identities to compartmentalize their criminal activity, privacy experts have also suggested the practice as a way to keep, for example, a person's work and personal data distinct from each other. Yet, if a person performs any common activity between the two identities -- reviewing movies, shopping online, or even going to the same set of sites -- the individual risks having his identities, whether work and home or public and private, linked.
Whether the release of a large data set can be considered a breach of privacy is still unanswered. For one, the researchers did not define a breach of privacy as the ability to identify a person and their sensitive data -- essentially the legal definition -- but rather as any leak of information about an individual's non-public data, a more rigorous mathematical definition.
Moreover, Netflix's data by itself does not give any information about its users, said BT Counterpane's Schneier.
"Netflix alone did nothing wrong," he said. "It's only when you combine their data with the IMDb data that there is a breach."
Netflix did not respond to requests for comment on the issue.
UPDATE: The article's description was edited to clarify that only the ratings in the Netflix Prize dataset can currently be matched using publicly available movie reviews.