Knock me down with a feather. As a techy blogging bystander, I’ve been watching as the non-technical media world figures out the implications of the NSA’s metadata mining operation. The latest and best example of this promising trend can be found over at Slate. Dahlia Lithwick, the star legal journalist there, has an informative piece explaining how metadata can be used to infer lots of information about you.
Lithwick gets extra points for noting that there’s other metadata on the Interbytes that the intelligence agencies can use to make even more precise inferences. And they don’t even have to cross the line–not that this is an issue for the NSA–into scooping up Internet traffic. There’s indirect public information on social media sites that, as we’ve pointed out here and here, can be used to make reasonably good guesses about behaviors, preferences, education, income, beliefs, etc.
Lithwick doesn’t quite lay it all out, but the bigger picture is that the metadata mining of phone numbers and IP addresses builds the social graph; the NSA’s scooping up of social media metadata and other auxiliary content fills out a potential subject’s attributes. When you put the two pieces together, you’ve created … a Social Attribute Network or SAN, which is essentially a (somewhat accurate) inference engine that can answer, in theory, questions such as, “How likely is it that Bob Jones owns a 3-D printer for making explosive devices?”
Maybe she’ll get to that in another post. However, she does have a very useful link to Princeton University’s Edward Felten, a professor of computer science and public affairs. His web page is very much worth exploring. While wandering around there, I came across an academic paper that has raised my paranoia even further–thanks Prof. Felten.
Over on my other blog show, I write about all the new forms of something called PII or personally identifiable information. Think of PII as essentially social security number, name, address, phone number, etc.–all the obvious identifiers that you wouldn’t want to be publicly connected with any personal or sensitive information about you. Over the years, there’s been a blurring of lines between these standard identifiers and anonymous information. A new class of quasi-identifiers has emerged that can encompass, for example, geo-coordinates and timestamps, facial images, and most famously, birth date and zip code, all of which can be linked back to a specific individual.
In Felten’s paper, he shows that completely anonymized aggregated data–which I always assumed was as anonymous as you get–can also reveal private information. Yikes! The main culprit is the massive amount of preference information that’s collected by the likes of Amazon, Netflix, Last.fm and other e-retail sites. Through their APIs, these sites make available(!) to developers and marketers a table or matrix of correlation information between items in their inventory–i.e., Harry Potter customers are 60% likely to also buy World of Warcraft, but only 1% likely to read “50 Shades of Grey”. You can think of this table as the underlying inference infrastructure.
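For the curious, here is roughly what such an item-to-item table looks like if you build it yourself. This is only a minimal sketch in Python: the function name item_to_item_table and the five toy shopping baskets are made up for illustration, and real retailers almost certainly compute their "related items" numbers in more elaborate ways.

```python
from collections import defaultdict
from itertools import permutations


def item_to_item_table(baskets):
    """Return table[a][b] = fraction of customers who bought a and also bought b."""
    bought = defaultdict(int)                          # customers per item
    together = defaultdict(lambda: defaultdict(int))   # customers per ordered pair
    for basket in baskets:
        items = set(basket)
        for a in items:
            bought[a] += 1
        for a, b in permutations(items, 2):
            together[a][b] += 1
    return {
        a: {b: n / bought[a] for b, n in row.items()}
        for a, row in together.items()
    }


# Hypothetical purchase histories, one set per anonymous customer.
baskets = [
    {"Harry Potter", "World of Warcraft"},
    {"Harry Potter", "World of Warcraft"},
    {"Harry Potter", "World of Warcraft", "Lego"},
    {"Harry Potter"},
    {"Harry Potter", "Lego"},
]

table = item_to_item_table(baskets)
print(table["Harry Potter"]["World of Warcraft"])  # 3 of 5 Harry Potter buyers
```

Notice that no customer name appears anywhere in the resulting table. It is purely aggregate, which is exactly why it feels safe to publish.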
Felten and his co-authors took these correlation numbers and then, by scanning for public preference declarations that we all make on social media sites–Twitter, Facebook, and e-retail sites with social forums–were able to derive a few consumers’ likely purchasing history. Not a bad day’s work considering they were using completely public information.
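To get a feel for how the two pieces combine, here is a toy Python sketch. To be clear, this is not the technique from the paper, just a naive scoring heuristic of my own: take the items someone has publicly said they like, look up each one in a correlation table, and rank everything else by its average correlation. The function likely_purchases, the table values, and the declared items are all hypothetical.

```python
def likely_purchases(table, declared, top_n=3):
    """Rank undeclared items by their average correlation with the declared ones."""
    scores = {}
    for item in declared:
        for other, p in table.get(item, {}).items():
            if other not in declared:
                scores[other] = scores.get(other, 0.0) + p
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, total / len(declared)) for name, total in ranked[:top_n]]


# Hypothetical item-to-item correlations, as might be exposed through an API.
table = {
    "Harry Potter": {"World of Warcraft": 0.60, "50 Shades of Grey": 0.01},
    "Lord of the Rings": {"World of Warcraft": 0.50, "Chess set": 0.20},
}

# Items a person has publicly declared a liking for, e.g. in tweets.
declared = {"Harry Potter", "Lord of the Rings"}

guesses = likely_purchases(table, declared)
for name, score in guesses:
    print(f"{name}: {score:.3f}")
```

Even this crude version puts World of Warcraft at the top of the list for someone who has only ever mentioned two fantasy books in public.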
As a public policy issue, there are disturbing questions about whether these preference tables should be made public in the first place. In fact, a few years back Netflix was forced (by the FTC) to remove a public dataset of movie preference lists of anonymized subscribers–so-called user-to-item lists–from its website. Two researchers, Arvind Narayanan and Vitaly Shmatikov, proved that it was possible to re-identify preference data for a few individuals. Now Felten and his co-authors, who include the same Narayanan and Shmatikov, have run experiments indicating that even aggregated item-to-item relationships reveal clues.
The overall and dispiriting takeaway is to be careful about what you say publicly on Twitter and the rest of the blabosphere. These tidbits can be connected to massive datasets that are partially made public through APIs and ultimately allow marketers, hackers, and information brokers to learn more about you than you would have thought possible.
The FTC, by the way, is looking into some of these Internet-era privacy issues and potential abuses. Dahlia, if you’re reading, that might make a good post as well.
Photo credit: Sashataylor