Social media and our too-much-information online culture have brought new life to an old privacy vulnerability. The kind of privacy loophole I’m referring to has actually been around since before the Internet. The idea is to take a few known and relatively unique personal attributes and match them against other data, usually public in nature. With very high likelihood, you can find your man or woman. It’s a technique not unheard of in detective work.
In the Internet epoch, these kinds of re-identification attacks have been receiving new scrutiny from regulatory agencies. In 2009, the FTC persuaded Netflix not to release an anonymized movie rating dataset for its well-publicized algorithm contest. The regulators based their decision on an experiment performed by two University of Texas computer scientists who re-identified users in an older public dataset of Netflix movie ratings (essentially long rows of 1-5 ratings with no personal information).
The researchers succeeded: their algorithm found the complete Netflix movie ratings of several users by matching against public and self-identified movie ratings from the IMDb movie fan site.
With the publication of a new research paper, the stakes have become even higher.
Keith Ross, the Leonard J. Shustek Chair Professor in Computer Science at the Polytechnic Institute of NYU, along with colleagues Ratan Dey and Yuan Ding, was able to identify a high percentage of the student body of several US high schools using Facebook profiles. What turns this theoretical attack into a call for action is how easy it was to link Facebook users under 18, whose public profiles are minimal and whose high school affiliations are blocked, to a specific high school and graduating class.
The technique is similar in some respects to the Netflix example. Unfortunately, Facebook, Twitter, et al. have made the job of the hacker all the easier with their online troves of semi-private information.
In this particular experiment, federal law, in the form of the Children’s Online Privacy Protection Act (COPPA), does play a role in controlling and limiting the collection of personal data from children 12 and younger. Web sites that accept these minors as users are required to get explicit consent for collecting personal data, including verified parental approval.
Rather than comply with COPPA, Facebook bans these legal minors altogether. However, users between the ages of 13 and 18 are considered by Facebook to be registered minors. To its credit, Facebook reveals minimal information (name, photo, gender) from this group to a stranger (a non-“friend”) performing a public search. More on this point later.
So how did the NYU researchers identify the high-school class details of these registered minors?
The researchers exploited what is, to my eye, a serious flaw in Facebook’s protections. High school students can claim they are older than 18 and still state a graduating class in the future, say 2014 or 2015. Facebook doesn’t catch this probable contradiction. Facebook allows the profiles of these “adult” high-schoolers to reveal more information than those of registered minors, including (by default) hometown, high-school affiliation, and friends lists. And the last attribute is Ross and his co-researchers’ foot in the door.
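To make the contradiction concrete, here is a minimal sketch of the kind of consistency check that could flag such profiles. It is purely illustrative: the function name, its parameters, and the choice of 2013 as the "current" year are my own assumptions, not anything taken from Facebook or from the paper.

```python
def is_probable_contradiction(claimed_age, grad_year, current_year):
    """Flag a profile that claims to be older than 18 but lists a
    high-school graduating class that is still in the future.

    Hypothetical helper for illustration only; a real check would
    need to allow for legitimate older students.
    """
    return claimed_age > 18 and grad_year > current_year


# Illustrative profiles (made-up values, assuming the current year is 2013).
adult_with_future_class = is_probable_contradiction(21, 2015, 2013)
registered_minor = is_probable_contradiction(16, 2015, 2013)
adult_alumnus = is_probable_contradiction(25, 2006, 2013)
```

Only the first profile is flagged. A rule this crude would misfire on some genuine cases, which is why the text above calls it a probable contradiction rather than a certain one.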
Their algorithm, somewhat simplified, is this: collect all the friends associated with a core set of adult users associated with a given school, and assume there are a few who are true minors. From this larger candidate pool, rank the names by the number of times they are referenced from the core set. Those who appear on more than one friends list of the core are more likely to be in that high school.
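The simplified ranking step can be sketched in a few lines of Python. This is my own toy version of the idea as described above, not the researchers' code: the function name, the data layout (a dictionary of seed profile to friends list), and the sample names are all assumptions made for illustration.

```python
from collections import Counter


def rank_candidates(seed_friends):
    """Rank candidates by how many seed friends lists they appear on.

    seed_friends: dict mapping a seed ("adult") profile id to an
    iterable of friend ids. Returns (candidate, count) pairs sorted
    by descending count, with ties broken alphabetically.
    """
    counts = Counter()
    for friends in seed_friends.values():
        # set() so a name repeated within one list is counted once
        for friend in set(friends):
            counts[friend] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up example: three seed profiles and their public friends lists.
seed_friends = {
    "seed_a": ["pat", "lee", "sam"],
    "seed_b": ["pat", "lee", "kim"],
    "seed_c": ["pat", "joe"],
}
ranked = rank_candidates(seed_friends)

# Keep only the top of the list; the cutoff of 2 is arbitrary here.
top = [name for name, count in ranked[:2]]
```

In this toy data "pat" shows up on all three lists and "lee" on two, so they rank first. Where to cut the list off is the threshold discussed next.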
The algorithm was turned into real software for crawling Facebook profiles. Working with three high schools, Ross was able to obtain enrollment data for benchmarking purposes. For one high-school class with 352 students, the software discovered 18 relevant seed profiles, i.e., “adults” with public friends lists. From these seeds, and using the friends lists, a pool of over 6000 potential students was generated! If you suspected that students spend too much time online, the study provides solid evidence: that’s over 300 friends per seed student.
As I explained above, Ross’s software then ranks these candidates, and depending on the threshold value that’s set for how far down the list to explore, the stalk-ware will produce more and more potential students, though with increasing “false positives”. I’ve listed a table of results (above) that Ross collected from an attack on the aforementioned high school. Ross’s extended algorithm (you can read the paper to see his natural tweaks) ultimately found 175 students in the first 200 names in the ranked list. Which is scarily good, unfortunately.
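To put those numbers in perspective, here is the back-of-the-envelope arithmetic, using only the figures quoted in this post (175 hits in the top 200, a class of 352). The variable names are mine, and the "share of class" figure is my own rough division rather than a result reported in the paper.

```python
def precision_at_k(true_hits, k):
    """Fraction of the top-k ranked names that are real students."""
    return true_hits / k


hits = 175        # real students found in the top of the ranked list
k = 200           # how far down the ranked list the attacker looks
class_size = 352  # enrollment of the benchmark class

precision = precision_at_k(hits, k)   # share of the top 200 that is correct
false_positives = k - hits            # names in the top 200 that are wrong
share_of_class = hits / class_size    # rough share of the class recovered

print(f"precision: {precision:.1%}")
print(f"false positives: {false_positives}")
print(f"share of class: {share_of_class:.1%}")
```

So only 25 of the first 200 names are wrong, and those 200 names already cover roughly half of the class.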
Keep in mind that these 175 students are Facebook minors (13-18 years old) whose location and high-school affiliations are not made public.
I had quibbles with the paper’s conclusions about the adverse effects of COPPA. Thankfully, I also had the chance to talk to Professor Ross about his research, and I’ll release an edited transcript next week.
My thoughts: I think children will always lie about their age to gain additional adult rights, whereas the report finds that in a COPPA-less world no online lying would occur. But it does seem reasonable, as the report notes, that with the hassles of COPPA, pre-teens will, for convenience, misstate their age to enter the world of Facebook, ultimately increasing the pool of “adults” when this group enters high school.
There’s no perfect solution, but there is a real problem with Facebook’s carelessness in protecting children’s personal data.