So can we stop now with the fantasy that data can be anonymized?
Two things sparked this train of thought. The first was seeing that researchers at the Mayo Clinic have shown that commercial facial recognition software accurately identified 70 of a sample of 84 MRI brain scans (that’s 83%). For ten additional subjects, the software placed the correct identification in its top five choices. Yes, on reflection, it’s obvious that you can’t scan a brain without including its container, and that bone structure defines a face. It’s still a fine example of data that is far more revealing than you expect.
The second was when Phil Booth, the executive director of medConfidential, called out the National Health Service on Twitter for weakening the legal definition of “anonymous” in its report on artificial intelligence (PDF).
In writing the MRI story for the Wall Street Journal (paywall), Melanie Evans notes that people have also been reidentified from activity patterns captured by wearables, a cautionary tale now that Google’s owner, Alphabet, seeks to buy Fitbit. Cautionary, because the biggest contributor to reidentifying any particular dataset is other datasets to which it can be matched.
The earliest scientific research on reidentification I know of was Latanya Sweeney’s 1997 success in identifying then-governor William Weld’s medical record by matching the “anonymized” dataset of records of visits to Massachusetts hospitals against the voter database for Cambridge, which anyone could buy for $20. Sweeney has since found that 87% of Americans can be matched from just their gender, date of birth, and zip code. More recently, scientists at Louvain and Imperial College found that just 15 attributes can identify 99.8% of Americans. Scientists have reidentified individuals from anonymized shopping data, and by matching mobile phone logs against transit trips. Combining those two datasets identified 95% of the Singaporean population in 11 weeks; add GPS records and you can do it in under a week.
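The mechanics of Sweeney-style linkage attacks are simple enough to fit in a few lines. Here is a minimal sketch, with entirely invented records and names, of joining an “anonymized” health dataset to a public roll on the quasi-identifiers she used (zip code, date of birth, sex):

```python
# Hypothetical illustration of a linkage attack. All records below are
# invented; the point is the join, not the data.

# "Anonymized" dataset: names stripped, quasi-identifiers retained.
health_records = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-03-14", "sex": "F", "diagnosis": "asthma"},
]

# Public dataset: names attached to the same quasi-identifiers.
voter_roll = [
    {"name": "W. Weld", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "J. Doe",  "zip": "02139", "dob": "1971-11-02", "sex": "F"},
]

def link(health, voters):
    """Join the two datasets on (zip, dob, sex); each unique hit
    re-attaches a name to a supposedly anonymous record."""
    index = {(v["zip"], v["dob"], v["sex"]): v["name"] for v in voters}
    matches = []
    for record in health:
        key = (record["zip"], record["dob"], record["sex"])
        if key in index:
            matches.append((index[key], record["diagnosis"]))
    return matches

print(link(health_records, voter_roll))
# → [('W. Weld', 'hypertension')]
```

When 87% of Americans are unique on just these three fields, the index lookup above mostly returns exactly one candidate, which is the whole problem.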
This sort of thing shouldn’t be surprising any more.
The legal definition that Booth cited is Recital 26 of the General Data Protection Regulation, which specifies in much more detail how to assess the odds of successful reidentification (“all the means likely to be used”, “account should be taken of all objective factors”).
Instead, here’s the passage he highlighted from the NHS report as defining “anonymized” data (page 23 of the PDF, 44 of the report): “Data in a form that does not identify individuals and where identification through its combination with other data is not likely to take place.”
I love the “not likely”. It sounds like one of the excuses that’s so standard that Matt Blaze put them on a bingo card. If you had asked someone in 2004 whether it was likely that their children’s photos would be used to train AI facial recognition systems that in 2019 would be used to surveil Chinese Muslims and out pornography actors in Russia, they would have said no. And yet here we are. You can never reliably predict what data will be of what value, or to whom.
At this point, until proven otherwise it is safer to assume that there really is no way to anonymize personal data and make it stick for any length of time. It’s certainly true that in some cases the sensitivity of any individual piece of data – say your location on Friday at 11:48 – vanishes quickly, but the same is not true of those data points when aggregated over time. More important, patient data is not among those types and never will be. Health data and patient information are sensitive and personal not just for the life of the patient but for the lives of their close relatives on into the indefinite future. Many illnesses, both mental and physical, have genetic factors; many others may be traceable to conditions prevailing where you live or grew up. Either way, your medical record is highly revealing about the rest of your family members – particularly to insurance companies interested in minimizing their risk of payouts or an employer wishing to hire only robustly healthy people.
Thirty years ago, when I was first encountering large databases and what happens when you match them together, I came up with a simple privacy-protecting rule: if you do not want the data to leak, do not put it in the database. This still seems to me definitive – but much of the time we have no choice.
I suggest the following principles and assumptions.
One: Databases that can be linked, will be. The product manager’s comment Ellen Ullman reported in 1997 still pertains: “I’ve never seen anyone with two systems who didn’t want us to hook them together.”
Two: Data that can be matched, will be.
Three: Data that can be exploited for a purpose you never thought of, will be.
Four: Stop calling it “sharing” when the entities “sharing” your personal data are organizations, especially governments or commercial companies, not your personal friends. What they’re doing is *disclosing* your information.
Five: Think collectively. The worst privacy damage may not be to *you*.
The bottom line: we have now seen so many examples of “anonymized” data that can be reidentified that the claim that any dataset is anonymized should be considered as extraordinary a claim as saying you’ve solved Brexit. Extraordinary claims require extraordinary proof, as the skeptics say.
Addendum: if you’re wondering why net.wars skipped the 50th anniversary of the first ARPAnet connection: first of all, we noted it last week; second of all, whatever headline writers think, it’s not the 50th anniversary of the Internet, whose beginnings, as we wrote in 2004, are multiple. If you feel inadequately served, I recommend this from 2013, in which some of the Internet’s fathers talk about all the rules they broke to get the network started.
Illustrations: Monty Python performing the Spanish Inquisition sketch in 2014 (via Eduardo Unda-Sanzana at Wikimedia).
Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. Stories about the border wars between cyberspace and real life are posted occasionally during the week at the net.wars Pinboard – or follow on Twitter.