In the booming world of Big Data, consumers, governments, and even companies are rightfully concerned about the protection and security of their data, and about keeping the personal and potentially embarrassing details of one’s life from falling into nefarious hands. At the same time, most would recognize that Big Data can serve valuable purposes, such as lifesaving medical research and the improvement of commercial products. A question at the center of this discussion is therefore whether, and how, data can be effectively “de-identified” or even “anonymized” to limit privacy concerns – and whether the distinction between the two terms is more theoretical than practical. (As I mentioned in a prior post, “de-identified” data can still potentially be re-identified, while, at least in theory, anonymized data cannot.)
Privacy of health data is particularly important, and so the U.S. Health Insurance Portability and Accountability Act (HIPAA) includes strict rules on the use and disclosure of protected health information. These privacy constraints do not apply if the health data has been de-identified – either through a safe harbor-blessed process that removes 18 key identifiers or through a formal determination by a qualified expert – in either case presumably because these mechanisms are seen as a reasonable way to make re-identification difficult. But even the federal government admits that data de-identified in accordance with these standards still “retains some risk of identification,” explaining: “Although the risk is very small, it is not zero, and there is a possibility that de-identified data could be linked back to the identity of the patient to which it corresponds.” Interestingly, though, the government’s explanation of the HIPAA rules does not draw a clear distinction between “de-identified” and “anonymized” as defined above.
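To make the safe harbor mechanism concrete, here is a minimal sketch (in Python, with hypothetical field names of my own choosing) of the kind of field suppression it describes. Note this is illustrative only: the actual rule enumerates 18 specific identifier categories and carries additional constraints, for example on ZIP codes and on ages over 89.

```python
# Illustrative safe-harbor-style suppression. The field names below are
# hypothetical; HIPAA's safe harbor lists 18 identifier categories and
# imposes further rules (e.g. for ZIP codes and ages over 89).
SUPPRESSED_FIELDS = {
    "name", "street_address", "phone", "email", "ssn",
    "medical_record_number", "ip_address", "birth_date",  # partial list
}

def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record with the listed identifier fields removed."""
    return {k: v for k, v in record.items() if k not in SUPPRESSED_FIELDS}

patient = {"name": "Jane Doe", "ssn": "000-00-0000",
           "diagnosis": "J45.909", "year_of_birth": 1980}
deidentified = strip_identifiers(patient)
# Direct identifiers are dropped; clinical fields like 'diagnosis' survive,
# which is exactly why some re-identification risk remains.
```

The surviving fields are the point: as the government’s own guidance concedes, what remains after suppression can still, in combination with outside data, be linked back to a patient.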
In fact, in a 2010 paper, Mark A. Rothstein argued that “the use of deidentified health information and biological specimens in research creates a range of privacy and other risks to individuals and groups”. Rothstein suggested that research using de-identified health records could still cause groups of individuals to be stigmatized by conclusions based on race, gender, and other factors. De-identification, he writes, “does not usually remove information about an individual’s membership in certain groups defined by race, ethnicity, gender, religion, or other criteria.” In other words, decreasing the risk of identifying individuals can actually increase the risk of identifying groups.
As noted above, there is an ongoing debate over whether data can ever truly become anonymized and thus impossible to re-identify. For example, Gretchen McCord pointed to a 2000 study and to examples of re-identified data to demonstrate what she calls the “failure of anonymization to protect privacy” and “the degree of falseness in our sense of security in our online privacy, even when our names are not directly or publicly linked with our online activities.” In one such example, she writes:
In 2006, AOL responded to the growing interest in open research by releasing an “anonymized” set of 20,000,000 search queries input by 650,000 users. AOL replaced PII such as names and IP addresses with unique identifier numbers, which were needed, of course, to be able to connect search queries made by the same user for the purpose of researching online behavior.
As researchers combed through the data, stories began to develop, one of the most startling being the user who searched for the phrases “how to kill your wife,” “pictures of dead people,” and “car crash photo.” Eventually, one researcher identified an individual, Thelma Arnold of Lilburn, Georgia, based on her combined searches: “homes sold in shadow lake subdivision gwinnett county georgia,” “landscapers in Lilburn, Ga,” and several searches for people with the last name “Arnold.”
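The AOL release is a textbook case of pseudonymization rather than anonymization. A minimal sketch (my illustration, not AOL’s actual process) of replacing user identifiers with stable random numbers shows why it failed: the pseudonym links all of one user’s queries together, and the query text itself remains identifying.

```python
import secrets

def pseudonymize(log_rows):
    """Replace each user identifier with a stable random pseudonym.

    log_rows is a list of (user, query) pairs; the queries themselves
    are left untouched -- which is the weakness illustrated here.
    """
    id_map = {}  # user -> random pseudonym, reused across that user's rows
    out = []
    for user, query in log_rows:
        if user not in id_map:
            id_map[user] = secrets.token_hex(4)
        out.append((id_map[user], query))
    return out

rows = [
    ("user_a", "landscapers in Lilburn, Ga"),
    ("user_a", "homes sold in shadow lake subdivision"),
    ("user_b", "weather today"),
]
pseudo = pseudonymize(rows)
# Both of user_a's rows share one pseudonym, so the searches can still be
# linked together -- and their content can point back to a real person.
```

This is exactly the pattern that let a researcher assemble Thelma Arnold’s identity from her linked queries: the random identifier protects nothing once the linked records contain their own clues.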
Similarly, in a Forbes article published in 2016, Kalev Leetaru argued that “many in the data science community I’ve met still cling to the belief that just replacing usernames with random numbers somehow magically secures a dataset against any possible reidentification.” Interestingly, a separate post on Privacy Analytics directly challenged the Leetaru article, claiming that “[n]one of the examples in his article are examples of anonymization” and that “there are people and organization[s] that anonymize data effectively every day – but they don’t make the news like these sensationalized stories.”
A 2014 EU working party opinion reviewed anonymization techniques and generally found that they can be effective. But perhaps the most important takeaway for anyone seeking to use aggregated de-identified or anonymized data is the opinion’s acknowledgement that an “anonymised dataset can still present residual risks to data subjects” and that “anonymisation should not be regarded as a one-off exercise and the attending risks should be reassessed regularly by data controllers.”