Scholars at Princeton University have delivered a stinging rebuke to the 'big data' movement, insisting that today's data de-identification tools are not sufficient to ensure privacy.
Assistant Professor Arvind Narayanan and Professor Edward Felten have published an academic paper titled 'No silver bullet: De-identification still doesn't work' [pdf], which picks apart the research methodology of a paper published last month by ITIF researcher Daniel Castro and Ontario privacy commissioner Ann Cavoukian, which had concluded the opposite.
Marketers across the globe are building tools to take advantage of the large volumes of data created by web sites and the use of digital devices such as smartphones, with half an eye on a new generation of embedded digital devices in everything from automobiles to consumer goods.
Steady improvements in computer processing power and distributed file systems have helped marketers gain far richer insights into large data sets at far greater speed.
While most agree that big data tools provide economic and social value, theorists are split on whether sufficient legal and technical frameworks exist to ensure that insights can be drawn from large, aggregated data sets without infringing on an individual's right to privacy.
Data de-identification, the storing and sharing of data in such a way that the identity of any one individual can’t be ascertained from the broader data set, has become one of the latest battlegrounds on the subject.
Castro and Cavoukian’s June paper concluded that the risk of re-identification of an individual from a de-identified data set has been “greatly exaggerated”, inflamed by researchers that haven’t used tools effectively and blown out of proportion by the media.
"Contrary to what misleading headlines and pronouncements in the media almost regularly suggest, datasets containing personal information may be de-identified in a manner that minimises the risk of re-identification, often while maintaining a high level of data quality."
The authors argued that much of this research had either failed to provide enough proof that individuals could be re-identified from the data sets queried; matched the data with third-party data sets to achieve its aims; or applied specialist knowledge that isn't readily available in the 'real world'. They were concerned that policy makers may feel compelled to regulate in ways that reduce the utility of these data sets.
"While data is typically collected for a single purpose, increasingly it is the many different secondary uses of the data wherein tremendous economic and social value lies. For example, recent studies have shown that large-scale mobile phone data can help city planners and engineers better understand traffic patterns and thus design road networks that will minimize congestion. De-identifying the data is one way to enable its reuse by third parties.
While it is not possible to guarantee that de-identification will work 100 percent of the time, it remains an essential tool that will drastically reduce the risk of personal information being used or disclosed for unauthorized or malicious purposes."
In reply, Narayanan and Felten methodically pulled apart this defence of the efficacy of big data de-identification, listing eight problems with the June paper and insisting that data de-identification is 'no silver bullet' for ensuring privacy.
"There is no evidence that de-identification works either in theory or in practice and attempts to quantify its efficacy are unscientific and promote a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do."
The Princeton scholars’ paper listed many examples of where a motivated actor could readily combine aggregated data sets from mobile network connections or web site clicks with information easy to obtain from elsewhere to identify an individual.
It further quashed the notion that only a handful of individuals have the tools to re-identify data, citing the tens of millions of people qualified in software development as more than capable.
"Most 'anonymized' datasets require no more skill than programming and basic statistics to de-anonymize."
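The kind of attack the scholars have in mind can indeed be written in a few lines. The sketch below is purely illustrative — the datasets, field names and people are invented, not drawn from either paper — and shows the classic linkage attack: joining a 'de-identified' record set against a public register on shared quasi-identifiers such as postcode, birth year and sex.

```python
# Hypothetical linkage attack: a "de-identified" medical dataset still
# carries quasi-identifiers that can be joined against a public record
# to re-identify individuals. All data here is invented.

deidentified_records = [
    {"postcode": "2000", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "3121", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "2000", "birth_year": 1990, "sex": "M", "diagnosis": "flu"},
]

public_register = [
    {"name": "Alice Smith", "postcode": "2000", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Jones", "postcode": "3121", "birth_year": 1982, "sex": "M"},
]

def reidentify(records, register):
    """Join the two datasets on the shared quasi-identifiers."""
    matches = []
    for rec in records:
        candidates = [
            p for p in register
            if (p["postcode"], p["birth_year"], p["sex"])
            == (rec["postcode"], rec["birth_year"], rec["sex"])
        ]
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], rec["diagnosis"]))
    return matches

print(reidentify(deidentified_records, public_register))
# -> [('Alice Smith', 'asthma'), ('Bob Jones', 'diabetes')]
```

Nothing here requires more than a join over two lists — which is the scholars' point about how low the skill barrier really is.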
The scholars contend that organisations need to invest their efforts in emerging techniques, such as differential privacy, and be prepared to make some trade-offs in utility and convenience in the interests of privacy.
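Differential privacy takes a different approach from scrubbing records: it adds calibrated random noise to the answers of queries over the data, so that no single individual's presence materially changes the result. Below is a minimal sketch of the canonical Laplace mechanism applied to a counting query; the function names and dataset are illustrative assumptions, not taken from either paper.

```python
import math
import random

def dp_count(records, predicate, epsilon):
    """Return a differentially private count of records matching predicate.

    A counting query changes by at most 1 when any one record is added
    or removed (sensitivity 1), so Laplace noise with scale 1/epsilon
    yields epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    # Sample Laplace(0, 1/epsilon) noise by inverse-transform sampling.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Illustrative query over an invented dataset: how many people are over 40?
ages = [23, 35, 41, 52, 29, 67, 44]
noisy_answer = dp_count(ages, lambda a: a > 40, epsilon=0.5)
print(noisy_answer)  # the true count plus random noise
```

The trade-off the scholars mention is visible in the `epsilon` parameter: a smaller value means more noise and stronger privacy, but a less accurate answer — utility is deliberately sacrificed for protection.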
In the absence of better alternatives, they argue that policy makers may have to “use legal agreements to limit the flow and use of sensitive data.”
Australian organisations are subject to the Privacy Act, which compels organisations to gain consent from users and to be explicit, at the point of collection, about why personally identifiable information is being collected.
The body set up to oversee these activities, however, was part of the Office of the Australian Information Commissioner, which has been disbanded by the Abbott Government. With few resources available, Australia's Privacy Commissioner has yet to wield new powers granted under amendments to the Privacy Act introduced in March.