Published on 31/07/2019 | Written by Heather Wright
Research shows anonymised data isn’t really anonymous…
If we needed any further evidence that privacy is something of a myth in this digital age, new research has shown that removing personal details from data doesn’t actually stop individuals being identified.
In what researchers from Imperial College London and the University of Louvain in Belgium say should be a wakeup call for policymakers, they’ve shown that ‘anonymised’ datasets, which remove personal details from data, can be reverse engineered to identify the individuals.
The researchers developed a machine-learning algorithm and found it could re-identify people from supposedly anonymous data with unnerving accuracy.
“Our paper shows that de-identification is nowhere near enough to protect the privacy of people’s data.”
Their research found that 99.98 percent of Americans would be correctly re-identified in any dataset using 15 demographic attributes, such as date of birth, marital status and gender.
First author Dr Luc Rocher of UCLouvain says “While there might be a lot of people who are in their thirties, male and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car and live with two kids (both girls) and one dog.”
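Rocher’s intuition – that each extra attribute shrinks the pool of people who match – can be sketched in a few lines of Python. The population, attribute values and target record below are synthetic and purely illustrative; they are not the researchers’ data or model:

```python
import random

random.seed(0)

# A synthetic population with a handful of demographic attributes.
population = [
    {
        "age_decade": random.choice(["20s", "30s", "40s", "50s"]),
        "gender": random.choice(["male", "female"]),
        "city": random.choice(["New York", "Boston", "Chicago"]),
        "birth_day": random.randint(1, 365),   # day of year
        "num_kids": random.randint(0, 3),
    }
    for _ in range(100_000)
]

# The attributes an attacker "knows" about one hypothetical person.
target = {"age_decade": "30s", "gender": "male", "city": "New York",
          "birth_day": 5, "num_kids": 2}

# Filter the population one attribute at a time and watch the
# candidate pool collapse.
matches = population
for attr in ["age_decade", "gender", "city", "birth_day", "num_kids"]:
    matches = [p for p in matches if p[attr] == target[attr]]
    print(f"after {attr}: {len(matches)} candidates")
```

Even with only five coarse attributes, 100,000 candidates typically collapse to a handful, which is why 15 attributes are enough to single out nearly everyone.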
The research has prompted the authors to call for policymakers to do more to protect individuals and to tighten the rules around anonymous data.
“Our paper shows that de-identification is nowhere near enough to protect the privacy of people’s data,” says co-author Julien Hendrickx from UCLouvain.
“Our results suggest that even heavily sampled anonymised datasets are unlikely to satisfy the modern standards for anonymisation set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model,” the paper says.
The researchers say the paper demonstrates that allowing data to be used – to train AI algorithms, for example – while preserving people’s privacy requires much more than simply adding noise, sampling datasets or applying other de-identification techniques.
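A back-of-envelope calculation illustrates why sampling alone offers little protection: if fewer than one person in the whole population is expected to share a given combination of attributes, then a matching record found in even a tiny released sample almost certainly belongs to the target. The population size and attribute frequencies below are rough assumptions chosen for illustration, not figures from the paper:

```python
# Rough estimate: expected number of people sharing one combination
# of demographic attributes (all frequencies are assumed, not measured).
population_size = 250_000_000  # rough adult population

attribute_fractions = {
    "gender": 0.5,                 # ~half the population
    "decade_of_birth": 0.15,       # share born in one decade
    "zip_code": 1 / 40_000,        # share living in one ZIP code
    "exact_birth_date": 1 / 3650,  # day within that decade
    "marital_status": 0.4,         # share with one marital status
}

expected_matches = population_size
for frac in attribute_fractions.values():
    expected_matches *= frac

print(f"expected people matching all attributes: {expected_matches:.2f}")
# Below ~1 expected match, any matching record in any sample -- however
# heavily sampled -- almost certainly belongs to the target.
```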
Governments and companies – from medical facilities to financial services – routinely collect personal data. The paper notes that even academic journals are increasingly requiring authors to make anonymous data available to the research community.
Data is sampled and anonymised – stripping identifying characteristics such as names and email addresses – in the belief that this ensures individuals can’t be identified.
Once that’s done, data protection laws worldwide consider the ‘anonymous’ data as not being personal data anymore and therefore able to be freely used, shared and sold to third parties such as advertising companies and data brokers.
But the report shows that once acquired, that data can often be reverse engineered using machine learning to re-identify individuals – potentially exposing sensitive information about people and enabling the buyers to build comprehensive personal profiles.
It questions the validity of anonymisation as a means of addressing privacy concerns in the wake of scandals such as the sale of Facebook data to Cambridge Analytica, and raises questions about the global market for data.
Senior author Dr Yves-Alexandre de Montjoye from Imperial’s Department of Computing and Data Science Institute, says the research shows just how easily, and accurately, individuals can be traced.
“Companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete.
“Our findings contradict this and demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for.”
The researchers have also published a demonstration tool to allow people to understand how likely they are to be traced. The tool, they note, does not save data.