Anonymous - we need some limits

I've been much annoyed recently by the woolliness of a lot of the thinking around anonymity, and thought it time to remind folks to look the word up in a dictionary before declaring that they have "anonymized the data".

Anonymous:

1. not identified by name; of unknown name
2. having no outstanding, individual, or unusual features

The recent New York taxi data release serves as a key example. Passing the taxi hack license number (identifying the driver) and the taxi medallion (identifying the vehicle) through an MD5 hash function simply changes the names - well, check for defn. 1. The underpinning data, however, was highly individual - fail on defn. 2. More accurately this should have been described as de-identified data, but anyone who had recorded their own journey in a cab could use it to start to re-identify the cabby and vehicle by correlating their journey with the open data. Say someone set up a website to share that information - we would have re-identification by crowd sourcing.
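To make the correlation step concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the row layout, the pseudonyms, the zone names and the `candidates` helper are assumptions, not the format of the real release. It only shows how a couple of remembered journeys can narrow a set of pseudonyms down to one.

```python
from datetime import datetime, timedelta

# Hypothetical released rows: (hashed_medallion, pickup_time, pickup_zone).
# The values are invented; the real data set has a different layout.
released_trips = [
    ("a3f1", datetime(2013, 5, 1, 8, 58), "Midtown"),
    ("9c0e", datetime(2013, 5, 1, 9, 2), "Midtown"),
    ("9c0e", datetime(2013, 5, 1, 18, 30), "Harlem"),
    ("77b2", datetime(2013, 5, 1, 9, 1), "Chelsea"),
]


def candidates(trips, seen_at, zone, tolerance=timedelta(minutes=2)):
    """Return the pseudonyms whose trips are consistent with one sighting."""
    return {
        pseudonym
        for pseudonym, pickup_time, pickup_zone in trips
        if pickup_zone == zone and abs(pickup_time - seen_at) <= tolerance
    }


# One remembered journey leaves two candidate cabs ...
first = candidates(released_trips, datetime(2013, 5, 1, 9, 0), "Midtown")
# ... and a second sighting of the same cab narrows it to one.
second = candidates(released_trips, datetime(2013, 5, 1, 18, 30), "Harlem")
match = first & second
print(match)
```

The pseudonym never has to be reversed for this to work: once it is tied to a cab someone actually sat in, every other row carrying that pseudonym is tied to that cab too.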

(In fact the failure was worse than that - the de-identification algorithm only had 18.4m possible inputs for drivers and 1.3m for vehicles. No human would want to take the time to sift through that lot, but a computer can hash every one of them and match the results in the blink of an eye.)
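A rough sketch of why a small input space is fatal, using Python's standard `hashlib`. The six-digit numeric ID format here is an assumption chosen to keep the example simple; the real hack license and medallion formats are different, but the approach is the same: hash every possible input once and keep a reverse lookup table.

```python
import hashlib


def md5_hex(value: str) -> str:
    """MD5 of an ASCII string, as a lowercase hex digest."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def build_lookup(id_space):
    """Map each digest back to the ID that produced it."""
    return {md5_hex(identifier): identifier for identifier in id_space}


# Assumed ID format for illustration only: six digits, zero padded.
id_space = (f"{n:06d}" for n in range(1_000_000))
lookup = build_lookup(id_space)

# A "de-identified" value as it might appear in a published row.
published = md5_hex("123456")
recovered = lookup[published]
print(recovered)
```

Building the table for a million candidates takes on the order of a second on an ordinary laptop, and after that every published hash is a single dictionary lookup.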

Much fuss was kicked up about care.data recently. Many factors contributed, but one that often receives attention is re-identification. I'm sure this is annoying to the authors of the riveting page turner "Anonymisation Standard for Publishing Health and Social Care Data Specification", who have gone to considerable lengths to consider a vast array of possible re-identification attacks. Those still concerned, however, point out that we can't conceive of all the future data sources that could be correlated to enable re-identification, and so argue that the risk is too high to accept.

It does call out for a new angle of research on the problem - can we establish, for a given anonymization technique, some theoretical limit on the granularity of re-identification that would be possible?

We've been looking recently for inspiration from some of our related work within Horizon, which has looked at the predictability of human mobility from GPS traces (open access PDF). The work aims to provide a limit on what could possibly be achieved in terms of mobility prediction, no matter how cunning we become in future. And of course location data is in itself one of the most concerning privacy-violating data sources, one that many people unknowingly and continually stream to random third parties from their smart phones.

In any case we need to:

1. remind ourselves about the definition of anonymous regularly
2. somehow get beyond the unquantified risk argument...

Written on July 30, 2014