Likewise, Quantium, the organisation that helped UBank build the People Like U application, has argued it only uses “de-identified data”. Woolworths, which now owns a 50 percent share of Quantium, makes similar claims.
Australia’s Privacy Act currently only covers personal information, effectively exempting de-identified data and its extensive use by Australian corporations.
But data experts warn we don’t yet know if data aggregation is an effective way of protecting private information, nor do we have a clear concept of just what is personally identifiable data.
“If a company comes to you and says, ‘Look, we have removed all personally identifiable information from these data sets’, the claims should be taken very skeptically,” says Tiberio Caetano, principal researcher of NICTA’s machine learning group.
“Whenever people talk about aggregating data in order to avoid breaches of privacy, over and over again we discover you can actually unbury identities of people in those situations.”
In one well-known case in the US, researchers from the University of Texas were able to re-identify individual Netflix subscribers by using a data set of 500,000 “anonymous” Netflix subscribers that had rated movies online, and matching it against a separate set of public data.
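The general idea behind this kind of matching can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the researchers' actual method: every record, the function name `best_match` and the `min_overlap` threshold are invented for the example.

```python
# Toy linkage sketch: match an "anonymous" ratings record against public profiles.
# All data below is made up for illustration.

def best_match(anon_ratings, public_profiles, min_overlap=3):
    """Return the public profile sharing the most (title, stars) pairs with
    the anonymous record, or None if the evidence is too weak or tied."""
    scores = {
        name: len(anon_ratings & ratings)
        for name, ratings in public_profiles.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_name, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if top_score >= min_overlap and top_score > runner_up:
        return top_name
    return None

# Hypothetical "de-identified" subscriber records: no names, only ratings.
ANONYMOUS = {
    "user_0417": {("Film A", 5), ("Film B", 2), ("Film C", 4), ("Film D", 1)},
    "user_0933": {("Film A", 3), ("Film E", 4)},
}

# Hypothetical public reviews posted under identifiable profiles.
PUBLIC = {
    "alice_reviews": {("Film A", 5), ("Film B", 2), ("Film C", 4)},
    "bob_reviews": {("Film A", 3), ("Film F", 5)},
}

for anon_id, ratings in ANONYMOUS.items():
    print(anon_id, "->", best_match(ratings, PUBLIC))
```

In this toy data the first record shares three ratings with a single public profile and is linked to it, while the second shares only one and is left unmatched. The point is that removing names does not help if the remaining pattern of behaviour is distinctive.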
Consumers never really know what happens with their data once it is collected by a company, says NICTA researcher Arik Friedman, who specialises in privacy and data mining.
“Maybe they don’t do anything with it, maybe they’re just using it internally, or maybe they’re using it in conjunction with data collected from other sources and combining it to learn more about us, using information in ways that are not really clear to us.”
UBank’s People Like U combines Australian Bureau of Statistics Census data with NAB and UBank customers’ “de-identified” transaction records, meaning that names, addresses and account numbers are removed.
However, it asks users to input their gender, age range, income range, living situation, postcode and whether they rent or own their home.
Depending on the size of the postcode they live in, it doesn’t take long to drastically reduce the number of people a consumer is comparing themselves with.
UBank stops at “less than 10” in order to make it harder for individuals to be identified, but data experts say this could be reduced to fewer than three by combining it with readily available data, such as the electoral roll or the White Pages.
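A rough sketch of why each extra answer shrinks the comparison group so quickly is shown below. It assumes an entirely synthetic population of 400 people and a hypothetical reporting floor; it is not UBank's or Quantium's code, and the function names are made up for the example.

```python
# Synthetic illustration: each attribute a user supplies narrows the group.

def make_population(size=400):
    """Build a made-up population with a few demographic attributes."""
    ages = ["18-24", "25-34", "35-44", "45-54", "55+"]
    postcodes = ["2000", "2010", "2020", "2030"]
    return [
        {
            "gender": "F" if i % 2 == 0 else "M",
            "age_range": ages[i % 5],
            "postcode": postcodes[i % 4],
            "tenure": "rent" if i % 3 == 0 else "own",
        }
        for i in range(size)
    ]

def matching(records, **known):
    """Return the records consistent with every attribute supplied so far."""
    return [r for r in records if all(r[k] == v for k, v in known.items())]

def report_size(count, threshold=10):
    """Hide exact counts below the threshold, as a 'less than 10' floor would."""
    return str(count) if count >= threshold else f"less than {threshold}"

population = make_population()
answers = {}
for field, value in [("gender", "F"), ("age_range", "25-34"),
                     ("postcode", "2000"), ("tenure", "rent")]:
    answers[field] = value
    group = matching(population, **answers)
    print(f"after {field}: {report_size(len(group))}")
```

With this invented data the group falls from 200 to 40 to 20 and then below the floor after four answers. The floor hides the exact number, but it does not stop someone with an outside list of names and addresses from narrowing the same group further.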
“Just by visiting this website you’re also providing some information to UBank, so you learn that a person from this IP address probably has these demographics,” Friedman says.
Friedman says popular retail loyalty schemes such as those offered by Woolworths and Coles simply provide data aggregators with more information that could lead to a consumer’s information being used in a different context than originally intended.
“This makes learning things about you or re-identifying your information in other places easier – it just provides more data to work with.”
Share and share alike
NAB confirmed to iTnews that it shares transaction data with Quantium.*
And Quantium itself boasts on its website that it utilises client data, often combined with external data sources and its own proprietary data, to “maximise the insights available from a multi-dimensional view”.
In addition to NAB, Quantium counts banks Westpac and ANZ, and health insurance providers Medibank and Bupa as clients.
It doesn’t take long to work out the implications of combining health insurance data with retail spending data, says Caetano.
“Suddenly they know what you eat and they can price the premium of my insurance in a much more personalised way which may be good or bad, we don’t know.”
Caetano says data should be thought of as a “projection” of who consumers are.
For example, he says, through interaction with one particular merchant a consumer may give one projection of themselves, and via interaction with another merchant they may give a complementary projection.
“If someone happens to have access to both those things, you can really improve the picture of who they are.
“You can collect fully aggregated data and several other fully aggregated data sets that include a person and you can often tease out the identity.”
Not so personal
Even if data is “de-identified”, Caetano says there’s still plenty of information that can be gleaned about individuals by joining together certain pieces of information.
“If you tell me your gender, immediately I can tell you a lot of things about you. If you tell me you are female I can tell you the chances of you being over 1.9 metres high are smaller than one percent.
“Now if you tell me your gender and in addition to that you tell me your age, I can improve estimates of your height.
“Once you start increasing projections of who you are, we start to have a very accurate picture of who you are, even if I do not know your identity.”
Friedman agrees. “You can still learn a lot about a person just from knowing the fact that they belong to a subset of a population.”
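The height example can be illustrated with a small sketch. The records below are invented and far too few to say anything about real people; the `estimate` function and the field names are assumptions made for the example. It only shows that each extra attribute narrows the range of plausible values.

```python
# Made-up records showing how extra attributes tighten an estimate.
from statistics import mean

RECORDS = [
    {"gender": "F", "age": "adult", "height_cm": 162},
    {"gender": "F", "age": "adult", "height_cm": 166},
    {"gender": "F", "age": "adult", "height_cm": 164},
    {"gender": "F", "age": "child", "height_cm": 120},
    {"gender": "F", "age": "child", "height_cm": 130},
    {"gender": "M", "age": "adult", "height_cm": 176},
    {"gender": "M", "age": "adult", "height_cm": 180},
    {"gender": "M", "age": "adult", "height_cm": 178},
    {"gender": "M", "age": "child", "height_cm": 125},
    {"gender": "M", "age": "child", "height_cm": 135},
]

def estimate(records, **known):
    """Return (average height, spread) among records matching what is known.
    Spread is simply tallest minus shortest in the matching group."""
    heights = [
        r["height_cm"] for r in records
        if all(r[k] == v for k, v in known.items())
    ]
    return mean(heights), max(heights) - min(heights)

print("nothing known:   ", estimate(RECORDS))
print("gender known:    ", estimate(RECORDS, gender="F"))
print("gender and age:  ", estimate(RECORDS, gender="F", age="adult"))
```

In this toy data the spread of possible heights drops from 60cm with nothing known, to 46cm with gender, to 4cm with gender and age, without a name ever being attached to any record.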
*An earlier version of this story stated that UBank shares information with Woolworths. NAB has since said that while it shares data with Quantium, it does not share the data with Woolworths.
Understanding the problem
As consumers continue to trade privacy for convenience, organisations are getting better at targeted promotions. Is this really such a bad thing?
The problem, Friedman admits, is that privacy protection technologies are still catching up with the data analysis process.
“There is a conflict between the desire to extract useful knowledge and still maintain privacy of individuals.
“It’s very hard to strike this balance and in some cases we have to accept there might be some privacy loss if we want to get a benefit from using the data for good.”
Friedman cites medical research as one field that has been grappling with this issue for years. More recently though, it is how Australia’s largest companies use the increasing amount of data being made publicly available that will throw up new questions about just what is “good”.
Caetano says the central idea behind big data will be one of ethics, not technology.
“It turns out ethics needs to be brought to the core of every single discussion of modern society because we are now observing the data is not just about stuff, the data is about people.”
In this environment, Caetano says, consumers are unable to anticipate what one seemingly small piece of data they release about themselves could ultimately lead to.
“Whenever any piece of data about yourself is being collected, you are not really releasing only information that is directly associated with that piece of data, but any potential information that is correlated with that particular piece of data.”
Caetano says this is the new paradigm that both consumers and businesses need to understand.
“Given this understanding and this awareness we need to think about how we can build a society, build mechanisms, institutions, organisations and legislation that actually fully embrace and understand this truth, rather than pretending it doesn’t exist.”