*** Welcome to piglix ***

K-anonymity


k-anonymity is a property possessed by certain anonymized data. The concept of k-anonymity was first introduced by Latanya Sweeney in a paper published in 2002 as an attempt to solve the problem: "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful." A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release. The various procedures and programs for generating anonymised data providing k-anonymity protection have been patented in the United States (Patent 7,269,578).

In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population. The following table is a nonanonymized database consisting of the patient records of some fictitious hospital in Kochi.

There are 6 attributes and 10 records in this data. There are two common methods for achieving k-anonymity for some value of k.

The next table shows the anonymized database.

This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes. The attributes available to an adversary are called "quasi-identifiers". Each "quasi-identifier" tuple occurs in at least k records for a dataset with k-anonymity.

Meyerson and Williams (2004) demonstrated that optimal k-anonymity is an NP-hard problem, however heuristic methods such as k-Optimize as given by Bayardo and Agrawal (2005) often yields effective results. A practical approximation algorithm that enables solving the k-anonymization problem with an approximation guarantee of O(log k) was presented by Kenig and Tassa.

Because k-anonymization does not include any randomization, attackers can still make inferences about data sets that may harm individuals. For example, if the 19-year-old John from Kerala is known to be in the database above, then it can be reliably said that he has either cancer, a heart-related disease, or a viral infection.


...
Wikipedia

...