July 26, 2022
Volume 20, issue 3

Download PDF version of this article PDF

Privacy of Personal Information

Going incog in a goldfish bowl

Sutapa Mondal,
Mangesh S. Gharote,
and Sachin P. Lodha
TCS Research, Tata Consultancy Services Ltd.

Most people go through life identified by name and gender. These are their public attributes, and they are usually willing to reveal them. In certain contexts (e.g., in a doctor's office), they may disclose personal details such as age, height, and weight even though these attributes may not be generally known in public. A doctor may know details about a patient's body and mind that even the patient may not know (or understand), but that patient may not welcome an attempt by a friend, even a close friend, to find out about their medical condition. Similarly, a doctor's interest in a patient's political beliefs may be inappropriate.

Some personal information may need to be made available to certain groups of people and still be protected from unwanted view. The groups could include medical investigators—say, those studying effective forms of treatment or tracking the spread of communicable diseases, who need such data for their research. Privacy remains important because this unwanted view is just the tip of the iceberg as far as misuse of personal information is concerned, and this creates a tension between the obvious advantages of such work and the potential disadvantages of the details falling into the wrong hands.

One way to protect privacy is not to reveal anything at all, but that is hardly practical. Another way is to define the contexts in which certain attributes may be revealed. Your family may know a great deal more about you than most other people do, but you expect such personal information to remain confined to members of the family. People usually have no objection to being part of national statistics in terms of age, gender, city, state, first language, and so on, because they may believe that this kind of information cannot be used to reveal identity: In other words, the information is sufficiently anonymized. But can such anonymization be compromised?

The notion of privacy has evolved over time. In 1977, Tore Dalenius articulated a desideratum stating that nothing about an individual should be learnable from a database that cannot be learned without access to the database.⁸ He raised concerns about the disclosure of data from statistical datasets. In 1980, Ruth Gavison emphasized that the privacy of an individual can be viewed as "hiding in the crowd."¹¹ Later, Helen Nissenbaum introduced the contextual integrity framework for privacy.¹⁶ It stated that information gathering and dissemination be appropriate to the context and comply with the norms governing the distribution of that information. These viewpoints became the foundation of the privacy-preserving techniques discussed here.

Privacy's Grand Challenge

Swara regularly visited Asha Hospital for medical treatment. She was told about a mobile healthcare app that would enable her to see the doctors' visiting schedules, book appointments, and make online payments. After entering her personal details, she could view her medical history and laboratory reports from past examinations. For Swara, the app provided a convenient way to interact with the hospital and manage her health records—and it helped the hospital to administer, manage, and serve its patients.

A closer look at the data stored by Asha Hospital

Asha Hospital uses a database to maintain the records of its patients. The database consists of multiple tables that capture patients' personal information, medical history, and other details required by the hospital. Figure 1 shows a database table with patients' details, such as NID (national ID number), name, ethnicity,⁴ date of birth, gender, postal code, marital status, and the diagnosed disease. Attributes such as NID and name can uniquely identify any patient and hence are referred to as PII (personally identifiable information). Disease is a sensitive attribute because people typically do not like to make their ailments public.

Privacy of Personal Information: Going incog in a goldfish bowl

The data stored in the hospital database is of tremendous importance as it can serve several purposes, such as studying and improving the efficacy of medications and monitoring and limiting the spread of diseases. Unauthorized disclosure of medical data, however, can violate patient privacy, and the violation can have financial, mental, and social impacts. For example, some diseases such as HIV or Covid may carry a social stigma, and disclosure can severely affect the ability of the patient to lead a normal life. PII and other personal data can be used for identity theft, and this may take a great deal of time to remedy. Leaks of financial information, such as credit card details, can lead to fraudulent online payments. Thus, both security and privacy of data are critical.

SWARA WONDERS: Is it safe to share personal details on this web app? Apart from information about the disease, why is other information being collected as well? Who has access to this data? What if the hospital were to share her data with a third party?

Betaal, the Malevolent

Betaal is a person with malicious intent who is on the prowl for sensitive information. Let's look at the varying degrees of leaks with which Betaal could intrude on Swara's privacy:

If Betaal learns that Swara's record is in the dataset, the leak is known as MSD (membership disclosure). This can have serious consequences, especially if she is suffering from an illness that she does not want to be made public. Similarly, MSD in a credit defaulter's list could have a severe effect on the concerned individual's ability to get loans or execute financial transactions.
If Betaal learns about the disease that Swara suffers from, the leak is known as SAD (sensitive attribute disclosure). With SAD, any ambiguity left with MSD is removed. Since Betaal knows the actual value(s) of the sensitive attribute(s), he may use this information to harm Swara. Thus, the consequences of SAD can be more severe than those of MSD.
If Betaal has access to Swara's entire record in the dataset, the leak is known as IDD (identity disclosure), the highest degree of disclosure, and it can have even worse consequences than those of MSD and SAD. For example, IDD can result in identity theft and allow Betaal to use Swara's personal information to carry out fraudulent or malicious activities on the web, knowing that any attempt to trace the perpetrator of such activities would lead the investigators to Swara. Medical identity theft would also allow Betaal to make use of healthcare services intended for Swara.

Disclosure risk

The amount of information Betaal can access can lead to different types of disclosure. In any dataset, these disclosures are closely related and can be ordered by severity as follows:

MSD << SAD << IDD

The likelihood of their occurrence follows the reverse order: The likelihood of MSD is greater than that of SAD, which in turn is greater than the likelihood of IDD. Each is subsumed in the other, in the given order. Since risk is defined as the "likelihood of disclosure" times the "impact of disclosure," then if disclosure does occur, any one of these disclosures could potentially carry a bigger risk than the other two, depending on the overall context.

SWARA WONDERS: How could Betaal access her data? Do privacy guidelines exist for sharing data?

Each disclosure allows Betaal to intrude on Swara's privacy, either directly or indirectly. Betaal could be either an authorized or unauthorized user of the system.

As an unauthorized user, he can access the data for private gain. In every organization, be it a hospital or a corporate office, there should be sufficient security controls to prevent unauthorized users from accessing data, thereby protecting personal data and preserving the privacy of an individual.
As an authorized user, Betaal could be a member of the hospital staff, but still should not be given access to all of Swara's personal data.

This raises a major question: What data can be made available to an authorized user, so (a) this user can carry out necessary work, while also (b) ensuring that Swara's privacy requirements are met?

In considering disclosure risks, it is important to note the following:

In the context of security, the user and attacker are assumed to be different; however, in the context of privacy, a legitimate user may also be an attacker.
For security, the malicious intent of an unauthorized user can be mitigated by security controls, whereas for privacy, a different set of controls is needed.
A good security framework is not enough to ensure privacy: Security takes care of the "bad guys" (i.e., unauthorized users), whereas privacy is also concerned about the "good guys" (i.e., authorized users) who have legitimate access to the data.

The data juggle

Is there some way to make only part of the data available to Betaal and mitigate the risks of MSD, SAD, and IDD?

What if PII is removed?

In figure 2, the PII attributes of NID and name are removed. In practice, the removal may be implemented as "replacement by fictitious values" to satisfy database or application constraints and requirements. This seems like a promising solution since neither the possible membership in the dataset nor the identity is revealed.

Will this preserve Swara's privacy? Sadly, that is not the case. In the late 1990s, GIC (Group Insurance Commission), which provided health insurance to Massachusetts state employees, used exactly this strategy for privacy protection. Latanya Sweeney, then a Ph.D. student at Massachusetts Institute of Technology, was able to identify most of the data subjects by taking the "join" of the publicly available voter registration data with the healthcare data provided by GIC.¹⁹ The attack resulted in identity (and membership, as well as sensitive attribute) disclosure. Therefore, removing PII alone does not guarantee privacy.

What if sensitive data is further shuffled?

Swara may not mind Betaal knowing that she is being treated in Asha Hospital—that is, she may be fine with MSD. More important for her could be to safeguard against SAD: to make sure that Betaal does not discover her disease. To overcome this, column-wise shuffling of the sensitive values (figure 3) seems to be a possible solution. This does not work, however, because even though the values are shuffled, their distribution remains the same, and that may be enough for Betaal to infer Swara's disease with high confidence. This is particularly true when the data is unevenly distributed.

What if everything is removed?

Replacing all the records with synthetic data that approximates the original data (figure 4) safeguards privacy because such data does not point to any individual in the real world. Synthetic data preserves both structural and characteristic properties of the original data. With techniques such as GAN (generative adversarial network), it is now feasible to generate synthetic data having the properties of real-world data. But its limitation is a loss in accuracy of the results obtained; hence, the approach is not always suitable for some applications. For example, machine-learning models for renal cell carcinoma prediction would require very high accuracy and precision. In such a scenario, relying completely on synthetically generated data may pose different vulnerabilities.⁵

Tug of war: privacy vs. utility

The other side of the privacy coin is data utility. There is a productive purpose behind providing data access to an authorized user. Therefore, attention should not be limited to the question, how anonymous is the anonymized data?, but should also address the other question, how meaningful is the anonymized data? The challenge of privacy is finding a balance between fully disclosed data and completely withheld data. Figure 5 shows that when data is shared "as is," its utility is maximum but privacy is minimum, and when nothing is disclosed, privacy is maximized but utility is limited.

The ideal solution would be for both privacy and utility to be at their maximum possible levels. Owing to the conflicting requirements of privacy and utility, achieving the ideal solution is challenging. There is a strong need for privacy-preserving techniques that can provide a balance between privacy and utility. Use of such techniques would allow Swara to methodically share her personal data that is useful in a specific context and prevent Betaal from breaching her privacy.

SWARA WONDERS: What are different privacy-preserving techniques? How are these techniques applied in practice?

K-anonymity: An Overview

Traditional methods such as randomization, shuffling, and data swapping have been able to preserve privacy to an extent, but the risk of data disclosure still exists. K-anonymity is a privacy-preserving technique that tends to overcome the limitations of traditional methods.

SWARA WONDERS: If Asha Hospital shares her data, will her privacy be preserved?

What if Swara's record were made to look the same as the other 99 records in the dataset? Swara's record would no longer be identifiable among those 100 records. Generalizing this 100 to k makes it difficult for Betaal to reidentify Swara among k similar records; this is like hiding Swara in the crowd. This notion of privacy, highlighted by Ruth Gavison in the 1980s, later emerged as an idea behind k-anonymization, which makes k records look similar in a dataset (i.e., data about every individual is hidden among k similar records).

SWARA WONDERS: What led to k-anonymity? How can one achieve it?

In working with healthcare data provided by GIC in Massachusetts, Sweeney observed that many patients had a unique combination of attributes in the deidentified dataset released by the hospital. Further, she observed that these three attributes were listed along with names and addresses in publicly available voter registration data. Sweeney then linked the data released by the hospital with the voter registration data and was able to reidentify sensitive health information of even the governor of Massachusetts. This is known as a linkage attack, in which records in a deidentified dataset can be reidentified by uniquely linking them with identified records in a publicly available dataset. Sweeney showed that merely removing PII is not sufficient to preserve privacy.

Sweeney also observed that with just attributes, 87 percent of the U.S. population is uniquely identifiable. Such attributes are referred to as quasi-identifiers, and when combined can act as PII. Figure 6 illustrates certain categories of attributes that are relevant to this discussion: identifiers (NID, name), quasi-identifiers (date of birth, gender, postal code), and sensitive attribute (disease).

Sweeney's linkage attack showed that publishing a record with quasi-identifier(s) is as bad as publishing it with explicit identity (PII) attributes. An attack aimed at IDD, assuming the attacker already knows that the targeted individual is part of the data, requires only the corresponding record to be identified. This assumption of prior MSD in the attack model makes it even easier for the attacker to uncover the identity of a given individual who is part of the dataset. Owing to the possibility of such attacks and limitations in the practice of personal data protection and privacy, researchers started looking for better methods. K-anonymity was introduced as a privacy-preserving technique by Samarati and Sweeney¹⁷, especially to mitigate such linkage attacks.

Published data is said to have the k-anonymity property if the information for an individual cannot be distinguished from k-1 other individuals whose information also appears in the data.¹⁷

A closer look at k-anonymity

Figure 7 illustrates an example of 3-anonymization of relational data (where k = 3) so the transformed quasi-identifiers must appear in at least three records. The transformation of quasi-identifiers can be carried out using techniques such as generalization and suppression. For suppression, some or all of the values of an attribute may be replaced by ^. In figure 7, all the values of the name attribute are suppressed. For generalization, on the other hand, individual values of attributes are replaced by a value representing a much broader range or category. For example, the postal code attribute has been generalized using a single strategy of replacing the last three digits represented by *, whereas for the date of birth attribute, multiple strategies have been used: In some cases the month has been generalized; and in others, the month, the date, and the year have been generalized. In practice, instead of * or ^ as represented here, the values are replaced by random values from the domain of the attribute values.

Generalization can be done using taxonomy trees, as illustrated in figure 8. The taxonomy tree (figure 8a) defined for the postal code shows a generalization hierarchy. The lower level of the taxonomy tree represents district information with the last three digits suppressed, whereas the higher levels depict the region and zone. Similarly, figure 8b shows a taxonomy tree for generalization of the age attribute so a value of age (say, 22) in a dataset can be generalized to any random value in the range [21-25].

Higher generalization allows more records to be mapped and hence a higher level of privacy to be achieved, although this might significantly impact data utility. Further, generalizing all records using a single strategy for an attribute may not be the best strategy. For example, if a dataset has many records with similar demographics, then grouping some records by generalizing to the region level (postal code: 41****) and others to the district level (postal code: 411***) would prove to be a better strategy for preserving privacy and enhancing utility of the data. Thus, you can make local changes rather than only global changes to the attributes.

This privacy-preserving transformation of data is referred to as recoding. In global recoding, a particular detailed value must be mapped to the same generalized value in all records. Local recoding allows the same detailed value to be mapped to different generalized values in each anonymized group.

Can optimal k-anonymity be achieved by modifying a minimum amount of data? It has been proved that achieving optimal k-anonymity is an NP (nondeterministic polynomial time) hard problem for multidimensional data.¹⁵ Various researchers have proposed approximation algorithms to achieve near-optimal k-anonymity.¹

Extensions to k-anonymity

Let's take a deeper look at the k-anonymized table (figure 9), which shows that all k (= 3) individuals with the postal code generalized to the value (520***) have the same sensitive value (in this case, they all suffer from heart disease). Although k-anonymization of data prevents linkage attacks and an attacker cannot link to other databases with a high degree of certainty, it may still reveal sensitive information. This is known as a homogeneity attack, where all k individuals have the same sensitive value.

Similarly, if an attacker has additional information about an individual, it may be possible to reidentify the record with high probability, leading to a background knowledge attack. Figure 9 shows that if an attacker knows the date of birth, postal code, and family history of a person (say, Elena), then the attacker can guess Elena's record with high probability. Thus, k-anonymity does not give any scientific guarantees against such attacks.

Moreover, the choice of k for an acceptable level of k-anonymity poses another challenge. Information is lost during generalization or suppression of records for achieving k-anonymity—the higher the generalization, the lower the utility.

To overcome some of these drawbacks, different variants of k-anonymity have been proposed. l-diversity is one such variant, in which any sensitive attribute should have l distinct values in every group. This ensures that sensitive attributes are well represented, but it also involves suppressing or adding rows that would alter the distribution of data. Such suppression or addition raises concerns over the statistical validity of conclusions drawn from the dataset.

These shortcomings lead to t-closeness, an extension of k-anonymity and l-diversity: t-closeness captures the notion that the distribution of a sensitive attribute in any k-subset is not only l-diverse but also close to the distribution of the attribute in the overall dataset. Further, the distance between the two distributions is measured by the threshold t. These extensions of k-anonymity cater to some of the limitations but not all. For example, the dimensionality of data continues to be a challenge: For data with high dimensionality such as a time series, providing the same privacy guarantees as for low-dimensional data becomes quite hard.

Many organizations have come up with privacy solutions and tools using k-anonymity as one of the privacy-preserving techniques. K-anonymity has gained much of its visibility from privacy-aware data publishing scenarios. There is also research on performing classification and data-mining tasks using k-anonymity.⁶ Further, the application scope of k-anonymity has been extended beyond relational databases to anonymizing combinatorial structures such as graphs.¹⁸

Further discussion

This section examines the choice of k in k-anonymity, some practical aspects of publishing anonymized data, quasi-identifiers, the ideal amount of generalization to achieve the desired anonymization, and how to k-anonymize efficiently.

The right choice of k

In the United States, HIPAA (Health Insurance Portability and Accountability Act) sets the standard for sensitive patient data protection. The HIPAA "safe harbor" model requires removal of geographic subdivisions smaller than a state, except for the initial three digits of the ZIP code if the geographic unit formed by combining all ZIP codes with the same initial three digits contains more than 20,000 people. Therefore, groups formed by generalizing ZIP codes with the first three digits will have at least 20,000 people.

Consequently, HIPAA defined 20,000 as the standard value of k for k-anonymity. Another U.S. act, FERPA (Family Education Rights and Privacy Act), sets the standards for protecting the personal information of students and their families. FERPA recommends the value of k as 5 or 10 to prevent disclosure. This shows how two standards—for health (HIPAA) and education (FERPA)—in the same country differ in their choice of k.

The choice of k is predefined for applications governed under these regulatory mandates. For applications with no regulatory requirement, however, choosing k to provide the right level of privacy versus utility tradeoff is a challenge. One approach to choosing k could be to vary its value over a range of values and determine the change in generalized information loss (a measure of utility) of the dataset. Then the value of k corresponding to the acceptable generalized information loss would be the appropriate choice. That said, finding the best value of k is still an open question, and researchers have proposed several approaches, such as probabilistic models and multiobjective optimization models.

Identifying quasi-identifiers

Identification of quasi-identifiers¹⁴ is a primary problem because it has a direct impact on the effectiveness of the k-anonymity method. Figure 10 illustrates the number of records identifiable for varying sets of attributes that could be potential quasi-identifiers. For example, if the table consists of only the Ethnicity column (A), then only one record (Punjabi) is unique. When column B (Gender) is added, three records become unique (identifiable). Similarly, adding column C (Postal code) makes two more records unique. Hence, with the increase in information, a larger number of records becomes identifiable.

The choice of quasi-identifiers can become more complex with an increase in the dimensionality of data. The problem also becomes more challenging with the uncertainty in the additional data being published by others. In that case, some of the published attributes must be considered as quasi-identifiers.

The ideal amount of generalization to achieve desired anonymization

The ideal degree of generalization depends on publicly available information. Public- and private-sector organizations in many countries publish information in the public domain to achieve greater transparency and to make it easier for citizens to access their data. These organizations may inadvertently publish information that should not be made available. This provides an opportunity for private aggregators to misuse such information. Hence, organizations that publish data about citizens must apply extreme generalization to prevent reidentification through linkage attacks.

Efficiently achieving k-anonymity

Researchers have demonstrated that multidimensional k-anonymity is an NP-hard problem.¹⁵ There exist, however, approximation algorithms that can achieve k-anonymity but are not scalable.¹ On the other side, a probabilistic approach to k-anonymity using dynamic programming provides a time-optimal algorithm for k-anonymity.¹⁴ Heuristic methods such as k-optimize also yield effective results.³ With the current emphasis on AI-driven analytics, however, a visible change in the definitions of privacy and data protection has taken place, which demonstrates the need for privacy-preserving techniques that provide much stronger guarantees and offer a wider scope for different applications.

Takeaways

Linkage attacks show that removing identifiers alone does not preserve privacy. Hence, k-anonymity has emerged as a prominent privacy-preserving technique. Here, generalization is performed over true information, which makes it more acceptable than other strategies. Also, k-anonymity and its variants can limit linkage, homogeneity, and background attacks. From an industrial standpoint, k-anonymity has gained wider visibility because of its acceptance in regulatory compliance, especially in the U.S.

K-anonymity does have some drawbacks, such as MSD and information loss. Also, generalization requires a taxonomy tree for every quasi-identifier in the dataset, and this requires the intervention of domain experts even if the taxonomies are auto-generated. Further, the level of generalization for every attribute may vary depending upon the use case.

This has motivated researchers to come up with purpose-driven strategies such as assigning weights to every attribute to measure their relative importance and accordingly generalize the attributes,¹⁴ or performing multidimensional suppression so values are suppressed only on certain records depending upon other attribute values.¹³ With the advancement in compute power and the availability of digital datasets, however, the risk of reidentification exists. These limitations are motivating researchers to look for better privacy-preserving techniques.

Differential Privacy: An Overview

Suppose that before sharing data, some noise is injected, or a synthetic dataset is created that has the same statistical properties as the original dataset. Then, most likely, privacy can be preserved. This section introduces DP (differential privacy) as a privacy-preserving technique.

Cynthia Dwork introduced DP to protect an individual's privacy by injecting carefully calibrated random noise to make the data nontruthful.¹⁰ The ingenuity of DP lies in allowing meaningful analyses to be drawn from the dataset while preserving the privacy of individuals. The original motivation behind DP, however, can be traced back to Dalenius's notion of privacy concerns over statistical datasets: "Learning anything about me should be hard" (i.e., it should be difficult to learn anything about an individual without direct access to the database).

SWARA WONDERS: How does differential privacy work? How does DP make it safe for third parties to perform data analysis?

In a typical DP setup, the data curator, who is assumed to be trustworthy and acts as the central body, holds the data of individuals who make up the database. With a trusted curator, DP can operate in one of two modes: an online or interactive mode; or an offline or noninteractive mode.

In the interactive mode (figure 11), the data analyst queries the database adaptively (i.e., modifies successive queries depending on the responses to earlier queries). A query is a function applied to the database—for example, how many people in the database have a drug addiction? Figure 11 shows how a DP mechanism is interposed between the database and the data analyst. Each query from an analyst generates a noisy response, thereby preserving privacy.

Another safety strategy is to operate in offline or noninteractive mode (figure 12). The curator produces a synthetic database using a DP mechanism that has the same statistical properties as the original dataset. After releasing the data, the curator plays no further role, and the original data may even be destroyed. Thus, with a synthetic database, reidentifying an individual becomes difficult. Further, such synthetic data can be shared for performing quality analyses.

SWARA WONDERS: What is a typical DP mechanism? Does DP provide any formal guarantees of privacy?

A closer look at differential privacy

Consider an algorithm that analyzes a dataset and computes statistical properties such as the mean, variance, median, and mode. Such an algorithm is said to be differentially private if, by looking at the output, you cannot tell whether any individual's data is included in the original dataset. In other words, the guarantee of a differentially private algorithm is that its behavior hardly changes with the absence or presence of an individual in the dataset. Figure 13 shows that a DP mechanism F produces similar output distributions when applied on the database with and without candidate D. Most notably, this guarantee holds for any individual and any dataset. Thus, regardless of how unique an individual's details are, and of the details of anyone else in the database, the guarantee of DP still holds.

Mathematically, DP can be defined as follows: A randomized function M gives ε-differential privacy if, for all datasets, D1 and D2 differ on at most one element, and all S ⊆ Range(M). Thus:

Pr[M(D1) ε S] ≤ exp(ε) x Pr[M(D2) ε S]

The distribution of the curator's output M(D1) on the database D1 is nearly the same as M(D2) on database D2, where the databases D1 and D2 differ by only one individual's record and M is a randomized algorithm guaranteeing ε-differential privacy: ε determines the indistinguishability of two databases D1 and D2 (i.e., the deviation in the response to a query over both the databases is governed by ε). This gives a formal guarantee that individual-level information about participants in the database will not be leaked. DP avoids MSD, as well as making it difficult for other disclosure risks (SAD and IDD) to take place.

The crucial feature of DP is that it defines privacy as a quantifiable measure using the parameter ε and not as a binary condition, such as whether the data of an individual was leaked. Essentially, ε determines how much noise is added to the computation, so it can be treated as a tuning knob for balancing privacy and utility. Each differentially private analysis can be tuned to provide more, or less, privacy.

SWARA WONDERS: How is this guarantee achieved in software systems? How is DP applied?

Differentially private algorithms, or DP mechanisms, are randomized algorithms that add noise at key points. The Laplace mechanism can be used to make aggregate queries (e.g., count, sum, means, etc.) differentially private. This uses the Laplace probability distribution centered at 0 and scaled by 1/ε to sample the random noise. Perturbing the real value by adding the obtained noise results in a masked response.

How is a simple differential privacy mechanism differentially private?

Suppose Asha Hospital provides counseling to patients with drug addiction. The hospital maintains the data of such patients collected through the healthcare app. Now, suppose Betaal—who can query this database—wants to know if Swara is receiving such counseling. Betaal can craft multiple queries as part of the differencing attack, so they reveal Swara's counseling status as "Yes" or "No." For example, Betaal uses the COUNT query, and the result is 30 (i.e., 30 people are receiving the counseling service). If the response to a second COUNT query excluding Swara's name is 29, then Betaal can conclude that Swara is receiving counseling and therefore addicted to drugs. If the second COUNT query result is 30, Betaal would conclude the opposite.

Using the interactive DP setup with the Laplace mechanism (figure 11) would mean that Betaal would always receive noisy results. The result may be centered around 29 or 30, but it may return values such as 28 or 31, or even 27 or 32 with an even smaller probability. Therefore, it is difficult for Betaal to be sure whether the true answer is 29 or 30. This goes back to the notion of Dalenius, where, despite having access to the database, Betaal's knowledge of whether Swara is receiving counseling does not change because the noisy response does not add anything to his prior knowledge.

There are many mechanisms with related algorithms that can be used instead of the Laplace mechanism: for example, an exponential mechanism, a private multiplicative weights algorithm, or a multiplicative weights exponential algorithm. With such mechanisms, DP-based software systems are possible, but there are practical challenges: For example, if the same query must always receive the same noisy response, then it requires a lookup into the log of historic responses. No leakage of information occurs since the answer remains the same, but a log lookup can be expensive in terms of space and time.

What if Betaal poses a query structurally different from but equivalent to the query he posed earlier?

Establishing the equivalence of two queries is known to be computationally hard. Therefore, while DP has several advantages over the traditional privacy-preserving approaches, there are certain limitations:

Identifying the ideal privacy loss parameter ε to have high utility while preserving privacy remains a challenge.
The privacy guarantee in DP exists for only a bounded number of queries, which is a function of the number of distinct data subjects represented in the dataset.²⁰ Thus, designing privacy-preserving mechanisms that can handle an arbitrary number of queries also remains a challenge.
DP is open to side channel attacks in which an adversary can learn facts about the data by monitoring side channels. A classic example is the timing-channel attack. Suppose a query computes the frequency of various medical diagnoses, and Betaal (the adversary) knows that the computation will take 51 μs if a person has cancer and 49 μs otherwise. Then, by merely observing the amount of time spent, Betaal can learn whether a given individual has cancer.
Sensitive attribute disclosure is possible; for example, an adversary can build a classifier over differentially private data that can predict sensitive information.⁷

So far, this article has discussed the standard DP model where the data curator is trustworthy; however, the data curator may be compromised and hence become an untrusted source. This necessitates a shift in the DP model from SDP (standard differential privacy) to LDP (local differential privacy).

In the SDP model (figure 14), the curator is trusted, and noise is injected into the original database on which the end user performs analysis. In the LDP model (figure 15), the curator is untrusted and the noise is injected locally—at the individual level for each data subject—and such perturbed data is aggregated by the untrusted curator. This way, privacy control stays with the data subject.

Further, with privacy regulations such as GDPR (General Data Protection Regulation) and CCPAA (California Consumer Privacy Act), large organizations use the LDP model to avoid the liability arising from misuse of storing sensitive user data. So, with trust assumptions, LDP is more attractive for deployment in DP-based systems. The utility of the statistics released with LDP, however, is poorer than with SDP, because in the LDP model, perturbation happens locally at each individual's end, resulting in larger noise addition.

One fallout of this model is that there is no single source of unperturbed data. Consequently, the gap between SDP and LDP can be interpreted as "high trust assumptions with high utility" in SDP and "lower trust assumptions with lower utility" in LDP. Some recent research exploits cryptographic primitives to bridge the trust-utility gap between SDP and LDP to have the best of both worlds.

Figure 16 shows that SDP can achieve high utility (lower error), while LDP does not rely on a trusted curator and achieves lower utility (high error). The goal is to achieve the utility of SDP under the more practical assumptions of LDP. Use of cryptographic primitives opens a new direction of research: to evolve DP as a promising privacy-preserving technique.²¹

Industrial outlook

DP as a technique has a much wider role to play in many application areas, including cyber physical systems (figure 17) such as smart-grid systems, healthcare systems, IoT (Internet of things), autonomous automobile systems, etc. In the case of smart-grid systems, for example, the power providers record and maintain household energy consumption information using smart meters. This information can reveal a lot about the lifestyle and other details of a household, and misuse could violate the privacy of the consumers. Thus, it becomes necessary to incorporate privacy-preserving techniques into such systems. Similarly, for healthcare and medical systems, data collected by IoT devices, such as blood pressure, blood sugar levels, and sometimes even location details, also need to be obtained in a privacy-aware manner.

The tech giants use DP in various application services. Microsoft uses LDP to protect user privacy in applications such as telemetry in Windows. Apple uses LDP to protect the privacy of user activity in a given time period while still gaining insights that would help improve the intelligence and usability of features such as QuickType and emoji suggestions. Google's RAPPOR (Randomized Aggregatable Privacy-preserving Ordinal Response) system integrated in Chrome acquires data in a privacy-aware manner on how unwanted software hijacks user settings. IBM and Google provide libraries for carrying out machine-learning tasks in a DP-aware manner.

Although these companies have used DP in several applications, researchers have questioned whether such implementations of DP provide adequate privacy guarantees in practice. Applying DP to record-level data collection or release requires employing a large amount of noise to ensure a safe ε. If ε ≤ 1, the analytical utility of DP outputs is likely to be poor.²

One way around this could be to use a very large value of ε to mitigate the utility problem. For example, Apple reportedly uses ε = 6 in MacOS and even ε = 43 for iOS 10 beta versions, whereas in RAPPOR, Google uses ε up to 9. This shows that the applicability of DP in practice is still a challenge as the privacy guarantee of DP is greatly diminished for such larger values of ε.⁹

Further discussion

The need for data privacy has unfolded from the standard use case of data publishing to privacy-driven analytics. Here, DP has gained significant attention because it offers mathematical guarantees. There are challenges, however, in mapping the theory of DP to practice.

What is the ideal DP setting?

The ideal DP setting for a particular use case is one that can mitigate the threats and risks of disclosure of sensitive data while keeping the data utility high. The requirement of privacy has always depended on the context. When the data controller is a trusted entity, the SDP model can be used; if the data controller is untrusted, the LDP model is suitable. In both cases, different DP mechanisms prevent leakage of sensitive information by malicious data analysts. Thus, depending on the use case and its requirements of privacy versus utility, an appropriate DP setting can be chosen.

Which is the right DP mechanism?

No universal DP mechanism is effective for all use cases. The Laplace mechanism can be used only for numeric queries, whereas the exponential mechanism can address both numeric and categorical data in the queries. Thus, the applicability of a mechanism varies based on the use case and the type of data. That said, many DP algorithms cater to specific use cases. New algorithms are being proposed that don't meet the precise mathematical definition and are therefore referred to as "almost differentially private."

How to select the value of ε?

The value of ε can be used to determine the level of privacy. The smaller the value of ε, the better the privacy, but the accuracy of the results may be affected. DP researchers suggest that ε greater than 6 may not be good from a privacy point of view.¹² Although that is certainly a good goalpost, it often may not be achievable given the nuances of the use case. Further, the choice of ε may vary from application to application depending upon the need for privacy in that context. In general, questions like "What value of ε is appropriate?" and "How much privacy is enough?" remain unanswered. There is no easy guide for this, and the best practices have not evolved yet.

When should the use of DP be discontinued?

Privacy losses accumulate²⁰ (i.e., with each new query, the privacy guarantee decreases as additional information about the sensitive data is released). This implies that after a certain number of queries, the application of DP provides no privacy guarantees. Ideally, the privacy loss should be small for strong privacy guarantees. Thus, to mitigate the risks of a growing privacy loss, you can enforce a maximum privacy loss denoted by the privacy budget. Each query can be treated as a privacy expense that incurs an incremental privacy loss. If the number of queries exceeds this privacy budget, then you can stop answering the queries, thereby discontinuing DP.

To avoid hitting the privacy budget limit, you could attempt to make the privacy expense almost zero for most queries. The choice of parameters, however, would typically mean very noisy responses and, therefore, negligible data utility. Thus, owing to either privacy or utility considerations, DP may not be suitable for long-running systems.

Emerging Techniques

Many international organizations promote privacy as a fundamental requirement. Among them, OECD (Organization for Economic Cooperation and Development) has provided general guidance for managing personal information. It has laid down certain principles, such as collection limitation, data quality, purpose limitation, use limitation, security safeguards, openness, individual participation, and accountability. These principles help in governing the privacy requirements during the life cycle of a system.

With a rise in the complexity of the systems, in which the storage unit and computing unit may not be centralized, mitigating privacy disclosure risks is challenging. Such systems, based on, for example, IoT sensors, wearable computing devices, mobile computing, and smart meters, require stronger privacy techniques and protocols. Such privacy techniques should consider the deployment architecture, compute availability at various nodes in the system, sensitive data flows, and threat models.

SWARA WONDERS: How can we go beyond k-anonymity and DP to preserve privacy for complex systems? Can these existing techniques be extended?

Interestingly, research in this direction has been evolving rapidly, and different frameworks and methodologies are being proposed. Suppose the healthcare app used by Swara needs to offer a disease prediction functionality. Such a feature may help Swara; however, using it requires reporting the symptoms. Based on the symptoms, the app would train a predictive machine-learning model using inputs from all its subscribed users. Since this sensitive information is being stored and processed, Swara and other patients are bound to have serious privacy concerns. Techniques such as federated learning could be applied here to get the best of both privacy and utility.

To build a global model for disease prediction while preserving privacy, a local model is trained over the data residing locally on each user's mobile device. The learned model parameters are sent by each user device to the cloud server where aggregation is performed to build a global model. This learned global model is pushed to each user's mobile device for predictions. This process of learning and improving a model locally, then pushing those updates to build the global model centrally and pushing the global model back for local usage, could continue.

Note that the data storage is local; this gives users control over their data while computation happens in a remote server. Thus, learning in such a setup enables privacy by sending model parameters and not sending users' data to a central compute server. This simple federated learning architecture (figure 18) also helps to realize some of OECD's privacy principles mentioned earlier.

These distributed architectures have further expanded with IoT data analytics. For example, in edge computing, heavy computation tasks get offloaded to the edge nodes while client devices such as IoT sensors and smart gadgets are assigned a lightweight task, the output of which is then used to perform heavyweight tasks at the edge nodes. Here, the LDPO (local differential privacy obfuscation) framework is being proposed to ensure data privacy and guarantee data utility for edge computing.

The basic approach of the LDPO framework is to add noise to prevent leakage of private information. Adding noise might degrade data utility, however, which is why the LDP-based data distillation model has been proposed. This limits the collection of personal data while still maximizing data utility.

The LDPO framework is based on components that involve learning the most compact and useful features of data using data minimization and perturbing these identified features with LDP for privacy guarantees. Further, these features are anonymized to a k-bit string using different hash functions so that the transformation yields a unique string. Lastly, the data is transmitted to edge servers, where feature rebuilding and distribution estimation are performed using hash functions for data reconstruction, thereby preventing sensitive attributes from being exposed.

Suppose Swara has participated in a research study where her health parameters are collected using a wearable health gadget. The LDPO framework, as illustrated in figure 19, safeguards Swara's personal data and helps in realizing some OECD principles. Besides the federated learning kind of distributed architectures, FHE (fully homomorphic encryption) and SMPC (secure multiparty computation) are cryptographic techniques that can be used for private computation on data (figure 20).

FHE is an encryption scheme that enables analytical functions to be run directly on encrypted data while yielding the same encrypted results as if the functions were executed on plaintext. If Asha Hospital were to use FHE for disease prediction, Swara's records would be homomorphically encrypted on her device locally and sent to the cloud server for processing. The outcome of the prediction model running in the cloud would also be in the encrypted form that would be sent to Swara's device. Swara would be able to decrypt the response, while no one else, including the cloud administrators, would be able learn anything about Swara's condition. Although this is exciting from a security and privacy standpoint, at the current state of the art, FHE computations run about a million times slower than the equivalent plaintext computations.

Even so, this is already a big improvement over the original trillion times slowdown at the inception of FHE. Several open-source implementations of FHE schemes exist today, and, given FHE's potential benefit to cloud computing, efforts are under way to develop even more efficient implementations, as well as to standardize FHE.

Alternatively, SMPC allows multiple parties to perform computations on their private data to evaluate a function of common interest: SMPC is highly applicable to machine learning as it allows companies to offer their models to perform inferences on the private data of clients while ensuring utmost privacy.

For example, the central server of Asha Hospital's healthcare application could be hosted on a cloud with every registered patient of the hospital, including Swara, having the healthcare app on their devices. Using SMPC, the cloud service provider could execute a trained classification model by securely sharing patient data and sending the securely computed result (such as prediction of a disease) back to the patient.

Several SMPC techniques have been known for a while; however, many of them involve significant message-passing overheads. Research is in progress to develop inexpensive, efficient, and effective SMPC techniques. There are also attempts to judiciously combine SMPC and FHE techniques to come up with hybrid schemes that have acceptable time and communication complexity.

Concluding Remarks

With the world's information now being reshaped in digital form, privacy of individual information has become a crucial concern for individuals and organizations. It is vital for organizations to understand and address the privacy concerns attached to any activity involving data. This article explains how the fictitious Asha Hospital can apply various privacy-preserving techniques. Each of the techniques has different strengths and weaknesses, depending upon the context (use case). There is no silver bullet yet—that is, no universal approach exists to guarantee privacy—but the use of state-of-the-art privacy-preserving techniques can largely avert the potential damage caused by a privacy breach.

Swara, who is privacy-aware, understands the value of the data she holds and the need to safeguard it. On the other side, organizations following privacy principles and using privacy- preserving techniques such as k-anonymity and DP in their operations can neutralize people like Betaal who have malicious intent. The promise of privacy extends beyond these two techniques to new approaches, such as federated learning, LDPO, and FHE.

Acknowledgments

This article was published as the first minigraph in the ACM India Minigraph series (2021). The idea of minigraphs was first suggested by Professor Mathai Joseph and further developed in consultation with Hemant Pande, executive director of ACM India, and Rajeev Shorey, chair of the ACM India Learning Initiatives Committee. We thank the referees of our original minigraph: John Mitchell, professor, Stanford University; Yuval Elovici, professor, Ben-Gurion University; and Sitaram Chamarty, principal scientist, TCS Research, for their valuable and constructive suggestions. We also thank Freya Barua, specialist, talent development, TCS, for helping us with her expert language editing and content orientation.

References

1. Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., Zhu, A. 2005. Approximation algorithms for k-anonymity. Journal of Privacy Technology (November).

2. Bambauer, J., Muralidhar, K., Sarathy, R. 2014. Fool's gold: an illustrated critique of differential privacy. Vanderbilt Journal of Entertainment and Technology Law 16(4), 701-755; https://scholarship.law.vanderbilt.edu/cgi/viewcontent.cgi?article=1207&context=jetlaw.

3. Bayardo, R.J., Agrawal, R. 2005. Data privacy through optimal k-anonymization. In Proceedings of the 21st IEEE International Conference on Data Engineering (April), 217-228; https://dl.acm.org/doi/10.1109/ICDE.2005.42.

4. Britannica. People of India: Ethnic groups; https://www.britannica.com/place/India/People.

5. Chen, R.J., et al. 2021. Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering 5.6, 493-497.

6. Ciriani, V., Di Vimercati, S.D.C., Foresti, S., Samarati, P. 2008. K-anonymous data mining: a survey. In Privacy-Preserving Data Mining, eds. C.C. Aggarwal and P.S. Yu, 105-136. Boston: Springer; https://link.springer.com/chapter/10.1007/978-0-387-70992-5_5.

7. Cormode, G. 2011. Personal privacy vs. population privacy: learning to attack anonymization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1253-1261; https://dl.acm.org/doi/10.1145/2020408.2020598.

8. Dalenius, T. 1977. Towards a methodology for statistical disclosure control. Statistik Tidskrift 15, 429?444.

9. Domingo-Ferrer, J., Sánchez, D., Blanco-Justicia, A. 2021. The limits of differential privacy (and its misuse in data release and machine learning). Communications of the ACM 64(7), 33?35; https://dl.acm.org/doi/10.1145/3433638.

10. Dwork, C., McSherry, F., Nissim, K., Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, 265-284; https://dl.acm.org/doi/10.1007/11681878_14.

11. Gavison, R. 1980. Privacy and the limits of law. The Yale Law Journal 89(3), 421-471; https://www.jstor.org/stable/795891.

12. Greenberg, A. 2017. How one of Apple's key privacy safeguards falls short. Wired (September 15); https://www.wired.com/story/apple-differential-privacy-shortcomings/.

13. Kisilevich, S., Rokach, L., Elovici, Y., Shapira, B. 2009. Efficient multidimensional suppression for k-anonymity. IEEE Transactions on Knowledge and Data Engineering 22(3), 334-347; https://ieeexplore.ieee.org/document/4840348.

14. Lodha, S., Thomas, D. 2007. Probabilistic anonymity. In Proceedings of the First ACM SIGKDD International Conference on Privacy, Security, and Trust in KDD, 56-79; https://dl.acm.org/doi/10.5555/1793474.1793480.

15. Meyerson, A., Williams, R. 2004. On the complexity of optimal k-anonymity. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 223-228; https://dl.acm.org/doi/abs/10.1145/1055558.1055591.

16. Nissenbaum, H. 2004. Privacy as contextual integrity. Washington Law Review 79(1), 119-158; https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10/.

17. Samarati, P., Sweeney, L. 1998. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report. SRI International; http://www.csl.sri.com/papers/sritr-98-04/.

18. Stokes, K., Torra, V. 2012. Reidentification and k-anonymity: a model for disclosure risk in graphs. Soft Computing ? A Fusion of Foundations, Methodologies and Applications 16(10), 1657-1670; https://dl.acm.org/doi/10.1007/s00500-012-0850-4.

19. Sweeney, L. 2002. K-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5), 557-570; https://dl.acm.org/doi/10.1142/S0218488502001648.

20. Ullman, J. 2016. Answering n^2+o(1) counting queries with differential privacy is hard. SIAM Journal on Computing 45(2), 473-496; https://epubs.siam.org/doi/10.1137/130928121.

21. Wagh, S., He, X., Machanavajjhala, A., Mittal, P. 2021. DP-cryptography: marrying differential privacy and cryptography in emerging applications. Communications of the ACM 64(2), 84-93; https://dl.acm.org/doi/10.1145/3418290.

Sutapa Mondal is a researcher at Tata Consultancy Services Ltd. (TCS). She works in the area of data privacy and security at the Cybersecurity and Privacy lab. Her research interests focus primarily on privacy-aware systems in the domain of cloud and service operations. Sutapa has received her Master of Technology from IIIT, Delhi, in computer science with specialization in data engineering and her Bachelor of Technology in information technology from Banasthali University, Rajasthan. She can be reached at [email protected].

Mangesh Gharote is a senior scientist at Tata Consultancy Services Ltd. (TCS). His research centers around privacy and security aware resource management, especially in the domain of Cloud computing and Service operations. Mangesh has received his PhD in Operations Management and Master in Industrial Engineering and Operations Research from IIT Bombay. He can be reached at [email protected].

Sachin P. Lodha is a chief scientist at Tata Consultancy Services Ltd. (TCS) and heads its Cybersecurity and Privacy research. He has a special interest in privacy and related topics. His efforts on that front have led to multiple award-winning innovations. He holds a Bachelor of Technology degree from IIT Bombay, and MS and PhD degrees from Rutgers University, all in computer science. He can be reached at [email protected].

Originally published in Queue vol. 20, no. 3—
Comment on this article in the ACM Digital Library

Privacy of Personal Information

Going incog in a goldfish bowl

Sutapa Mondal, Mangesh S. Gharote, and Sachin P. Lodha TCS Research, Tata Consultancy Services Ltd.

Privacy's Grand Challenge

A closer look at the data stored by Asha Hospital

Betaal, the Malevolent

Disclosure risk

The data juggle

What if PII is removed?

What if sensitive data is further shuffled?

What if everything is removed?

Tug of war: privacy vs. utility

K-anonymity: An Overview

A closer look at k-anonymity

Extensions to k-anonymity

Further discussion

The right choice of k

Identifying quasi-identifiers

The ideal amount of generalization to achieve desired anonymization

Efficiently achieving k-anonymity

Takeaways

Differential Privacy: An Overview

A closer look at differential privacy

How is a simple differential privacy mechanism differentially private?

What if Betaal poses a query structurally different from but equivalent to the query he posed earlier?

Industrial outlook

Further discussion

What is the ideal DP setting?

Which is the right DP mechanism?

How to select the value of ε?

When should the use of DP be discontinued?

Emerging Techniques

Concluding Remarks

Acknowledgments

References

Sutapa Mondal,
Mangesh S. Gharote,
and Sachin P. Lodha
TCS Research, Tata Consultancy Services Ltd.