A friend, and NYIT, have asked me to do a CISSP review seminar. Since I've taught the seminars for two decades, first for ISC2 and then for various other commercial training companies, this is not hard. I'm about 70% through my first draft. At the same time, I'm going to be giving the differential privacy presentation on Friday. So Gloria asked me if I was going to be putting any differential privacy content into the review seminar.
I had to think about that. For one thing, knowing what I know about the CISSP exam question process, I very much doubt that anyone (other than myself) has yet created any questions about differential privacy in the CISSP exam question style. (There is plenty of trivia in regard to differential privacy that can be used to make up questions to prove how smart you are in comparison to the other guy, but that isn't the CISSP question style.)
But the next problem is, where would I put it within the domains? Would it go in Law, Investigation, and Ethics, which is where we usually talk about privacy? But differential privacy isn't really about privacy. At least not your privacy. It's not something you can do yourself, but something that enterprises, developers, and whole infrastructures of the IT universe have to put in place in order to protect privacy on a much larger scale. Do I put it in crypto? There's lots of math involved, some of it similar to a lot of work in various corners of crypto (although not exactly the same). Or should it go into Applications Security, since most of it primarily applies to databases and queries, and it has to be baked into database design at a pretty structural level in order to actually work?
Part of the problem is that differential privacy isn't actually a single "thing." It's an amalgam of a number of ideas and technologies, none of them actually new, trying to address some interesting, and long-term, problems of privacy and disclosure. Trying to see whether these approaches actually work has raised some new issues and concepts, and differential privacy probably will provide some important and interesting approaches to some aspects of privacy and database design in the years to come. But it's kind of like Public Key Infrastructure (PKI) in crypto: you've got a lot of moving parts, and you have to make sure they are all properly in place in order to have the system work properly and not be in danger of some kind of attack on your implementation. It's also kind of the quantitative risk analysis of privacy and database design: there are a lot of details, and it's a lot of work, and most people are going to be too lazy to try to make it work properly.
At best, the topic and "mechanisms" of implementing differential privacy would be better suited to the CSSLP, but that was before the recent "refresh". Now every domain has the word "secure" in it. So how do we fit those technical topics which are "privacy preserving" into the certification? Not sure. (ISC)2 pretty much stays out of that lane...
I had to do a little homework, as I had not heard of this. Yup, it slipped by me. I would have to say the domain it fits in would be directly related to what about it is being tested. Is the use of it being tested, or the technical understanding of it? As with encryption: when to encrypt data may be more of a legal / ethical domain question, but what encryption is, I guess, is a domain in itself.
If this makes any sense...
@JKWiniger Hey I had to look it up too! So basically it is anonymization of data elements so that someone studying the data cannot find out the private information of anyone based on the data. For example, if I had a chart/spreadsheet of people with HIV/AIDS in a certain country, I would not be able to find out who they were as individuals, and thus could not invade their privacy. But a government health agency may still need that data so they can increase awareness efforts, increase activities to reduce the spread, and so on.
In one of the data warehouses I worked at we had an anonymization procedure that assigned unique IDs to the incoming data, thus stripping the private information link away from the data. The person(s) who anonymized the data had no access to the data after anonymization. The people working with the data had no access to the anonymization procedure. We kept strict rules in place around the protection of the procedure and where the procedures happened. We also isolated the anonymization procedure so that it was physically separated from the network that held the data. If an attacker got in, they could not reverse engineer the data to find out who the individual was. We further isolated especially sensitive data that could be aggregated to narrow down the search. For example, if we held data about a rare disease and that data included a living area code, we ensured that only the necessary data was pushed forward into the data warehouse.
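That separation-of-duties pipeline can be sketched in a few lines. This is a hypothetical illustration (the function and field names are mine, not from any real warehouse): the identifying field is replaced with a random ID, and the ID-to-identity mapping goes to a different team than the data itself.

```python
import uuid

def pseudonymize(records, id_field="name"):
    """Replace the identifying field with a random unique ID.

    Returns (anonymized_records, mapping).  The mapping is the
    re-identification key, and must be held by a separate team,
    on a separate system, away from the data warehouse."""
    mapping = {}
    anonymized = []
    for rec in records:
        pid = str(uuid.uuid4())
        mapping[pid] = rec[id_field]
        clean = {k: v for k, v in rec.items() if k != id_field}
        clean["id"] = pid
        anonymized.append(clean)
    return anonymized, mapping

patients = [{"name": "Alice", "diagnosis": "flu"},
            {"name": "Bob", "diagnosis": "measles"}]
anon, mapping = pseudonymize(patients)
# The warehouse team gets `anon` only; the anonymization team
# keeps `mapping`, and neither has access to the other's half.
```

The protection here is entirely procedural: the code is trivial, but keeping the two halves on physically separate systems, with disjoint access lists, is what stops an attacker from reversing it.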
If you ever want to get a real "anonymous" survey filled out with true data you have to convince the people taking the survey that the end data has had differential privacy applied to it. You can tell an employee it is anonymous and to answer the questions in a brutally honest fashion (i.e., not hiding displeasure with management for fear of retribution), but when they get to the demographic portion of the survey that identifies them as the only GS-13 white male in the IT security division of the AB division of XYZ corporation in the 35-40 year old range, they start having second thoughts about submitting the survey truthfully, out of fear of retribution. That is usually one of the reasons why employee surveys have a hard time reaching 100% participation rates with "honest" responses. Once the employee realizes that they can be identified through certain key demographics, they either fail to submit the survey or they go back and change their answers to more polite or politically correct ones.
So survey creators take note, demographics are important, but asking for them may skew your results, so consider if they are really necessary or make two surveys that are not linked. Or do a really good job of explaining how the responses will not be linked to an individual.
In the first account I composed, O Mystikophilus, I began to tell of all that differential privacy was and could do for us.
Which is, perhaps, only to be expected. After all, we still don't agree on what privacy is. It is pretty much impossible to get a strict, working definition of privacy, at least in terms that are useful in the information age. Everyone has personal and subjective feelings about what information is and is not private.
One of the best definitions I've ever come across states that privacy is your ability to control information about you. And that ability has never been absolute. (And I don't just mean Scott McNealy's "You have zero privacy anyway. Get over it.") We live in communities, and the people around you can see and hear you, see where you go, note who you talk to. That's been the reality since we began to be able to talk. We can, temporarily, shroud ourselves, whisper to another, or get away from the group, but our right to privacy is not binary, in the same sense as the right to life or free speech. We don't, therefore, have a "right" to privacy any more than we have the "right to be forgotten" in anything other than a purely artificial sense.
This is reflected, to an extent, in our laws and constitutions. They don't mention much about privacy. In my original presentation, I was challenged on this statement by someone from Portugal, who said that Portugal's constitution does say we have a right to privacy. But the right to privacy that it mentions really only limits what the government can do with information about you, like the American Privacy Act of 1974. (Since they were written at about the same time, this is hardly surprising.) He then said that the first mention of privacy in an English law dates to 1361. But, again, that law says that the authorities can't look into the window of your house, and is much more about illegal search than it is about what we consider private.
In a fairly major sense, differential privacy avoids all of this definition of privacy issue by not caring what privacy is. Differential privacy is more concerned with databases, and queries on databases. Specifically it looks at the problems of inference and aggregation attacks. An inference attack is where you can infer, from information you are given access to, information that you are not given access to. For example, suppose I am given access to a database that holds information about prices, but does not tell me about supply. If I see that the price of a certain commodity is going up, I can infer that the supply is going down, even though I've been forbidden access to that data. Aggregation is the ability to find out restricted information by combining available information, often from a variety of sources. The whole field of open source intelligence (OSInt) is based on this idea. In database security, inference and aggregation attacks are a long-standing problem with very few solutions.
We can, of course, try to address the problem by restricting what queries are allowed. We can say that individual items can't be reported, and that only queries on groups of data are allowed. (Aggregation can be both attack and defence.) Unless we are very careful about that, we get the situation where we say that you can't know Rob Slade's salary, but you can know the average salary of everyone in Vancouver. But if we then allow that you can ask for the average salary of everyone in Vancouver except for Rob Slade, we can aggregate those two queries and infer what Rob Slade's salary is.
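That two-query inference is easy to demonstrate. Here is a minimal sketch, with made-up salary figures and a hypothetical `average_salary` interface that only ever returns group averages:

```python
# Made-up salary table; the query interface below only ever
# returns averages, never an individual row.
salaries = {"Alice": 80_000, "Bob": 95_000, "Rob Slade": 70_000, "Dana": 85_000}

def average_salary(exclude=None):
    """An 'aggregates only' query, with an innocent-looking exclusion."""
    vals = [s for name, s in salaries.items() if name != exclude]
    return sum(vals) / len(vals)

# Two perfectly legal aggregate queries...
avg_all = average_salary()
avg_without = average_salary(exclude="Rob Slade")

# ...combined, reveal the restricted individual value:
n = len(salaries)
inferred = avg_all * n - avg_without * (n - 1)
print(round(inferred))  # 70000
```

No single query broke the rules; the attack lives entirely in the arithmetic done after the results come back, which is why restricting queries one at a time is so hard to get right.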
So, what can we do about it? Well, you remember Bell-LaPadula? Of course you do. They came up with a simple security property. (For confidentiality. They were only concerned with confidentiality.) If you don't want people to know secret information, don't tell them. If you are at a medium security level, you can't see any high security information, and you can't tell anything to people who are at a low security level. No read up, no write down. Simple.
Ah, if only life were so simple. But try to build a Bell-LaPadula computer. (OK yes, I know. "Multics." Fine. Try and build a computer that combines Bell-LaPadula and Biba. Come back when you're done.) However, formal models do give us guidance and can be very useful in getting our minds around the problem. So, in 2006, some people thought about how to protect against aggregation and inference attacks on databases. (Dwork/McSherry/Nissim/Smith, “Calibrating noise to sensitivity in private data analysis,” Proceedings of the Third conference on Theory of Cryptography)
So, some simple principles. Well, a person's privacy cannot be compromised by a statistical release if their data are not in the database. That's basic. You can't have your privacy violated if your information isn't there. So, how can we make it as if your information isn't there? The goal of differential privacy is to give each individual roughly the same privacy that would result from not having their data in the database. (Hence the "differential" part: there should be no, or next to no, "difference" in queries whether your data is there or not.) So the only functions (mostly statistical) run on the database should not overly depend on the data of any one individual.
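The Dwork/McSherry/Nissim/Smith paper cited above makes "roughly the same" precise. For any two neighbouring databases $D$ and $D'$ (identical except for one person's record), a randomized query mechanism $M$ is $\varepsilon$-differentially private if, for every possible set of outputs $S$:

```latex
\Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\varepsilon} \cdot \Pr\bigl[M(D') \in S\bigr]
```

The smaller $\varepsilon$ is, the less any answer can depend on whether your record is present, which is exactly the "as if your information isn't there" guarantee.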
And, that leads to the Fundamental Law of Information Recovery: in the most general case, privacy cannot be protected without injecting some amount of noise. And as queries are made on the data of fewer people, you need more noise.
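The standard way to inject that noise, from the same Dwork et al. line of work, is the Laplace mechanism: add noise drawn from a Laplace distribution whose scale is the query's sensitivity divided by epsilon. A rough sketch in Python (the helper names are mine; a real deployment would use a vetted library):

```python
import math
import random

def laplace_noise(scale):
    """Inverse-CDF sample from a Laplace(0, scale) distribution."""
    u = random.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(data, predicate, epsilon):
    """Counting query with Laplace noise calibrated to sensitivity.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1/epsilon."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

incomes = [52_000, 48_000, 61_000, 75_000, 44_000, 58_000]
# True answer is 4; each release gets fresh noise around it.
noisy = private_count(incomes, lambda s: s > 50_000, epsilon=0.5)
```

A count has sensitivity 1, but for an average over n people with values bounded by B the sensitivity is B/n: the smaller the group, the larger the noise relative to the answer, which is the Fundamental Law in action.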
So how do we get this to work? (to be continued ...)
As I have said, differential privacy is not the type of privacy we normally think of when we think of privacy. But it is related, and can be valuable. Coincident with starting this research and writing on differential privacy, I have been watching "Search and Rescue: North Shore," which is a terrific five-part documentary series about the team and its activities. I believe it is available for streaming simply by signing up (for free) at the Knowledge Network.
I highly recommend it. Not only is it the gorgeous scenery of where I live, and some of the emergency management people I've trained with, it also has numerous lessons about planning, training, risk analysis, and other elements important to security management, security operations, and business continuity. It is a wonderful example of filmmaking. It must have been a bear to edit, since they not only embedded cameramen with the teams, but, in a number of cases, had helmet cameras, fixed cameras inside helicopters, cameras fixed to quad bikes, cameras fixed to rope gear, and even aerial drone shots, all of which had to be spliced together to create a whole, and seamless, narrative.
It also gives you yet another example of an inference attack. Since it involves real situations, real rescues, and real people, the filmmakers had to get permission from those involved in cases where you could clearly identify someone. In some cases, that permission obviously wasn't given, and faces are blurred out in the final series. This allows you to infer who was OK with being involved in the final product, and who was more bashful (or embarrassed).
As previously noted, aggregation and inference attacks have been a persistent and intractable problem in database security. But aggregation can also be used as a protection. British Columbia's provincial health officer, Dr. Bonnie Henry, has done a masterful job both of managing the COVID-19 pandemic, and leading the communications about it. For months she has, on an almost daily basis, provided a great deal of data on the situation. However, initially that data was only provided for the five major health regions of the province. The reporters asking questions on "The Doctor Bonnie Show (co-starring Adrian Dix and Nigel Howard)" have consistently demanded data by towns, schools, and even individual neighbourhoods. As Dr. Henry has pointed out, providing data at that level, when the numbers are small, would allow for inference attacks that could identify individuals, so only sufficiently large sets of aggregated data are provided, in order to preserve individual privacy. As the numbers, of cases, outbreaks, and even, unfortunately, deaths, have increased, however, it has become possible to provide information based first on local health areas, and now on individual towns. (Hopefully it won't get to the point where it is safe to provide data on individual neighbourhoods.)
Aggregation may have to be done carefully. There are situations, and certain types of data, where you may wish to have anonymization take place prior to aggregation. In those cases, you may even wish to have separate teams doing the anonymization and the aggregation, and a Brewer-Nash type firewall between those teams, so that the aggregated data cannot be re-identified. And, of course, designing the anonymization, and the aggregated database, in such a way that it is not possible to de-anonymize the data is a non-trivial task.
Aggregation is not a new concept in database security. We've been using it for years. Even decades ago it was part of the notion of data warehousing, with the idea being that we would use lots and lots of data that had no real personal information and pull out insights without ever having to risk someone's personal privacy. But, as with most simple answers, there are problems. In many cases, data can't be completely anonymized and still remain useful. And anonymization isn't limited simply to the removal of personally identifiable information. Anonymization doesn't even seem to be enough: aggregation itself seems to lead to privacy risks. At one point AOL made a bunch of its search data available to the public. The feeling was that no personal information had been involved, and therefore there was no risk to privacy. However, some privacy experts started digging into the data, and found that, simply given the volume of the data, it was, in fact, fairly simple to collect searches related to an individual, and, in many cases, identify them. It's also now fairly widely accepted (except by most of the general public, it seems) that the aggregation of even trivial posts on social media can result in the compilation of dossiers that the spy agencies of the past would have gladly killed for. As has been pointed out, the NSA didn't have to go to all that trouble to illegally collect data on Americans: they just had to read Facebook.
So, that leads us to the Fundamental Law of Information Recovery, and noise.
But the next problem is, where would I put it within the domains?
Based on how I understand it and read here: NIST Differential Privacy Blog Series
I'd place this in Domain 3: Security Architecture and Engineering. This is the same domain where encryption, PKI, and key management are covered. It's a very interesting concept where the amount of junk data / noise affects the accuracy of the results.
Of the CISSP sample questions which I have collected over the decades, one of my very favorite is this one:
Which of the following is NOT an effective deterrent against a database inference attack?
a. Partitioning
b. Small query sets
c. Noise and perturbation
d. Cell suppression
Why do I like it so much? I have found that a lot of people get this one wrong. Remember, you are supposed to answer the question asked, from the answers provided. We were not asked, "Is it a good idea to add noise to your database?" We were asked whether it would help in a specific situation.
First off, let's get rid of a and d. Database inference attacks are an old and established threat against database systems, and are not subject to many defences. Partitioning and cell suppression may not help much, but they do help.
Now we are left with small query sets (b) and noise and perturbation (c). Lots of people choose noise and perturbation, because, well, noise. We don't want to introduce errors into our databases, do we? That has to be the worst (and therefore, in the wording of this question, right) answer.
The thing is that small query sets are, specifically, one of the tools that you do use to mount inference attacks. So small query sets are, specifically, NOT an effective deterrent against a database inference attack.
And what about noise and perturbation? Well, if you are really, seriously, concerned about inference attacks, introducing small sources of noise and perturbation (very carefully) is a very effective protection.
Sometimes we can add quite a bit of noise, and still have useful information (and privacy). The social sciences have a system called "randomized response" for situations where you want to poll a group or population about embarrassing or illegal behaviour. If you want to ask people if they have ever murdered someone, they'll probably just answer "no." However, the randomized response system tells them to flip a coin. If the coin comes up heads, answer truthfully. If the coin comes up tails, then flip the coin again, and answer "yes" if heads, and "no" if tails. Since, from outside, we don't know if the subjects got heads on the first coin toss, we don't know if they answered truthfully or not to the question. Since this preserves their privacy, they are more likely to answer truthfully. We can, statistically, remove the "noise" since we know that 25% of the total answers will be "yes" on the basis of the random coin flipping.
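The coin-flip protocol, and the denoising step, can be simulated directly. A sketch (the simulation parameters are made up):

```python
import random

def randomized_response(truth):
    """One respondent's answer under the two-coin-flip protocol."""
    if random.random() < 0.5:        # first flip heads: answer truthfully
        return truth
    return random.random() < 0.5     # tails: second flip decides the answer

def estimate_true_rate(answers):
    """Denoise: P(yes) = 0.5 * p_true + 0.25, so p_true = 2 * (P(yes) - 0.25)."""
    observed = sum(answers) / len(answers)
    return 2 * (observed - 0.25)

random.seed(42)
# Simulate 100,000 respondents, 10% of whom have the sensitive trait.
population = [random.random() < 0.10 for _ in range(100_000)]
answers = [randomized_response(t) for t in population]
print(round(estimate_true_rate(answers), 3))  # close to 0.10
```

Each individual answer is deniable (any "yes" may have come from the coin), yet the population-level estimate comes out accurate, which is the whole differential privacy bargain in miniature.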
Sometimes the noise we introduce can be done simply on the basis of rounding. If we have the classic case of asking "What is the average salary in Vancouver?" and then asking "What is the average salary for everyone in Vancouver except Rob Slade?" then merely rounding the answers to the nearest thousand (or even hundred) dollars probably skews the numbers enough that you won't be able to calculate my salary with any degree of accuracy.
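You can see the effect on the two-query salary attack described earlier. A sketch with invented figures: both released averages are rounded to the nearest thousand, and the differencing arithmetic no longer recovers the true value:

```python
# Invented figures; every released average is rounded to the nearest $1,000.
salaries = {"Alice": 81_250, "Bob": 94_730, "Rob Slade": 69_480, "Dana": 85_110}

def rounded_average(exclude=None, nearest=1_000):
    vals = [s for name, s in salaries.items() if name != exclude]
    return round(sum(vals) / len(vals) / nearest) * nearest

n = len(salaries)
# The same differencing arithmetic as in the exact-average attack...
inferred = rounded_average() * n - rounded_average(exclude="Rob Slade") * (n - 1)
print(inferred, "vs actual", salaries["Rob Slade"])  # 71000 vs actual 69480
```

Note that the rounding error gets multiplied by the group size in the differencing step, so even coarse rounding can throw the inferred value off by more than the rounding granularity itself.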
The amount and type of noise that will protect privacy and yet still allow useful results will likely depend upon the data being collected and the questions being asked.