What is the Privacy Accounting usage?

Hello Opacus team,
Can someone explain the usage of "Privacy Accounting"? Is there any documentation or code example regarding it?

Hello,

Privacy accounting is used to track the privacy budget. It is useful to know how much privacy is lost at different stages of the training.

And this is how to use it: opacus/Migration_Guide.md at main · pytorch/opacus · GitHub

And some examples (search for “get_epsilon”):
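To illustrate, a minimal sketch of my own (assuming the Opacus 1.x API; train_one_epoch stands in for your usual training loop):

from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.1,
    max_grad_norm=1.0,
)

DELTA = 1e-5
for epoch in range(10):
    train_one_epoch(model, optimizer, data_loader)
    # The accountant reports the cumulative epsilon spent so far
    epsilon = privacy_engine.get_epsilon(delta=DELTA)
    print(f"epoch {epoch}: epsilon = {epsilon:.2f} (delta = {DELTA})")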

@ashkan_software Thanks for the reply.
From tutorials and some blogs I get the concept of the privacy budget as a quantitative measure of privacy protection; in Opacus it is the (epsilon, delta) pair.
Below is my understanding; I am not sure whether it is correct.

It looks like there are two ways:

  1. The user specifies the total epsilon to be spent (target_epsilon) and then uses the API below. In this case epochs/target_epsilon/target_delta/max_grad_norm are the arguments, and it looks like the actual epsilon spent, accumulated over all training epochs, should not exceed target_epsilon.
from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()

# Opacus derives the noise_multiplier internally so that the budget
# spent over `epochs` epochs stays within (target_epsilon, target_delta)
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    epochs=EPOCHS,
    target_epsilon=EPSILON,
    target_delta=DELTA,
    max_grad_norm=MAX_GRAD_NORM,
)
  2. The other way is shown below. In this case the user specifies noise_multiplier and max_grad_norm as arguments, and then the epsilon/delta (and a sigma) can be obtained from get_epsilon, as in the one-liner after this snippet.
from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.1,  # sigma: noise std relative to max_grad_norm
    max_grad_norm=1.0,     # per-sample gradient clipping threshold
)
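For example, I would then query it at any point during training (assuming DELTA is my chosen target delta):

epsilon = privacy_engine.get_epsilon(delta=DELTA)  # cumulative epsilon spent so far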

My questions:

  1. Is my above description correct?
  2. What is the major difference between these two methods, particularly regarding usage scenarios? (They seem "mutually exclusive"; to me the mutual exclusivity is between epsilon and noise_multiplier.)
  3. I did not find many documents explaining noise_multiplier and max_grad_norm (the maximum gradient L2 norm, i.e. the clipping value; is that sigma in the project?). Can you suggest any material for that?
  4. Finally, I did not find an obvious link between the privacy budget (epsilon/delta, etc.) and Privacy Accounting, which includes APIs like IAccountant, RDPAccountant and GaussianAccountant. What is the logic behind that?

Hello,

Let me answer your questions:

  1. Your understanding is correct. Just keep in mind that when calling privacy_engine.make_private_with_epsilon, we want to make sure that epsilon stays below target_epsilon across all epochs, not just one (similar to here, where the epsilon does not exceed 12). Your understanding of the second way of calling make_private_* is also correct.
  2. The major difference is simply between scenarios where you know your privacy budget and want to keep privacy below that threshold, vs. scenarios where you do not know the privacy budget in advance and are trying to see what privacy you get for different noise and clipping parameters.
  3. We have some great tutorials and videos listed in our tutorials that answer your question about noise_multiplier (or sigma) and max_grad_norm (or C) (for example, the intro to DP-SGD or the videos).
  4. Privacy is calculated in the privacy accounting classes. Think of it this way: "privacy accounting" = "privacy calculation". Now there are different ways of calculating that privacy. We implemented two of them: RDP and Gaussian; see the sketch below.
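A minimal sketch of choosing the accountant when constructing the engine (assuming Opacus 1.x):

from opacus import PrivacyEngine

# RDP accountant (the default)
privacy_engine = PrivacyEngine(accountant="rdp")

# Gaussian DP accountant
privacy_engine = PrivacyEngine(accountant="gdp")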

Thank you so much! I still have some further detailed questions…

  1. Your understanding is correct. Just keep in mind that when calling privacy_engine.make_private_with_epsilon, we want to make sure that epsilon stays below target_epsilon across all epochs, not just one (similar to here, where the epsilon does not exceed 12). Your understanding of the second way of calling make_private_* is also correct.

For the make_private_with_epsilon API, let's say target_epsilon = 50, epochs = 10, max_grad_norm = 1.2, target_delta = 1/(2*len(dataset)):
1. Will the epsilon spent for each training epoch be smaller than 50, or smaller than 50/10 = 5, or some other value? In my test, the reported epsilon increases epoch by epoch, from about 48 to more than 100…
2. How do I choose a proper epsilon value anyway?
3. How is max_grad_norm defined then? From my study, this is the max gradient clipping value used to clip each p.grad in the model parameters.

For the noise_multiplier API, there is another parameter, max_per_sample_grad_norm:
4. What is the difference between max_grad_norm and max_per_sample_grad_norm? (This likely only matters when the clipping method differs; in the example, CLIP_PER_LAYER = false gives clipping = "flat" and CLIP_PER_LAYER = true gives clipping = "per_layer".)
5. Furthermore, in the example code, when the flag CLIP_PER_LAYER = true (clipping = "per_layer"), there is a calculation: max_grad_norm = [max_per_sample_grad_norm / np.sqrt(n_layers)] * n_layers
Could you explain a little more about this? I don't understand why a sqrt is used here.
6. What does noise_multiplier really mean, and how do I choose it properly? In the API documentation it is defined as "The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added (how much noise to add)", but that is still not clear to me. Is there any detailed reference document I can refer to?

Hi @Leonmac

Will the epsilon spent for each training epoch be smaller than 50, or smaller than 50/10 = 5?

The total privacy budget doesn't increase linearly with the number of epochs, and the epsilon you "spend" on each epoch will decrease.

That said, the target_epsilon parameter passed to make_private_with_epsilon is the expected epsilon at the end of training (assuming you train for the given number of epochs). If that's not the case, I suggest looking into the implementation of get_noise_multiplier; some of the default values there might have been picked under the assumption that the target epsilon is smaller (<20).
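For illustration, that helper can also be called directly (a sketch; the parameter values are hypothetical, and sample_rate is the batch size divided by the dataset size):

from opacus.accountants.utils import get_noise_multiplier

# The sigma that make_private_with_epsilon would train with
sigma = get_noise_multiplier(
    target_epsilon=50.0,
    target_delta=1e-5,
    sample_rate=256 / 50_000,
    epochs=10,
)
print(sigma)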

How do I choose a proper epsilon value anyway?

Good question, but unfortunately there's no good answer. Understanding how to interpret epsilon is tricky, and so is picking the most appropriate value for your use case. What I can recommend, though, is to look at what values are used in the existing literature and start from there. You can also check which values were used in real-world applications (e.g. A list of real-world uses of differential privacy - Ted is writing things)

Off the top of my head, people usually work with eps < 10, and something around 3 is considered strong privacy

How is max_grad_norm defined then? From my study, this is the max gradient clipping value used to clip each p.grad in the model parameters.

I’m not sure I understand the question. max_grad_norm isn’t used during privacy accounting. On each training step we compute the per-sample gradient norm (i.e. imagine each sample is processed in its own batch, and all of the parameter gradients from all the layers are put into one vector) and shrink (or “clip”) it so that its L2 norm is no greater than max_grad_norm
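To illustrate (a toy sketch, not the actual Opacus implementation):

import torch

max_grad_norm = 1.0
# one sample's gradients from all layers, flattened into a single vector
per_sample_grad = torch.randn(200)

norm = per_sample_grad.norm(2)
# shrink the vector only if its norm exceeds the threshold
clip_factor = (max_grad_norm / (norm + 1e-6)).clamp(max=1.0)
clipped = per_sample_grad * clip_factor  # now ||clipped||_2 <= max_grad_norm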

What is the difference between max_grad_norm and max_per_sample_grad_norm?

They are the same thing; max_grad_norm is just shorter and easier to write

Could you explain a little more about this? I don’t understand why a sqrt is used here.

Ok, we can think about it this way. Imagine we have two layers with 100 parameters each in our network, and we’re using max_grad_norm=1.
If we’re using regular flat clipping, it just means that the L2 norm of a 200-element vector should be <= 1.
With per-layer clipping we want separate clipping thresholds for each layer. However, the top-line requirement still stands: the overall gradient norm (the L2 norm of the 200-element vector) should be <= 1.

This means that each layer’s total norm should be <= 1.0 / sqrt(2). The square root pops up because the L2 norm is the square root of a sum of squares: if each of the two layer sub-vectors has norm <= 1/sqrt(2), the concatenated vector’s norm is at most sqrt((1/sqrt(2))^2 + (1/sqrt(2))^2) = 1.
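A quick numeric check of that argument:

import numpy as np

n_layers = 2
max_grad_norm = 1.0
per_layer = max_grad_norm / np.sqrt(n_layers)  # ~0.7071, as in the example code

# if each layer's gradient is clipped to per_layer, the overall norm is
layer_norms = np.array([per_layer] * n_layers)
print(np.sqrt(np.sum(layer_norms ** 2)))  # 1.0 -- the flat budget is preserved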

What does noise_multiplier really mean, and how do I choose it properly?

One thing that could help is reading Algorithm 1 from the original DP-SGD paper; sigma there is the same as noise_multiplier in our code
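For illustration, a toy sketch of that noise step (the values are hypothetical; C is max_grad_norm in Opacus terms):

import torch

noise_multiplier = 1.1  # sigma in the paper
max_grad_norm = 1.0     # C, the clipping norm (the L2-sensitivity of the clipped sum)
batch_size = 256

# sum of the already-clipped per-sample gradients for one batch
summed_grad = torch.randn(200)

# Gaussian noise with std = sigma * C is added to the sum, then averaged
noise = torch.normal(0.0, noise_multiplier * max_grad_norm, size=summed_grad.shape)
noisy_grad = (summed_grad + noise) / batch_size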