Why the BatchNorm layer is not compatible with DP-SGD

Hi,
I want to figure out why the BatchNorm layer is not compatible with DP-SGD. I’ve seen two explanations:

  1. BatchNorm computes the mean and variance across the batch, creating a dependency between the samples in a batch, which is a privacy violation. This is noted in the Opacus documentation (Opacus · Train PyTorch models with Differential Privacy) and in [1], [2].

  2. At evaluation time, BatchNorm normalizes with the mean and variance computed over the training data, as noted in [3] (a short snippet below illustrates both behaviors).
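
For context, this is the behavior explanation 2 refers to: in training mode BatchNorm normalizes with the current batch’s statistics, while in eval mode it reuses the running statistics accumulated from the training data. A minimal illustration (plain PyTorch, nothing DP-specific):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)
x = torch.randn(16, 3)

bn.train()
_ = bn(x)   # training mode: normalizes with this batch's mean/var
            # and updates bn.running_mean / bn.running_var

bn.eval()
_ = bn(x)   # eval mode: normalizes with the running statistics
            # accumulated during training
```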

Explanation 1 is cited more widely, but I don’t understand why computations performed before the DP guarantee is applied would violate privacy.
Explanation 2 is easy to understand, and the approach proposed in [3] to adapt BatchNorm to DP training does not consider the reason mentioned in explanation 1.

I’m quite confused. Can somebody tell me why?
Thanks in advance.

Ref:
[1] Shamsabadi, A. S., Papernot, N. Losing Less: A Loss for Differentially Private Deep Learning. 2021.
[2] Yu, D., Zhang, H., Chen, W., et al. Do Not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning. arXiv preprint arXiv:2102.12677, 2021.
[3] Davody, A., Adelani, D. I., Kleinbauer, T., et al. On the Effect of Normalization Layers on Differentially Private Training of Deep Neural Networks. arXiv preprint arXiv:2006.10919, 2020.

Explanation 1 is correct. In DP-SGD, we replace the sum of gradients with a “noisy sum”: each sample is chosen to participate independently with probability q (the sampling rate), its gradient is clipped, and Gaussian noise is added to the sum.
The important aspect is that each sample’s contribution to the sum is bounded (in our case by the clipping constant C). In particular, adding or removing a sample from a batch changes the sum of gradients by at most C. When using batch norm, adding or removing a sample can affect the other samples’ gradients, so the contribution is no longer bounded.
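
A minimal sketch of that noisy sum, assuming the per-sample gradients have already been computed and flattened (the function and its arguments are hypothetical, not Opacus’s API):

```python
import torch

def noisy_grad_sum(per_sample_grads: torch.Tensor, C: float, sigma: float) -> torch.Tensor:
    """per_sample_grads: (batch_size, num_params), one flattened gradient per sample."""
    # Clip each per-sample gradient to L2 norm at most C, so every sample
    # contributes at most C to the sum.
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    clipped = per_sample_grads * (C / (norms + 1e-6)).clamp(max=1.0)

    # The sensitivity of the clipped sum is C, so the Gaussian noise is scaled by sigma * C.
    noise = torch.normal(0.0, sigma * C, size=(per_sample_grads.shape[1],))
    return clipped.sum(dim=0) + noise
```

The C bound only holds if each sample’s clipped gradient is computed independently of the other samples in the batch, and that independence is exactly what batch norm breaks.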

Hi alex,
Thanks for your information! I have another question.
What if I normalize the dataset before training, e.g., with torchvision.transforms.Normalize(mean, std) in PyTorch? Would the adjacent datasets in the definition of differential privacy then be the normalized datasets differing by one sample, instead of the original datasets? I ask because normalization after adding or removing a sample would affect the whole dataset.
Is there any difference in the guarantee provided by DP-SGD with or without dataset normalization, given that it is the original data we want to protect?
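
For concreteness, the two cases I have in mind look roughly like this (the dataset-derived data_mean / data_std names are hypothetical):

```python
from torchvision import transforms

# Case A: fixed, publicly known constants (e.g. the usual ImageNet values);
# adding or removing a training sample does not change the transform.
normalize_fixed = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                       std=[0.229, 0.224, 0.225])

# Case B (hypothetical): statistics computed from the training set itself, e.g.
#   data_mean = train_images.mean(dim=(0, 2, 3))
#   data_std = train_images.std(dim=(0, 2, 3))
# Adding or removing one sample changes data_mean / data_std and therefore the
# normalized value of every other sample.
# normalize_data = transforms.Normalize(mean=data_mean, std=data_std)
```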

DP really aims at hiding the presence of individual samples, so whether the data is normalized or not doesn’t really make a difference. Note that this assumes that the mean and std are not computed on the dataset itself, or that they are computed privately!
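
As an illustration of “computed privately” (a hypothetical sketch, not something Opacus provides): a scalar mean over data clamped to [0, bound] could be released with the Gaussian mechanism, with the std handled analogously and the (epsilon, delta) cost of the chosen sigma tracked separately by your accountant.

```python
import torch

def private_scalar_mean(samples: torch.Tensor, sigma: float, bound: float = 1.0) -> torch.Tensor:
    """Release the overall mean of `samples` (clamped to [0, bound]) with Gaussian noise.

    Replacing one of the n samples moves the mean by at most bound / n
    (replace-one adjacency), so the noise is calibrated to that sensitivity.
    """
    n = samples.shape[0]
    mean = samples.clamp(0.0, bound).mean()
    sensitivity = bound / n
    noise = torch.normal(mean=0.0, std=sigma * sensitivity, size=(1,))
    return mean + noise[0]
```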

Let’s say I have two identical datasets, and only one of them is normalized with the mean and std computed on itself. If I use DP-SGD with the same settings to train models on these two datasets, I will obtain the same epsilon through the API, but the two models provide different differential privacy guarantees. Am I right?

The clipping is done on the obtained gradients, while BatchNorm is applied in each forward pass. I think the gradient of each sample can still be bounded by the clipping step, so we still get the sensitivity.

So here the two models will have the same privacy guarantee, but one of them (the one that operates on normalized data) is kind of useless “as is”, because it needs the mean and std to be able to normalize samples from the test set. Releasing this mean and std incurs an additional privacy loss.

This topic seems unrelated; for further questions, please open a new issue.

You are correct that the gradients are still bounded by the clipping value, but the sensitivity is the difference in the (clipped) gradients’ sum when we add or remove one sample. If we use batch norm, removing one sample will affect not only that sample’s (clipped) gradient but also the other samples’ gradients, so the sensitivity is higher.
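
To see the coupling concretely, here is a toy sketch (a hypothetical model in plain PyTorch, not Opacus code): with BatchNorm, perturbing one sample in the batch changes another sample’s gradient, whereas with a per-sample normalization such as GroupNorm it does not.

```python
import torch
import torch.nn as nn

def first_sample_grad(norm_ctor, batch):
    """Gradient of sample 0's own loss w.r.t. the first linear layer,
    computed while the whole batch goes through the network."""
    torch.manual_seed(1)                      # identical init on every call
    model = nn.Sequential(nn.Linear(4, 4), norm_ctor(), nn.Linear(4, 1))
    loss = model(batch)[0].pow(2).sum()       # loss of sample 0 only
    loss.backward()
    return model[0].weight.grad.clone()

torch.manual_seed(0)
x = torch.randn(8, 4)
y = x.clone()
y[7] += 10.0                                  # perturb a *different* sample

# BatchNorm: sample 0's gradient changes when sample 7 changes.
print(torch.allclose(first_sample_grad(lambda: nn.BatchNorm1d(4), x),
                     first_sample_grad(lambda: nn.BatchNorm1d(4), y)))   # False

# GroupNorm (per-sample normalization): it does not.
print(torch.allclose(first_sample_grad(lambda: nn.GroupNorm(1, 4), x),
                     first_sample_grad(lambda: nn.GroupNorm(1, 4), y)))  # True
```

GroupNorm, LayerNorm, and InstanceNorm normalize within a single sample, which is why they are the usual DP-friendly replacements for BatchNorm.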