Why is the BatchNorm layer not compatible with DP-SGD?

I want to understand why the BatchNorm layer is not compatible with DP-SGD. I’ve seen two explanations:

  1. BatchNorm computes the mean and variance across the batch, which creates a dependency between the samples in a batch and is therefore a privacy violation. This is noted in the Opacus documentation (Opacus · Train PyTorch models with Differential Privacy) and in [1], [2].

  2. At evaluation time, BatchNorm uses the mean and variance computed from the training data, as noted in [3].

Explanation 1 is cited more widely, but I don’t understand why a computation that happens before the DP mechanism is applied would violate privacy.
Explanation 2 is easy to understand, but the approach proposed in [3] for adapting BatchNorm to DP does not address the reason given in explanation 1.

I’m quite confused. Can somebody tell me why?
Thanks in advance.

[1] Shamsabadi A S, Papernot N. Losing Less: A Loss for Differentially Private Deep Learning. 2021.
[2] Yu D, Zhang H, Chen W, et al. Do not let privacy overbill utility: Gradient embedding perturbation for private learning. arXiv preprint arXiv:2102.12677, 2021.
[3] Davody A, Adelani D I, Kleinbauer T, et al. On the effect of normalization layers on Differentially Private training of deep Neural networks. arXiv preprint arXiv:2006.10919, 2020.

Explanation 1 is correct. In DP-SGD, we replace the sum of gradients with a “noisy sum”: each sample participates independently with probability q (the sampling rate), its gradient is clipped, and Gaussian noise is added to the sum.
The important aspect is that each sample’s contribution to the sum is bounded (in our case by the clipping constant C). In particular, adding or removing a sample from a batch changes the sum of gradients by at most C in norm. With batch norm, adding or removing a sample also changes the other samples’ gradients, so its contribution is no longer bounded.
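
To make the argument concrete, here is a rough pure-Python sketch (not the Opacus implementation). The helper names `clip`, `noisy_sum`, `clipped_sum` and `batch_norm`, the noise scale `sigma * C`, and the toy numbers are all assumptions chosen for illustration. It shows that adding one sample moves the clipped sum by at most C, while adding one sample to a batch-normalized batch changes the normalized value of every other sample.

```python
import math
import random

def clip(grad, C):
    """Rescale grad so that its L2 norm is at most C."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [x * scale for x in grad]

def noisy_sum(per_sample_grads, q, C, sigma, rng):
    """Include each gradient independently with probability q, clip it to
    norm C, sum, then add Gaussian noise with std sigma * C (an assumed,
    commonly used scaling)."""
    total = [0.0] * len(per_sample_grads[0])
    for g in per_sample_grads:
        if rng.random() < q:
            total = [t + x for t, x in zip(total, clip(g, C))]
    return [t + rng.gauss(0.0, sigma * C) for t in total]

def clipped_sum(per_sample_grads, C):
    """Noiseless, unsampled clipped sum, used only to inspect sensitivity."""
    total = [0.0] * len(per_sample_grads[0])
    for g in per_sample_grads:
        total = [t + x for t, x in zip(total, clip(g, C))]
    return total

def batch_norm(xs, eps=1e-5):
    """Normalize scalar activations using the batch's own mean and variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

C = 1.0
grads = [[3.0, 4.0], [0.1, 0.2], [-2.0, 0.5]]  # toy per-sample gradients
extra = [10.0, -10.0]                          # the added sample's gradient

before = clipped_sum(grads, C)
after = clipped_sum(grads + [extra], C)
shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
print("change in clipped sum:", shift)  # at most C, up to rounding

acts = [1.0, 2.0, 3.0]                     # toy activations of three samples
bn_before = batch_norm(acts)
bn_after = batch_norm(acts + [100.0])[:3]  # same three samples, one more in batch
print("batch norm without the extra sample:", bn_before)
print("batch norm with the extra sample:   ", bn_after)
```

Because the normalized activations of the original three samples all change, their gradients change too, so the extra sample’s influence on the sum is no longer limited to its own clipped gradient.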

Hi alex,
Thanks for your information! I have another question.
What if I normalize the dataset before training, for example with torchvision.transforms.Normalize(mean, std) in PyTorch? Would the adjacent datasets in the definition of differential privacy then be the normalized datasets differing by one sample, rather than the original datasets? I ask because adding or removing a sample changes the normalization and therefore affects the whole dataset.
Is there any difference in the guarantee provided by DP-SGD with or without dataset normalization, given that we want to protect the original data?

DP really aims at hiding the presence of individual samples, so whether the data is normalized or not doesn’t really make a difference. Note that this assumes that mean and std are not computed on the dataset itself, or that they are computed privately!
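
The caveat can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the helper names, the toy data, the clipping bounds `lo` and `hi`, and the choice of the Laplace mechanism with a publicly known dataset size and replace-one adjacency. It contrasts normalizing with fixed constants, normalizing with statistics computed on the data, and one possible way to estimate the mean privately.

```python
import random

def normalize_fixed(xs, mean, std):
    """Per-sample map with constants chosen independently of the data."""
    return [(x - mean) / std for x in xs]

def normalize_with_data_stats(xs):
    """Normalization whose mean and std are computed on the dataset itself."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def private_mean(values, lo, hi, epsilon, rng):
    """Laplace-mechanism estimate of the mean.

    Assumes the dataset size n is public and that adjacent datasets differ
    by replacing one record, so the clipped mean has sensitivity (hi - lo) / n.
    """
    n = len(values)
    clipped = [min(max(v, lo), hi) for v in values]
    scale = (hi - lo) / (n * epsilon)
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(clipped) / n + noise

data = [0.2, 0.4, 0.9]
neighbour = data + [0.95]  # adjacent dataset: one extra record

# Fixed constants: the shared records are mapped identically in both datasets.
fixed_a = normalize_fixed(data, 0.5, 0.25)
fixed_b = normalize_fixed(neighbour, 0.5, 0.25)[:3]

# Data-dependent statistics: every shared record changes.
stats_a = normalize_with_data_stats(data)
stats_b = normalize_with_data_stats(neighbour)[:3]

rng = random.Random(0)
estimate = private_mean(data, lo=0.0, hi=1.0, epsilon=1.0, rng=rng)
print(fixed_a == fixed_b, stats_a == stats_b, estimate)
```

With fixed constants, one record still influences only its own normalized value. With statistics computed on the data, a single record shifts every normalized value, which is the same kind of coupling as in batch norm; estimating the statistics privately spends some privacy budget that has to be accounted for separately.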

Let’s say I have two identical datasets, and only one of them is normalized, with mean and std computed on the dataset itself. If I use DP-SGD with the same settings to train models on these two datasets, I will obtain the same epsilon through the API, but the two models provide different differential privacy guarantees. Am I right?