I’ve been going around this topic for a while now and could not find a satisfying answer:
As far as I’ve understood, the privacy accounting mechanisms of Opacus and TF Privacy are based on the same sampled Gaussian mechanism, or are at least similar to some extent. However, Opacus seems to have a ‘problem’ with Batch Norm layers, because per-sample gradients cannot be traced through them, which makes sense. TF Privacy, on the other hand, does not seem to have a problem, or at least not a documented one, with these kinds of layers. If this is the case, what is the difference between the two that makes TFP immune to the downside of Batch Norm? Or does TFP simply not care about the Batch Norm layers and therefore yield a privacy budget that is, so to speak, wrong?
Another question would be: if we set the hyperparameters of our PrivacyEngine by ourselves, meaning that we do not state an upper bound on epsilon in advance, would we be able to calculate the privacy budget spent after each iteration even with Batch Norm layers present in our model or are these layers completely unsupported because of the reason above?
I am not sure how TF Privacy handles batch norm, but in general batch norm is not DP-friendly, because DP-SGD assumes that one sample does not influence other samples’ gradients in a batch. This means that if you use batch norm (in the sense of normalizing across samples in a batch), you cannot use DP-SGD in the regular sense.
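To make the coupling concrete, here is a minimal sketch in plain Python (no Opacus or TF Privacy involved; `batch_norm` is a hypothetical helper written for this illustration, not a library call). It normalizes scalars with batch statistics and shows that changing one sample changes the output of another sample, which is exactly what breaks the DP-SGD assumption:

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize a list of scalars using mean/variance taken across the batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

batch_a = [1.0, 2.0, 3.0, 4.0]
batch_b = [1.0, 2.0, 3.0, 40.0]  # only the last sample differs

out_a = batch_norm(batch_a)
out_b = batch_norm(batch_b)

# The first sample is identical in both batches, yet its normalized value
# (and therefore its gradient further down the network) is different.
first_sample_changed = abs(out_a[0] - out_b[0]) > 1e-3
print(first_sample_changed)
```

Because the output for the first sample depends on the last one, clipping each per-sample gradient to C no longer bounds how much a single sample can move the summed gradient.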
However, there are some subtleties.
In practice, “batch” norm for images means normalization across the B, H and W dimensions (batch, height and width). This means that you can normalize each image individually, over H and W only, and it will still work. This is what happens if you do microbatching (forward/backward samples one by one), and it can be obtained efficiently in Opacus by using functorch to compute per-sample gradients (GradSampleModule set to “no_op”).
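A rough sketch of that idea, again in plain Python and with single-channel images for brevity (`normalize_per_sample` is a made-up name for this example, not an Opacus function): when each image is processed as its own microbatch, the statistics come from H and W only, so the other images in the batch have no influence.

```python
import math

def normalize_per_sample(image, eps=1e-5):
    """Normalize one H x W image using only its own mean and variance."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    scale = math.sqrt(var + eps)
    return [[(p - mean) / scale for p in row] for row in image]

image_0 = [[1.0, 2.0], [3.0, 4.0]]
other_a = [[0.0, 1.0], [1.0, 0.0]]
other_b = [[50.0, 60.0], [70.0, 80.0]]

# Process each batch one sample at a time (microbatches of size 1).
out_with_a = [normalize_per_sample(img) for img in (image_0, other_a)]
out_with_b = [normalize_per_sample(img) for img in (image_0, other_b)]

# image_0 gets the same output no matter which image shares its batch.
unaffected = out_with_a[0] == out_with_b[0]
print(unaffected)
```

Note that this is no longer batch norm in the cross-sample sense; it behaves like a per-image normalization, which is why it stays compatible with per-sample clipping.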
There should still be a way to compute an epsilon even if you use batch norm and accept that one sample “contaminates” the other samples in the batch. For example, if you have a constant batch size of B and a clipping constant of C, the new sensitivity is C*B, since in the worst case changing one sample changes all B clipped gradients. However, this is probably a very gross overestimation of the sensitivity and will lead to a ridiculously high privacy budget.
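To get a feel for how pessimistic that bound is, here is a back-of-the-envelope sketch. It assumes the noise standard deviation stays at noise_multiplier * C, as in ordinary DP-SGD, and takes the C*B sensitivity from above at face value; the function name is hypothetical:

```python
def effective_noise_multiplier(noise_multiplier, clip_norm, batch_size):
    """Noise-to-sensitivity ratio when the sensitivity is C*B instead of C."""
    noise_std = noise_multiplier * clip_norm          # noise actually added
    worst_case_sensitivity = clip_norm * batch_size   # assumed bound C*B
    return noise_std / worst_case_sensitivity

sigma_eff = effective_noise_multiplier(
    noise_multiplier=1.0, clip_norm=1.0, batch_size=256
)
print(sigma_eff)
```

In other words, the accountant would have to be run with a noise multiplier that is B times smaller than the nominal one, which is why the resulting epsilon blows up.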
As far as I can see, TFP does not depend on the model architecture at all. It just uses the sampling probability (basically BATCH_SIZE / TRAIN_SET_SIZE), the noise multiplier, the number of epochs and the delta value that can be tolerated. It uses the Gaussian Query Mechanism for sums, which I believe is also used when we set accountant=‘gdp’ on our privacy engine. They simply wrap the optimizer with its DP version and calculate the privacy budget with the parameters mentioned above.
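As a sketch of what goes into such an accountant (plain Python, not the actual TF Privacy or Opacus code; both function names are made up here), the inputs are derived from the training setup alone. The second function uses the central-limit-theorem approximation for Poisson subsampling from the Gaussian DP literature, and it is an assumption on my part that either library follows it in this exact form:

```python
import math

def accountant_inputs(train_set_size, batch_size, epochs):
    """Derive sampling probability and step count from the training setup."""
    sample_rate = batch_size / train_set_size
    steps = math.ceil(epochs * train_set_size / batch_size)
    return sample_rate, steps

def gdp_mu_poisson(sample_rate, steps, noise_multiplier):
    """CLT approximation of the GDP parameter mu under Poisson subsampling."""
    return sample_rate * math.sqrt(
        steps * (math.exp(noise_multiplier ** -2) - 1)
    )

q, steps = accountant_inputs(train_set_size=60000, batch_size=256, epochs=60)
mu = gdp_mu_poisson(q, steps, noise_multiplier=1.1)
print(q, steps, mu)
```

Nothing about the model enters the calculation, so the result is only valid if the training step really has per-sample sensitivity C, which is the part batch norm puts in question.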
Now of course the question would not be why they are doing this, but rather why Opacus is dependent on the model architecture and/or if this actually results in a tighter/better analysis of the privacy budget.
(sorry for the late reply) I believe TFP is doing micro-batching, i.e. forwarding samples one by one to compute per-sample gradients (essentially option 1 from my message above). In Opacus, the way we compute per-sample gradients is different (einsums) and more efficient (as it allows batched computations).
In a nutshell, Opacus should be faster but the privacy budget is not tighter/better.