Question regarding Gradient Clipping

Hello!
I have a question about Gradient Clipping, that arises from the following principles of privacy accounting and DP-SGD:

The RDP calculation for each step in training is based on the ratio of the noise std. dev. to the maximum norm bound of the gradients. This ratio is known as the noise multiplier. As long as the ratio stays the same, the privacy guarantee for a given real-valued function does not change. So if I want to increase the maximum norm bound (sensitivity) of a real-valued function, the noise std. dev. just has to be scaled by the same amount to satisfy the same privacy guarantee (see Opacus Issue #11, as well as Proposition 7 / Corollary 3 in https://arxiv.org/pdf/1702.07476.pdf).
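To make this concrete, here is a small sketch (not Opacus code) of the Gaussian-mechanism RDP bound eps(alpha) = alpha / (2 * z^2), showing that scaling B and sigma by the same factor leaves the guarantee unchanged:

```python
# Sketch: the RDP guarantee of the Gaussian mechanism depends only on the
# noise multiplier z = sigma / B, not on B and sigma individually.
def rdp_epsilon(alpha, clip_bound, sigma):
    """Order-alpha RDP epsilon for sensitivity `clip_bound` and noise std
    `sigma`: alpha / (2 * z**2), where z = sigma / clip_bound."""
    z = sigma / clip_bound
    return alpha / (2 * z**2)

alpha = 10.0
base = rdp_epsilon(alpha, clip_bound=1.0, sigma=1.0)
scaled = rdp_epsilon(alpha, clip_bound=2.0, sigma=2.0)  # both scaled by 2
assert abs(base - scaled) < 1e-12  # z is unchanged, so the guarantee is too
```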

Given this, I want to discuss the following example:
Suppose (for the sake of simplicity) that I have chosen a norm bound of B = 1, and that the corresponding noise std. dev. sigma is also 1. The noise multiplier is z = sigma / B = 1, and this real-valued function then satisfies (alpha, alpha / (2 * z^2))-RDP.
Consider then the following two cases during training:

  • If the gradient is of size 1 at a particular step in training, the noise fits the exact sensitivity of the gradient and the privacy is accounted for in a reasonable way.
  • However, if the gradient norm is smaller than the bound, say 0.5, the noise of scale 1 is suddenly too big for the now smaller sensitivity. As stated above, B and sigma could both be scaled down to 0.5 to satisfy the same privacy guarantee as before. Worded differently, if for this step we set B = 0.5 (which is just as valid a clipping bound as 1 and yields the same clipped gradient) but keep sigma = 1, this would satisfy a different privacy guarantee while providing the same update to the parameters as B = 1, sigma = 1. More specifically, the guarantee should be equivalent to doubling z, resulting in (alpha, alpha / (2 * (2*z)^2)) = (alpha, alpha / (8 * z^2))-RDP.
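The two cases above can be compared numerically with the same Gaussian-mechanism bound, eps(alpha) = alpha / (2 * z^2) (an illustration of the argument, not an Opacus API):

```python
# Comparing the two cases: retroactively treating B = 0.5 with sigma = 1
# is equivalent to doubling the noise multiplier z.
def rdp_epsilon(alpha, z):
    return alpha / (2 * z**2)

alpha = 10.0
# Case 1: gradient norm 1, B = 1, sigma = 1  ->  z = 1
eps_case1 = rdp_epsilon(alpha, z=1.0)  # alpha / 2
# Case 2: gradient norm 0.5; pretend B = 0.5 while keeping sigma = 1 -> z = 2
eps_case2 = rdp_epsilon(alpha, z=2.0)  # alpha / 8
assert eps_case2 == eps_case1 / 4  # a four-times-stronger guarantee per step
```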

My question now is: is there an obvious reason why this is not considered in privacy accounting? The accountant does not appear to take the actual scale of the gradients into account, nor to scale the noise accordingly. The clipping threshold and noise multiplier are constant hyperparameters freely chosen by the user of Opacus, so the noise added to the gradients is also constant. Since the gradient norms most definitely are not constant, the accounting sometimes corresponds to the first case above, where the noise is correctly scaled, and at other times to the second case, where the accounting is not tight (or rather, always pessimistic), meaning we add more noise than the stated guarantee requires and hurt the model's utility.
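For reference, here is a toy sketch of what I understand a DP-SGD step to do (simplified, not Opacus internals): the clipping bound B and the noise scale sigma = z * B are fixed up front, so the added noise never depends on the actual per-sample norms.

```python
import numpy as np

# Toy DP-SGD step: clip each per-sample gradient to norm <= clip_bound,
# then add Gaussian noise whose scale is fixed by the hyperparameters alone.
def dp_sgd_step(per_sample_grads, clip_bound, noise_multiplier, rng):
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        factor = min(1.0, clip_bound / max(norm, 1e-12))  # guard against 0
        clipped.append(g * factor)
    summed = np.sum(clipped, axis=0)
    sigma = noise_multiplier * clip_bound  # does NOT look at actual norms
    noisy = summed + rng.normal(0.0, sigma, size=summed.shape)
    return noisy / len(per_sample_grads)

# Example: a single gradient of norm 5 is clipped down to norm 1.
rng = np.random.default_rng(0)
update = dp_sgd_step([np.array([3.0, 4.0])], clip_bound=1.0,
                     noise_multiplier=0.0, rng=rng)  # noiseless for clarity
```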

Could you address this concern and whether it is possible to mitigate this using something like an adaptive clipping bound/noise during training? Or is there something I am missing?

Thanks in advance!

Hi!

You are right: the accountant does not consider the scale of the gradients in each batch, but this is intentional :slight_smile: The TLDR answer is that if we allow ourselves to look at the norm of each gradient, we would know when a batch contains an outlier. Even if we do eventually clip it, this knowledge is a further source of information leakage that would have to be addressed by our analysis.

If it’s helpful, let’s look at this from another angle: our guarantees will always have to be pessimistic, because by nature we need to protect the privacy of every sample, so we gotta make sure that whatever we do, we protect each and every outlier. Let’s go back to the DP definition and the canonical example of an adversary that sees some outputs and needs to decide whether they came from dataset D or dataset D’. The two differ by a single example, so let’s say all examples are “easy” except that one, which gets a huge gradient. A procedure that looks first and clips later would output the same model, but an adversary watching the training loop as it trains would now know who the outlier is and beat us at the D vs. D’ game.

That being said, this is potentially one of the things we could do if we relaxed our guarantees by making less pessimistic assumptions about the capabilities of our adversary. It would be interesting to think about, but to my knowledge it’s never been done.
