I have a question about gradient clipping that arises from the following principles of privacy accounting and DP-SGD:
The RDP calculation for each step in training is based on the ratio between the std. deviation of the noise added to the gradients and the maximum norm bound of those gradients. This ratio is known as the noise multiplier. As long as the ratio stays the same, the privacy guarantee for a given real valued function does not change. So if I want to increase the maximum norm bound (sensitivity) of a real valued function, the noise std. dev. just has to be scaled by the same factor to satisfy the same privacy guarantee (see also Opacus Issue #11, as well as Proposition 7 / Corollary 3 in https://arxiv.org/pdf/1702.07476.pdf).
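To make the scaling argument concrete, here is a minimal sketch of the per-step RDP bound for the plain Gaussian mechanism. The function name `gaussian_rdp_epsilon` is my own and not part of Opacus, and the sketch ignores subsampling and composition over steps, which a real accountant would also handle.

```python
def gaussian_rdp_epsilon(alpha: float, sensitivity: float, sigma: float) -> float:
    """RDP epsilon at order alpha for a Gaussian mechanism (no subsampling).

    Hypothetical helper for illustration only; it depends on sensitivity
    and sigma only through the noise multiplier z = sigma / sensitivity.
    """
    z = sigma / sensitivity
    return alpha / (2.0 * z ** 2)


# Same noise multiplier (z = 1) at two different scales.
base = gaussian_rdp_epsilon(alpha=2.0, sensitivity=1.0, sigma=1.0)
scaled = gaussian_rdp_epsilon(alpha=2.0, sensitivity=3.0, sigma=3.0)

print(base, scaled)
```

Both calls give the same epsilon because only the ratio sigma / sensitivity enters the formula.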
Given this, I want to discuss the following example:
Suppose (for the sake of simplicity) that I have chosen a norm bound of B = 1, and that the corresponding noise std. dev. sigma is also 1. The noise multiplier is then z = sigma/B = 1, and this real valued function satisfies (alpha, alpha / (2*z^2))-RDP.
Consider then the following two cases during training:
- If the gradient is of size 1 at a particular step in training, the noise fits the exact sensitivity of the gradient and the privacy is accounted for in a reasonable way.
- However, if the gradient norm is less than the norm bound, let's say of size 0.5, the noise of scale 1 is suddenly too big for the now smaller sensitivity. As stated above, B and sigma could both be scaled down to 0.5 to satisfy the same privacy guarantee as before. Worded differently, if for this step we changed B to 0.5 (which is just as valid a clipping bound as 1 and leaves the gradient, and therefore the update, unchanged) but kept sigma = 1, this would satisfy a different privacy guarantee while providing the same update to the parameters as B = 1, sigma = 1. More specifically, the guarantee should be equivalent to doubling z, resulting in (alpha, alpha / (2*(2*z)^2)) = (alpha, alpha / (8*z^2))-RDP.
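The second case can be checked numerically with a small sketch. The helpers `clip_to_norm` and `rdp_epsilon` are hypothetical names of mine, not Opacus functions, and the bound used is the plain (non-subsampled) Gaussian one from above, so this only reproduces the arithmetic of the example.

```python
import math


def clip_to_norm(grad, bound):
    """Scale grad down so that its L2 norm is at most bound."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, bound / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def rdp_epsilon(alpha, bound, sigma):
    """Per-step Gaussian RDP bound with noise multiplier z = sigma / bound."""
    z = sigma / bound
    return alpha / (2.0 * z ** 2)


grad = [0.5, 0.0]   # gradient of norm 0.5, as in the second case
sigma = 1.0         # noise std. dev. kept fixed
alpha = 2.0         # arbitrary RDP order chosen for illustration

# Both bounds leave this particular gradient untouched.
clipped_b1 = clip_to_norm(grad, 1.0)
clipped_b05 = clip_to_norm(grad, 0.5)

eps_b1 = rdp_epsilon(alpha, 1.0, sigma)    # z = 1
eps_b05 = rdp_epsilon(alpha, 0.5, sigma)   # z = 2

print(clipped_b1 == clipped_b05, eps_b1, eps_b05)
```

With the same clipped gradient and the same noise, the accounted epsilon differs by a factor of 4 depending on which bound is declared, which is the discrepancy I am asking about.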
My question now is: is there an obvious reason why this is not considered in privacy accounting? It does not seem to me that the accountant takes the actual scale of the gradients into consideration or scales the noise accordingly. The clipping threshold and noise multiplier are constant hyperparameters that are freely chosen by the user of Opacus. Because these are constant, the noise that is added to the gradients is also always constant. As the sizes of the gradients most definitely are not constant, this leads me to believe that the privacy calculation would sometimes fall into the first case, for which the noise is correctly scaled, and at other times into the second case listed above, for which the noise is not accurate (or rather always pessimistic), so that we add too much noise for a given guarantee, hurting the model's utility.
Could you address this concern and whether it is possible to mitigate this using something like an adaptive clipping bound/noise during training? Or is there something I am missing?
Thanks in advance!