Gradients before clipping are much larger than the clip bound


I’m using Opacus to implement DP-SGD in my program. One thing I noticed is that the gradients to be clipped are much larger than max_grad_norm, as shown in the screenshot of the console output.

The code is from the Opacus tutorial opacus/building_image_classifier.ipynb at main · pytorch/opacus · GitHub; I only modified MAX_PHYSICAL_BATCH_SIZE to fit my machine. I print the per_sample_norms in the clip_and_accumulate function of Opacus’s DPOptimizer class with the following code:

    def clip_and_accumulate(self):
        per_param_norms = [
            g.view(len(g), -1).norm(2, dim=-1) for g in self.grad_samples
        ]
        per_sample_norms = torch.stack(per_param_norms, dim=1).norm(2, dim=1)
        per_sample_clip_factor = (
            self.max_grad_norm / (per_sample_norms + 1e-6)
        ).clamp(max=1.0)

        # added for debugging: report the 80th percentile of per-sample norms
        print(f'80% of gradient norms are less than {np.percentile(per_sample_norms.cpu(), 80)}')

        for p in self.params:
            grad_sample = _get_flat_grad_sample(p)
            # sum the clipped per-sample gradients over the batch
            grad = torch.einsum("i,i...", per_sample_clip_factor, grad_sample)

            if p.summed_grad is not None:
                p.summed_grad += grad
            else:
                p.summed_grad = grad


I get the following results: the per_sample_norms are much larger than the clip bound, which is C = 1.2.

Can somebody tell me why? It seems strange that the gradients before clipping are so much larger than the clip bound.

Thanks in advance

Hi Zark,

I believe this is the correct behavior. The gradients before clipping can have any magnitude; it’s the job of the clipping step in DP-SGD to bring them into the acceptable range. So, gradients before clipping are in arbitrary ranges; once we clip them, they are no larger than the bound.
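For illustration, here is a minimal sketch of the clipping arithmetic (the norm values are made up; C = 1.2 as in your run, using the same formula as clip_and_accumulate):

```python
import torch

C = 1.2  # clip bound (max_grad_norm)
# hypothetical per-sample gradient norms, far above the bound
per_sample_norms = torch.tensor([35.0, 18.5, 2.0, 0.5])

# same formula as in DPOptimizer.clip_and_accumulate
clip_factor = (C / (per_sample_norms + 1e-6)).clamp(max=1.0)
clipped_norms = per_sample_norms * clip_factor

print(clipped_norms)  # every norm is now at most C; small norms are untouched
```

Large norms are scaled down to exactly C, while samples already below the bound (here, 0.5) pass through unchanged.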

Does that make sense?

Thanks, got it. I’m wondering what makes a good choice of clip bound. Maybe we should check the distribution of gradient norms and then select the clip bound by percentile, like the 80th?
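The percentile idea could be sketched like this (the norm values below are made up; note that picking C from the training data itself can leak information, so in a strict DP setting the bound is usually tuned as a fixed hyperparameter instead):

```python
import numpy as np

# hypothetical per-sample gradient norms logged from one batch
norms = np.array([0.8, 1.5, 2.2, 3.2, 7.4, 12.1, 25.0, 40.3])

# pick the clip bound at the 80th percentile of observed norms
C = np.percentile(norms, 80)
print(C)
```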

Hi @Zark

Great question. As far as I know, the clip bound is a hyperparameter and we should treat it like any other hyperparameter, such as the learning rate. A recent paper uses 0.1, 1, and 10 (Table 7). That paper also sheds light on the choice of clip bound and on how to do image classification with DP, so I think it is very relevant to your use case.

Best of luck

Hi ashkan,

thanks for sharing this paper. I’ll have a look at it 🙂


One more relevant paper to look at is Large Language Models Can Be Strong Differentially Private Learners. The authors explore a wide variety of hyperparameters for a large-language-model fine-tuning task, including the clipping norm (Section 3.1.2).

tl;dr – clipping most (if not all) gradients might even be the desired behaviour, as it maximizes the signal-to-noise ratio. Clipping itself shouldn’t have a major effect on model training: the scale of the weights in most layers doesn’t matter, and the model will adapt to the new scale as it trains.
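A back-of-the-envelope sketch of the signal-to-noise argument (all numbers hypothetical): the Gaussian noise added to the summed gradient has standard deviation sigma * C, while each clipped per-sample gradient contributes at most norm C. If the true norms sit around 5.0, shrinking C until everything clips keeps the ratio flat, while raising C above the true norms only inflates the noise:

```python
# signal-to-noise sketch for DP-SGD clipping (hypothetical numbers)
n, sigma = 256, 1.0        # batch size and noise multiplier
true_norm = 5.0            # assumed typical per-sample gradient norm

def snr(C):
    signal = n * min(C, true_norm)  # each sample contributes at most norm C
    noise = sigma * C               # noise std scales with the clip bound
    return signal / noise

for C in (0.1, 1.0, 10.0):
    print(C, snr(C))  # SNR is flat once every gradient clips, worse above
```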
