Why does the effective scale of max_grad_norm differ between loss_reduction='mean' and loss_reduction='sum'?

I have a question regarding the effective scale of the gradient clipping threshold when using the default loss_reduction='mean'.

According to the Opacus tutorial (“Building an Image Classifier with Differential Privacy”), max_grad_norm is defined as:

“Max Grad Norm: The maximum L2 norm of per-sample gradients before they are aggregated by the averaging step.”

However, my understanding of the current implementation is that when a PyTorch loss function uses reduction='mean' and make_private is called with loss_reduction='mean', the per-sample gradients captured by Opacus are already scaled down by 1/B (where B is the batch size) at the loss computation stage.

Opacus then appears to clip these 1/B-scaled gradients directly against the provided max_grad_norm (C). Since clipping g/B at threshold C is equivalent to clipping the raw gradient g at threshold B·C, the effective clipping threshold relative to the raw, unscaled per-sample gradients becomes B times larger (B×C) than the theoretical DP-SGD definition.
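To make the concern concrete, here is a minimal plain-PyTorch sketch (not using Opacus itself; the toy model, batch size, and clip function are my own illustrative assumptions). It shows that per-sample gradients derived from a mean-reduced loss are exactly the sum-reduced ones scaled by 1/B, and that clipping the 1/B-scaled gradients at C produces the same update contribution as clipping the raw gradients at B·C:

```python
import torch

torch.manual_seed(0)
B = 4                                  # assumed toy batch size
w = torch.randn(3, requires_grad=True) # toy linear model parameters
x = torch.randn(B, 3)
y = torch.randn(B)

def per_sample_grads(scale):
    """Gradient of scale * squared error for each sample separately."""
    grads = []
    for i in range(B):
        loss = scale * (x[i] @ w - y[i]) ** 2
        g, = torch.autograd.grad(loss, w)
        grads.append(g)
    return torch.stack(grads)

g_sum = per_sample_grads(1.0)      # what reduction='sum' yields per sample
g_mean = per_sample_grads(1.0 / B) # what reduction='mean' yields per sample

# Mean-reduced per-sample gradients are the raw ones scaled by 1/B:
assert torch.allclose(g_mean * B, g_sum)

def clip(g, c):
    """Standard DP-SGD L2 clipping of a single per-sample gradient."""
    return g * (c / g.norm()).clamp(max=1.0)

# Clipping the 1/B-scaled gradients at C is equivalent to clipping
# the raw gradients at B*C (up to the final 1/B averaging factor):
C = 0.1
for i in range(B):
    assert torch.allclose(clip(g_mean[i], C) * B, clip(g_sum[i], B * C))
```

The second assertion is the crux of the question: if the tutorial's definition is meant to apply to raw per-sample gradients, clipping the already-averaged ones at C behaves like a threshold of B·C.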

I would like to ask for your insights on this behavior:

  • Is this the intended design? Should users manually scale down max_grad_norm by 1/B when using loss_reduction='mean' to align with the standard DP-SGD theory?

  • Or is Opacus supposed to automatically handle this scale correction internally so that max_grad_norm consistently applies to the unscaled gradients?

Thank you!