Why divide noise by batch size?

for p, clip_value in zip(params, clip_values):
    # Gaussian noise calibrated to this parameter's clipping norm
    noise = self._generate_noise(clip_value, p)
    # scale the noise down when gradients are averaged over the batch
    if self.loss_reduction == "mean":
        noise /= batch_size
    # in distributed training, add the noise on one worker only
    if self.rank == 0:
        p.grad += noise

The noise is added to the averaged gradient, so why should it be divided by the batch size?
Since the output is the final gradient, shouldn't the noise be added directly, without being divided?

Compare two cases. If there’s no reduction (self.loss_reduction == “sum”), then we want to add noise calibrated to the clipping norm C. Indeed, let the gradients be g_1, …, g_B, each clipped to norm at most C. The sum is g_1 + … + g_B, and to make it private the additive noise is sampled from the Gaussian distribution N(0, sigma^2 * C^2) so that it masks the presence or absence of any one gradient vector.
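
A minimal sketch of the sum case, assuming per-sample gradients are already clipped so that ||g_i|| <= C; the names private_sum, per_sample_grads, clip_norm, and sigma are illustrative, not the library’s API:

import torch

def private_sum(per_sample_grads: torch.Tensor, clip_norm: float, sigma: float) -> torch.Tensor:
    # per_sample_grads: (B, d) tensor of already-clipped gradients, ||g_i|| <= clip_norm
    summed = per_sample_grads.sum(dim=0)  # g_1 + ... + g_B
    # additive Gaussian noise N(0, sigma^2 * C^2), masking any one g_i
    noise = torch.normal(0.0, sigma * clip_norm, size=summed.shape)
    return summed + noise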

If the reduction function is mean, then the output is (g_1 + … + g_B) / B. What should the additive noise be in this case? I think it’s pretty obvious that it must be the noise from before, scaled down by a factor of B. The only difference between the two cases is the scaling factor 1/B, and it should be applied equally to both the sensitive inputs and the noise. Right?
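
And the corresponding sketch for the mean case, under the same assumptions; note that the 1/B scaling is applied to the sensitive sum and the noise alike:

def private_mean(per_sample_grads: torch.Tensor, clip_norm: float, sigma: float) -> torch.Tensor:
    B = per_sample_grads.shape[0]
    avg = per_sample_grads.sum(dim=0) / B  # (g_1 + ... + g_B) / B
    # same noise as in the sum case, scaled down by the same factor B
    noise = torch.normal(0.0, sigma * clip_norm, size=avg.shape) / B
    return avg + noise

By construction, private_mean(g, C, sigma) has the same distribution as private_sum(g, C, sigma) / B, which is exactly what the noise /= batch_size line in the snippet above achieves.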


Cool, great, thanks Ilya