Perturbing the output of a loss function

Hi all,

As part of my team’s research, we are investigating applying a perturbation to the loss function of a neural network, so that we backpropagate a noisy loss rather than the true loss. We are experimenting with Gaussian noise with mean 0 and varying the standard deviation. We have implemented the following code:

loss_noisy = loss + np.random.normal(0, scale) * loss / loss.detach()


This adds the normalised loss (loss / loss.detach(), which is numerically equal to 1) back to the loss with a random scaling, to emulate the equation

L_noisy = L + N(0, scale)
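A pure-PyTorch version of this might look like the following (a sketch; the values of scale, pred and target are assumptions for illustration):

```python
import torch

# Toy setup (values are assumptions for illustration).
scale = 0.1
pred = torch.tensor([2.0, 3.0], requires_grad=True)
target = torch.tensor([1.0, 1.0])
loss = ((pred - target) ** 2).sum()

noise = torch.randn(()) * scale           # a draw from N(0, scale)
loss_noisy = loss + noise * loss / loss.detach()

# Numerically, loss / loss.detach() == 1, so loss_noisy == loss + noise,
# but the extra factor keeps the perturbation attached to the autograd graph.
```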

We have produced results which show degraded performance of the network with increased scaling. However, we are unsure exactly how this works under the hood. In particular, when the loss is computed over multiple samples, it is unclear how the perturbation is ‘distributed’ amongst the individual per-sample backpropagation contributions.

I was hoping that someone might have some experience with this, and would understand how the perturbation is distributed during backpropagation.




Hi Elliott!

Let’s first cut to the chase and understand how your perturbation
affects the gradients computed by .backward().

You have:

loss_noisy = (1 + fac) * loss

where fac is a random deviate divided by loss.detach().

loss.detach() is numerically equal to loss, but is not, itself,
differentiated in the backpropagation process.

The whole gradient-computation process is linear in the final loss
scalar, so if unperturbed_grad is the result you would have obtained
by backpropagating loss, backpropagating loss_noisy will give you:

some_parameter.grad = (1 + fac) * unperturbed_grad
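This can be verified directly (a minimal sketch; the toy parameters and the fixed noise draw are assumptions):

```python
import torch

# Toy parameters (values are assumptions for illustration).
w = torch.tensor([1.5, -0.5], requires_grad=True)
x = torch.tensor([2.0, 4.0])
loss = ((w * x) ** 2).sum()

# Gradient of the unperturbed loss.
(unperturbed_grad,) = torch.autograd.grad(loss, w, retain_graph=True)

noise = torch.tensor(0.07)                # a fixed draw for reproducibility
fac = noise / loss.detach()
loss_noisy = (1 + fac) * loss
(noisy_grad,) = torch.autograd.grad(loss_noisy, w)

# noisy_grad equals (1 + fac) * unperturbed_grad, elementwise.
```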

As for how the perturbation is distributed among individual samples,
your loss will typically be a sum (or average) over samples:

loss = loss_batch = loss_samples.sum()

(where loss_samples is a vector of length nBatch of per-sample losses.)

If you were to compute:

loss_samples_noisy = (1 + fac) * loss_samples

(where fac is still computed using the loss.detach() that has been summed
over the samples), then

loss_noisy = loss_samples_noisy.sum()

So your perturbation gets distributed over the per-sample losses simply
by multiplying them individually by the same linear factor you used to
produce loss_noisy.
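As a quick check (a sketch; the per-sample values are assumptions), distributing the factor over the per-sample losses reproduces the perturbed batch loss:

```python
import torch

# Per-sample losses for a toy batch (values are assumptions).
pred = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.zeros(3)
loss_samples = (pred - target) ** 2       # vector of length nBatch
loss = loss_samples.sum()

noise = torch.tensor(0.05)
fac = noise / loss.detach()               # uses the summed (batch) loss
loss_samples_noisy = (1 + fac) * loss_samples
loss_noisy = loss_samples_noisy.sum()

# loss_noisy equals (1 + fac) * loss, the same perturbed batch loss.
```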

As an aside, I would probably perturb (add noise to) the predictions
of your model or to your target values, rather than perturbing the loss itself.
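For example (a sketch; the tensors and the scale value are assumptions), adding noise to the predictions instead would look like:

```python
import torch
import torch.nn.functional as F

# Perturb the predictions rather than the loss (values are assumptions).
scale = 0.1
pred = torch.tensor([2.0, 3.0], requires_grad=True)
target = torch.tensor([1.0, 1.0])

noisy_pred = pred + torch.randn_like(pred) * scale
loss = F.mse_loss(noisy_pred, target)
loss.backward()                           # gradients reflect the noisy predictions
```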


K. Frank


Hi KFrank!

Thanks for the detailed response; this has given us a lot to chew on. The observation that

some_parameter.grad = (1 + fac) * unperturbed_grad

was incredibly helpful for us to investigate our problem.