Perturbing the output of a loss function

Hi all,

As part of my team’s research, we are investigating perturbing the loss of a neural network so that we backpropagate a noisy loss rather than the true loss. We are experimenting with Gaussian (normal) noise with mean 0 and varying the standard deviation. We have implemented the following code:

# loss / loss.detach() is numerically 1, but keeps the gradient graph
loss_noisy = loss + np.random.normal(0, scale) * loss / loss.detach()

loss_noisy.backward()

This adds the normalised loss back to the loss with a random scaling, to emulate the equation

L_noisy = L + N(0, scale)

Our results show that the network’s performance degrades as the scale increases. However, we are unsure exactly how this works under the hood. In particular, when the loss value is aggregated over multiple samples, it is unclear how the perturbation is ‘distributed’ amongst the individual per-sample backpropagation steps.

I was hoping that someone might have experience with this and could explain how the perturbation is distributed during backpropagation.

Regards,

Elliott


Hi Elliott!

Let’s first cut to the chase and understand how your perturbation
affects the gradients computed by .backward().

You have:

loss_noisy = (1 + fac) * loss

where fac is a random deviate divided by loss.detach().

loss.detach() is numerically equal to loss, but is not, itself,
differentiated in the backpropagation process.
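
As a minimal illustration of this (using a made-up scalar example), loss / loss.detach() evaluates to 1.0, but it still backpropagates a gradient, namely the gradient of loss scaled by 1 / loss:

import torch

x = torch.tensor(3.0, requires_grad=True)
loss = x ** 2                 # loss == 9
ratio = loss / loss.detach()  # numerically 1.0, but still part of the graph
ratio.backward()
print(ratio.item())           # 1.0
print(x.grad)                 # (d loss / dx) / loss = 6 / 9 ≈ 0.6667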

The whole gradient-computation process is linear in the final loss
scalar, so if unperturbed_grad is the result you would have obtained
by backpropagating loss, backpropagating loss_noisy will give you:

some_parameter.grad = (1 + fac) * unperturbed_grad
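
You can verify this scaling directly with a small self-contained sketch
(the toy linear model, random data, and the fixed fac value below are
all just placeholders):

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)

# gradient of the unperturbed loss
model.zero_grad()
loss.backward(retain_graph=True)
unperturbed_grad = model.weight.grad.clone()

# gradient of the perturbed loss -- fac stands in for
# np.random.normal(0, scale) / loss.detach(), fixed here for reproducibility
fac = 0.3
model.zero_grad()
((1 + fac) * loss).backward()

print(torch.allclose(model.weight.grad, (1 + fac) * unperturbed_grad))  # True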

As for how the perturbation is distributed among individual samples,
your loss will typically be a sum (or average) over samples:

loss = loss_batch = loss_samples.sum()

(where loss_samples is a vector of length nBatch of per-sample
losses).

If you were to compute:

loss_samples_noisy = (1 + fac) * loss_samples

(where fac still uses the loss.detach() computed from the loss summed
over the samples, i.e., the same fac as above), then

loss_noisy = loss_samples_noisy.sum()

So your perturbation gets distributed over the per-sample losses simply
by multiplying them individually by the same linear factor you used to
produce loss_noisy.
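
Concretely (with made-up per-sample losses), multiplying each per-sample
loss by the same (1 + fac) reproduces loss_noisy exactly:

import torch

torch.manual_seed(0)
pred = torch.randn(5, requires_grad=True)
target = torch.randn(5)

loss_samples = (pred - target) ** 2            # per-sample losses, length nBatch = 5
loss = loss_samples.sum()                      # the batch loss you backpropagate

fac = torch.randn(()) * 0.1 / loss.detach()    # noise scaled by the summed loss

loss_samples_noisy = (1 + fac) * loss_samples  # perturb each sample identically
loss_noisy = (1 + fac) * loss

print(torch.allclose(loss_noisy, loss_samples_noisy.sum()))  # True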

As an aside, I would probably perturb (add noise to) the predictions
of your model or to your target values, rather than perturbing loss
directly.
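
For completeness, here is one way such a prediction-level (or
target-level) perturbation might look; the model, data, and choice of
mse_loss are purely illustrative:

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

scale = 0.1                                         # noise level, analogous to your scale
pred = model(x)
pred_noisy = pred + scale * torch.randn_like(pred)  # noise on the predictions ...
# y_noisy = y + scale * torch.randn_like(y)         # ... or, instead, on the targets

loss = torch.nn.functional.mse_loss(pred_noisy, y)
loss.backward()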

Best.

K. Frank


Hi KFrank!

Thanks for the detailed response; this has given us a lot to chew on. The observation that

some_parameter.grad = (1 + fac) * unperturbed_grad

was incredibly helpful for us to investigate our problem.

Regards,
Elliott