# Correctly generating multivariate Gaussian random variable

I am generating a multivariate random variable from a Gaussian distribution with known mean and variance in the following two ways:

```python
for t in range(T):
    samples[t, ...] = torch.normal(mean=means, std=torch.sqrt(variances))                 # method 1
    samples[t, ...] = means + torch.mul(torch.sqrt(variances), torch.randn_like(means))   # method 2
```

I then train a neural network using `samples` as predictions against known `targets`. With no other changes, the losses I get with these two methods are very different. With method 1, the loss neither increases nor decreases, i.e., the network does not learn anything. With method 2, the loss becomes `nan`.

According to my understanding, the two methods in the code snippet above are equivalent. I am not sure why the training behaviour changes so drastically. Is it a side effect of the loss, or are the two methods not equivalent?

Hi Laya!

The two methods are not equivalent in that they do not handle
backpropagation equivalently.

Specifically, `torch.normal()` doesn’t really backpropagate (although
it acts like it does). You can argue – as I suppose pytorch does – that
`torch.normal()` shouldn’t backpropagate because there is no single,
clear-cut definition of what it should mean to backpropagate through a
sample from a probability distribution. (I will say that it’s arguably a bug
that the output of `torch.normal()` carries a `grad_fn` and produces
gradients (that are zero) upon backpropagation. To me it would make
more sense for it simply not to support backpropagation.)

In “method 2” you can backpropagate just fine because `means` and
`variances` only enter into standard differentiable functions – the output of
`torch.randn_like()` are just “constants” (with `requires_grad = False`)
as far as the computation involving `means` and `variances` is concerned.

Consider:

```
>>> import torch
>>> print (torch.__version__)
1.13.0
>>>
>>> _ = torch.manual_seed (2022)
>>>
>>> means = torch.randn (5, requires_grad = True)
>>> vars = torch.rand (5, requires_grad = True)
>>>
>>> sample1 = torch.normal (mean = means, std = torch.sqrt (vars))              # method 1
>>> sample2 = means + torch.mul (torch.sqrt (vars), torch.randn_like (means))   # method 2
>>>
>>> sample1                    # the output of normal() carries a grad_fn
tensor([-0.6208,  0.1995, -2.8783,  0.6203, -0.9839],
       grad_fn=<NormalBackward3>)
>>> sample1.sum().backward()   # normal() pretends to backpropagate ...
>>> means.grad                 # ... but the gradients it produces are zero
tensor([0., 0., 0., 0., 0.])
>>> vars.grad
tensor([0., 0., 0., 0., 0.])
>>>
>>> sample2.sum().backward()   # sample2 (which also has a grad_fn) backpropagates "properly"
>>> means.grad
tensor([1., 1., 1., 1., 1.])
>>> vars.grad
tensor([ 0.4397,  0.6463, -0.7891,  2.1393, -0.1648])
```

Whether this difference in backpropagation explains your loss blowing
up would depend on the details of the rest of your computation. But with
different gradients the training you get with the two methods could well
wander off in different directions with one of them becoming unstable while
the other doesn’t.

Best.

K. Frank

@KFrank Thank you very much for the explanation. That makes a lot of sense. I made a small change in the post above - I was actually getting no improvement in the loss with method 1 (which makes sense now, since the gradients are zero). I set `torch.autograd.set_detect_anomaly(True)` in my training routine and used method 2. The loss actually becomes `nan` rather than blowing up (as I had mentioned earlier in the post). I get the following error:
`RuntimeError: Function 'SqrtBackward0' returned nan values in its 0th output.`

I think the error comes from the generation of the samples with method 2, which takes the square root of the variances. The variable `variances` is produced by a `Linear` layer followed by a `ReLU` activation, so it is never negative. I am therefore not sure why the loss is becoming `nan`.

To avoid this situation, I removed the `torch.sqrt()` and just used:
`samples[t, ...] = means + torch.mul(variances, torch.randn_like(means))`

The network now effectively learns to predict standard deviations (although the variable is still named `variances`). When I do this, I get the following error:
`RuntimeError: Function 'LogBackward0' returned nan values in its 0th output.`

I am using a stochastic classification loss (eq 11 of this paper: link) that performs `torch.log` operations. I have the following implementation of the loss function:

```python
class HeteroscedasticLoss(torch.nn.Module):
    def __init__(self, T=100, batch_size=128):
        super().__init__()
        self.T = T
        self.batch_size = batch_size

    def forward(self, outputs, targets):
        outputs = torch.swapaxes(outputs, 1, 0)  # convert to [batch_size, T, ..., num_classes]
        probs = torch.zeros_like(outputs, device=targets.device)
        for i, x in enumerate(outputs):
            probs[i, ...] = torch.softmax(x, -1)
        probs = torch.mean(probs, 1)  # average the per-draw probabilities over the T samples
        loss = torch.nn.functional.nll_loss(torch.log(probs), targets)
        return loss
```

I also used a different implementation (eq 12 of the paper):

```python
class HeteroscedasticLoss(torch.nn.Module):
    def __init__(self, T=100, batch_size=128):
        super().__init__()
        self.T = T
        self.batch_size = batch_size

    def forward(self, outputs, targets):
        outputs = torch.swapaxes(outputs, 1, 0)  # convert to [batch_size, T, ..., num_classes]
        loss = 0
        for i, x in enumerate(outputs):
            lsm = torch.logsumexp(input=x, dim=-1)  # [T,]
            arg = torch.sum(torch.exp(x[..., targets[i].item()] - lsm)) / self.T  # scalar
            loss = loss + torch.log(arg)
        loss = loss / self.batch_size
        return loss
```

In this case, the loss goes to `-inf`. I have tried changing the optimizer and the learning rate, but I get the same error - only a little later with smaller learning rates. I am now not sure why this is happening or how to address it. Is there any way to handle this issue?

Hi Laya!

I can’t tell from what you’ve posted what the problem might be. You could
have a bug somewhere or your training hyperparameters might be too
“aggressive,” so your training might be becoming unstable.

(Consider starting with plain-vanilla `SGD` with a small learning rate, and
if you can get that to train stably, try adding momentum.)

However …

This may be the core cause of your problem. You are trying to train a
parameter (the variance, or perhaps standard deviation) that is naturally
positive. It is true that `ReLU` will clamp the parameter to be non-negative,
but something abrupt (and technically singular) is happening right at zero.

To me a very small variance is “atypical” and an even smaller variance
is even more “atypical,” but the output of `Linear` doesn’t know that getting
close to zero is “atypical” and `Linear` can easily output negative values.
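(This zero singularity is, for instance, where your earlier `SqrtBackward0` error can come from: the derivative of `torch.sqrt()` diverges as its argument goes to zero. A minimal sketch, not your actual model:)

```python
import torch

# minimal sketch: d/dv sqrt(v) = 1 / (2 * sqrt(v)) diverges as v -> 0,
# so a ReLU-clamped "variance" that lands at (or very near) zero
# produces a non-finite gradient
v = torch.tensor([0.0, 1.0e-8, 1.0], requires_grad=True)
torch.sqrt(v).sum().backward()
print(v.grad)   # tensor([inf, 5.0000e+03, 5.0000e-01])
```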

Consider passing the output of `Linear` through something like Softplus to
get your variance. This maps the entire `(-inf, inf)` range of `Linear` to
`(0.0, inf)` and does so in a way that seems to fit the “meaning” of a
variance in the following sense:

A very large variance is “atypical” and would be achieved by having a very
large positive output from `Linear`. A very small variance (that is, positive,
but very close to zero) would be achieved by having a very large negative
output from `Linear`. Nothing abrupt nor singular occurs when the output
of `Linear` happens to be zero. (It is true that you get singular behavior as
the variance approaches zero, but this happens when the output of `Linear`
approaches `-inf`, which is naturally a singular point.)

As long as your loss function penalizes small variances (skinny Gaussians),
even if only weakly, it is unlikely that optimization will push the output of
`Linear` to be a large enough negative value for the “singular” behavior of
a variance near zero to be a problem.

(If you are also training the mean of your Gaussian, using the output of a
`Linear` – without any further transformation – should be perfectly fine.)
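(As a concrete illustration – the feature size here is just a placeholder, not taken from your model – such a variance head might look like:)

```python
import torch

# illustrative head (feature size 16 is a placeholder): Softplus maps the
# unconstrained Linear output smoothly onto (0.0, inf), so the predicted
# variance can become small but never hits an abrupt zero the way ReLU does
variance_head = torch.nn.Sequential(
    torch.nn.Linear(16, 1),
    torch.nn.Softplus(),
)
features = torch.randn(4, 16)
variances = variance_head(features)
print((variances > 0).all())   # strictly positive for any input
```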

Best.

K. Frank

@KFrank I tried changing the optimizer, the learning rate, and the activation for the predicted variances. Nothing seems to work at the moment. I now get the error `RuntimeError: Function 'ExpBackward0' returned nan values in its 0th output.` I tried out a few checks and am still not able to figure out the problem. Thank you for your explanation and suggestions.