Correctly generating a multivariate Gaussian random variable

I am generating a multivariate random variable from a Gaussian distribution with known mean and variance in the following two ways:

for t in range(T):
    samples[t, ...] = torch.normal(mean=means, std=torch.sqrt(variances))  # method 1
    samples[t, ...] = means + torch.mul(torch.sqrt(variances), torch.randn_like(means))  # method 2

I then train a neural network using samples as predictions against known targets. The losses that I get with these two methods, making no other changes, are very different. If I use method 1, the loss neither increases nor decreases, i.e., the network does not learn anything. If I use method 2, the loss becomes nan.

According to my understanding, the two methods in the code snippet above are equivalent. I am not sure why the behaviour of the training changes so drastically. Is it a side effect of the loss, or are the two methods not equivalent?

Hi Laya!

The two methods are not equivalent in that they do not handle
backpropagation equivalently.

Specifically, torch.normal() doesn’t really backpropagate (although
it acts like it does). You can argue – as I suppose pytorch does – that
torch.normal() shouldn’t backpropagate because there is no single,
clear-cut definition of what it should mean to backpropagate through a
sample from a probability distribution. (I will say that it’s arguably a bug
that the output of torch.normal() carries a grad_fn and produces
gradients (that are zero) upon backpropagation. To me it would make
more sense for it simply not to support backpropagation.)

In “method 2” you can backpropagate just fine because means and
variances only enter into standard differentiable functions – the output of
torch.randn_like() is just a “constant” (with requires_grad = False)
as far as the computation involving means and variances is concerned.

Consider:

>>> import torch
>>> print (torch.__version__)
1.13.0
>>>
>>> _ = torch.manual_seed (2022)
>>>
>>> means = torch.randn (5, requires_grad = True)
>>> vars = torch.rand (5, requires_grad = True)
>>>
>>> sample1 = torch.normal (mean = means, std = torch.sqrt (vars))              # method 1
>>> sample2 = means + torch.mul (torch.sqrt (vars), torch.randn_like (means))   # method 2
>>>
>>> sample1                    # has grad_fn
tensor([-0.6208,  0.1995, -2.8783,  0.6203, -0.9839],
       grad_fn=<NormalBackward3>)
>>> sample1.sum().backward()   # normal() pretends to backpropagate ...
>>> means.grad                 # but with zero grad
tensor([0., 0., 0., 0., 0.])
>>> vars.grad
tensor([0., 0., 0., 0., 0.])
>>>
>>> sample2                    # also has grad_fn
tensor([ 0.8952,  1.2811, -0.9495,  3.2830, -0.4096], grad_fn=<AddBackward0>)
>>> sample2.sum().backward()   # and backpropagates "properly"
>>> means.grad                 # meaningful grad
tensor([1., 1., 1., 1., 1.])
>>> vars.grad
tensor([ 0.4397,  0.6463, -0.7891,  2.1393, -0.1648])

Whether this difference in backpropagation explains your loss blowing
up would depend on the details of the rest of your computation. But with
different gradients the training you get with the two methods could well
wander off in different directions with one of them becoming unstable while
the other doesn’t.

Best.

K. Frank

@KFrank Thank you very much for the explanation. That makes a lot of sense. I made a small change in the post - I was actually getting no improvement in the loss with method 1 (which makes sense now, since the grad is 0). I then set torch.autograd.set_detect_anomaly(True) in my training routine and used method 2. The loss actually becomes nan rather than blowing up (as the post originally said). I get the following error:
RuntimeError: Function 'SqrtBackward0' returned nan values in its 0th output.

I think the error comes from the generation of the samples with method 2, which takes the square root of the variances. The variable variances is produced by a Linear layer followed by a ReLU activation, so it is never negative. I am not sure why the loss is becoming nan.

To avoid this situation, I removed the torch.sqrt() and just used:
samples[t, ...] = means + torch.mul(variances, torch.randn_like(means))

The network now learns to predict the standard deviations (although the variable is named variances). When I do this, I get the following error:
RuntimeError: Function 'LogBackward0' returned nan values in its 0th output.

I am using a stochastic classification loss (eq 11 of this paper: link) that performs torch.log operations. I have the following implementation of the loss function:

class HeteroscedasticLoss(torch.nn.Module):
    def __init__(self, T=100, batch_size=128):
        super().__init__()
        self.T = T
        self.batch_size = batch_size

    def forward(self, outputs, targets):
        outputs = torch.swapaxes(outputs, 1, 0)  # convert to [batch_size, T, ..., num_classes]
        probs = torch.zeros_like(outputs, device=targets.device)
        i = 0
        for x in outputs:
            probs[i, ...] = torch.softmax(x, -1)
            i = i + 1
        probs = torch.mean(probs, 1)
        loss = torch.nn.functional.nll_loss(torch.log(probs), targets)
        return loss

I also used a different implementation (eq 12 of the paper):

class HeteroscedasticLoss(torch.nn.Module):
    def __init__(self, T=100, batch_size=128):
        super().__init__()
        self.T = T
        self.batch_size = batch_size

    def forward(self, outputs, targets):
        outputs = torch.swapaxes(outputs, 1, 0)  # convert to [batch_size, T, ..., num_classes]
        loss = 0
        i = 0
        for x in outputs:
            lsm = torch.logsumexp(input=x, dim=-1)  # [T,]
            arg = torch.sum(torch.exp(x[..., targets[i].item()] - lsm)) / self.T  # scalar
            loss = loss + torch.log(arg)
            i = i + 1
        loss = loss / self.batch_size
        return loss

In this case, the loss goes to -inf. I have changed the optimizer and learning rate, but I get the same error, only a little later with smaller learning rates. I am not sure why this is happening or how to address it. Is there any way to handle this issue?

Hi Laya!

I can’t tell from what you’ve posted what the problem might be. You could
have a bug somewhere or your training hyperparameters might be too
“aggressive,” so your training might be becoming unstable.

(Consider starting with plain-vanilla SGD with a small learning rate, and
if you can get that to train stably, try adding momentum.)
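
For example (model here just stands in for your network):

import torch

model = torch.nn.Linear (10, 2)                                  # stand-in for your actual network

# plain-vanilla SGD with a small learning rate
optimizer = torch.optim.SGD (model.parameters(), lr = 1.0e-4)

# if that trains stably, try adding momentum, e.g.:
# optimizer = torch.optim.SGD (model.parameters(), lr = 1.0e-4, momentum = 0.9)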

However …

This may be the core cause of your problem. You are trying to train a
parameter (the variance, or perhaps standard deviation) that is naturally
positive. It is true that ReLU will clamp the parameter to be non-negative,
but something abrupt (and technically singular) is happening right at zero.
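
You can see the singularity directly: the derivative of sqrt (v) is 1 / (2 * sqrt (v)),
which diverges as v approaches zero, so an exact zero coming out of ReLU gives an inf
gradient (and 0 / 0 = nan once the incoming gradient is itself zero, which is what
anomaly detection reports as SqrtBackward0 returning nan):

>>> import torch
>>> vars = torch.tensor ([0.0, 0.01, 1.0], requires_grad = True)   # an exact zero, as ReLU can easily produce
>>> torch.sqrt (vars).sum().backward()
>>> vars.grad                                                       # 1 / (2 * sqrt (v)) -- inf at v = 0
tensor([   inf, 5.0000, 0.5000])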

To me a very small variance is “atypical” and an even smaller variance
is even more “atypical,” but the output of Linear doesn’t know that getting
close to zero is “atypical” and Linear can easily output negative values.

Consider passing the output of Linear through something like Softplus to
get your variance. This maps the entire (-inf, inf) range of Linear to
(0.0, inf) and does so in a way that seems to fit the “meaning” of a
variance in the following sense:

A very large variance is “atypical” and would be achieved by having a very
large positive output from Linear. A very small variance (that is, positive,
but very close to zero) would be achieved by having a very large negative
output from Linear. Nothing abrupt nor singular occurs when the output
of Linear happens to be zero. (It is true that you get singular behavior as
the variance approaches zero, but this happens when the output of Linear
approaches -inf, which is naturally a singular point.)

As long as your loss function penalizes small variances (skinny Gaussians),
even if only weakly, it is unlikely that optimization will push the output of
Linear to be a large enough negative value for the “singular” behavior of
a variance near zero to be a problem.

(If you are also training the mean of your Gaussian, using the output of a
Linear – without any further transformation – should be perfectly fine.)

Best.

K. Frank

@KFrank I tried changing the optimizer, the learning rate, and the activation used to predict the variances. Nothing seems to work at the moment. I now get the error RuntimeError: Function 'ExpBackward0' returned nan values in its 0th output. I tried out a few checks and am still not able to figure out the problem. Thank you for your explanation and suggestions.