Torch.logsumexp returning nan gradients when inputs are -inf


When I backpropagate through torch.logsumexp on a tensor whose elements are all -inf, the resulting gradients are nan, as in the code below:

>>> a = torch.nn.Parameter(torch.tensor([-float("inf"), -float("inf"), -float("inf")]))
>>> b = 2 + a
>>> torch.logsumexp(a, dim=0)
tensor(-inf, grad_fn=<LogsumexpBackward>)
>>> torch.logsumexp(a, dim=0).backward()
>>> a.grad
tensor([nan, nan, nan])

But if even one element of the tensor is not -inf, the gradients propagate properly:

>>> a = torch.nn.Parameter(torch.tensor([1.0, -float("inf"), -float("inf")]))
>>> b = 2 + a
>>> torch.logsumexp(a, dim=0)
tensor(1., grad_fn=<LogsumexpBackward>)
>>> torch.logsumexp(a, dim=0).backward()
>>> a.grad
tensor([1., 0., 0.])

How do I ensure that the gradients are zero even when all the inputs are -inf?

The problem is that at a point where the final result is -inf, the gradient of logsumexp (the softmax of its inputs) is 0/0, which evaluates to nan. So during backprop, nans appear.
In your second example, the gradient at the point 1. is finite and everything works fine.

So I would say this is expected behavior, no?
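A minimal sketch of where the nan comes from: the gradient of logsumexp is the softmax of its inputs, and the softmax of an all -inf vector is computed as exp(-inf - (-inf)), i.e. 0/0:

```python
import torch

# The gradient of logsumexp is the softmax of its inputs. With every input
# at -inf, softmax computes exp(-inf - (-inf)) = exp(nan) = nan.
a = torch.tensor([-float("inf")] * 3)
print(torch.softmax(a, dim=0))  # tensor([nan, nan, nan])
```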

But in larger graphical models like CTC or HMMs, this introduces nans into my backpropagation. What would be the best way to avoid this behavior? I want my model to keep training on the non-infinite parameters at the other locations. Consider this matrix:

>>> x = torch.tensor([[1.0, -float("inf")], [-float("inf"), -float("inf")]])
>>> x.shape
torch.Size([2, 2])
>>> x
tensor([[1., -inf],
        [-inf, -inf]])
>>> x = nn.Parameter(x)
>>> y = torch.logsumexp(x, dim=1)
>>> y
tensor([1., -inf], grad_fn=<LogsumexpBackward>)
>>> z = torch.sum(y, dim=0)
>>> z.backward()
>>> x.grad
tensor([[1., 0.],
        [nan, nan]])

Take this example: how can I make changes so that I get zero gradients instead of nan in x.grad?

I would argue that you need to make sure you never have a whole row of -inf values. At such a point, your logsumexp function is not properly differentiable, so its gradients will be nan.
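One way to follow that advice without changing the model (a sketch, not an official pattern; the constant NEG is an assumption you would tune to your numeric range) is to clamp -inf entries to a large finite negative value before logsumexp. clamp's backward passes zero gradient to the clamped entries, so they stay untouched:

```python
import torch

NEG = -1e4  # assumed stand-in for -inf; choose it for your value range
x = torch.nn.Parameter(torch.tensor([[1.0, -float("inf")],
                                     [-float("inf"), -float("inf")]]))
# Clamping makes every row finite, so logsumexp stays differentiable;
# entries below NEG receive zero gradient from clamp's backward.
y = torch.logsumexp(torch.clamp(x, min=NEG), dim=1)
y.sum().backward()
print(x.grad)  # tensor([[1., 0.], [0., 0.]])
```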

But in case I don’t want the gradient update to touch the places where there are nans, what could be a possible workaround?

It depends on your optimizer.
If you don’t have momentum/accumulated terms, then you can simply set these gradients to 0 and your optimizer won’t change the values.
If you have a fancy optimizer that will update the weights even for a 0 gradient, the simplest solution might be to save the original values of the weights before performing the step and then restore them after the optimizer step.
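A minimal sketch of the first option, assuming plain SGD without momentum: zero out the nan gradients after backward() and before the optimizer step, so those entries receive no update:

```python
import torch

x = torch.nn.Parameter(torch.tensor([[1.0, -float("inf")],
                                     [-float("inf"), -float("inf")]]))
opt = torch.optim.SGD([x], lr=0.1)

torch.logsumexp(x, dim=1).sum().backward()  # x.grad: [[1., 0.], [nan, nan]]

with torch.no_grad():
    x.grad[torch.isnan(x.grad)] = 0.0  # replace nan entries with zero

opt.step()  # only x[0, 0] changes; the -inf entries get a zero update
print(x.data)  # first entry is now 0.9, the -inf entries are unchanged
```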

I think my use case is somewhat similar to this issue on GitHub:

Do we have any workaround for this?

I think what I mentioned in my previous message is the right approach.