Torch.logsumexp returning nan gradients when inputs are -inf


When I backpropagate through torch.logsumexp on a tensor whose elements are all -inf, the resulting gradients are nan, as in the code below:

>>> a = torch.nn.Parameter(torch.tensor([-float("inf"), -float("inf"), -float("inf")]))
>>> b = 2 + a
>>> torch.logsumexp(a, dim=0)
tensor(-inf, grad_fn=<LogsumexpBackward>)
>>> torch.logsumexp(a, dim=0).backward()
>>> a.grad
tensor([nan, nan, nan])

But if even one element of the tensor is not -inf, the gradients propagate properly:

>>> a = torch.nn.Parameter(torch.tensor([1.0, -float("inf"), -float("inf")]))
>>> b = 2 + a
>>> torch.logsumexp(a, dim=0)
tensor(1., grad_fn=<LogsumexpBackward>)
>>> torch.logsumexp(a, dim=0).backward()
>>> a.grad
tensor([1., 0., 0.])

How do I ensure that the gradients are zero even when all the inputs are -inf?

The problem is that at a point where the final result is -inf, the gradient of logsumexp (the softmax of its inputs) is 0/0, which evaluates to nan. So during backprop, nans appear.
In your second example, the gradient at the point 1. is finite and everything works fine.

So I would say this is expected behavior, no?
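A minimal sketch of where the nan comes from: the gradient of logsumexp is the softmax of its inputs, and the softmax of an all -inf vector is computed as exp(-inf - (-inf)), i.e. 0/0:

```python
import torch

# The gradient of logsumexp is the softmax of its inputs. With every input
# at -inf, softmax computes exp(-inf - (-inf)) = exp(nan) = nan.
a = torch.tensor([-float("inf")] * 3)
print(torch.softmax(a, dim=0))  # tensor([nan, nan, nan])
```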

But in larger graphical models like CTC or HMMs, this introduces nans into my backpropagation. What would be the best way to avoid this behavior? I want my model to keep training on the non-infinite parameters at the other locations. Consider this matrix:

>>> x = torch.tensor([[1.0, -float("inf")], [-float("inf"), -float("inf")]])
>>> x.shape
torch.Size([2, 2])
>>> x
tensor([[1., -inf],
        [-inf, -inf]])
>>> x = nn.Parameter(x)
>>> y = torch.logsumexp(x, dim=1)
>>> y
tensor([1., -inf], grad_fn=<LogsumexpBackward>)
>>> z = torch.sum(y, dim=0)
>>> z.backward()
>>> x.grad
tensor([[1., 0.],
        [nan, nan]])

Take this example: how can I make changes so that I get zero gradients instead of nan in x.grad?

I would argue that you need to make sure you never have a whole row of -inf values. At such a point, your logsumexp function is not properly differentiable, so its gradients will be nan.
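One way to follow that advice without changing the model (a sketch, not an official pattern; the constant NEG is an assumption you would tune to your numeric range) is to clamp -inf entries to a large finite negative value before logsumexp. clamp's backward passes zero gradient to the clamped entries, so they stay untouched:

```python
import torch

NEG = -1e4  # assumed stand-in for -inf; choose it for your value range
x = torch.nn.Parameter(torch.tensor([[1.0, -float("inf")],
                                     [-float("inf"), -float("inf")]]))
# Clamping makes every row finite, so logsumexp stays differentiable;
# entries below NEG receive zero gradient from clamp's backward.
y = torch.logsumexp(torch.clamp(x, min=NEG), dim=1)
y.sum().backward()
print(x.grad)  # tensor([[1., 0.], [0., 0.]])
```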

But in case I don’t want the gradient update to touch the places where there are nans, what could be a possible workaround?

It depends on your optimizer.
If you don’t have momentum/accumulated terms, then you can simply set these gradients to 0 and your optimizer won’t change the values.
If you have a fancy optimizer that will update the weights even for a 0 gradient, the simplest solution might be to save the original values of the weights before performing the step and then restore them after the optimizer step.
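A minimal sketch of the first option, assuming plain SGD without momentum: zero out the nan gradients after backward() and before the optimizer step, so those entries receive no update:

```python
import torch

x = torch.nn.Parameter(torch.tensor([[1.0, -float("inf")],
                                     [-float("inf"), -float("inf")]]))
opt = torch.optim.SGD([x], lr=0.1)

torch.logsumexp(x, dim=1).sum().backward()  # x.grad: [[1., 0.], [nan, nan]]

with torch.no_grad():
    x.grad[torch.isnan(x.grad)] = 0.0  # replace nan entries with zero

opt.step()  # only x[0, 0] changes; the -inf entries get a zero update
print(x.data)  # first entry is now 0.9, the -inf entries are unchanged
```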

I think my use case is somewhat similar to this issue on GitHub:

Do we have any workaround for this?

I think what I mentioned in my previous message is the right approach.