I would say that it’s expected, although in some cases, such as yours,
maybe not desirable.
The cause is that pytorch computes gradients during the backward
pass by applying the chain rule numerically.
Let’s look at the gradient of the slicing operation:
>>> import torch
>>> a = torch.ones(8, requires_grad=True)
>>> a
tensor([1., 1., 1., 1., 1., 1., 1., 1.], requires_grad=True)
>>> a[:4].sum().backward()
>>> a.grad
tensor([1., 1., 1., 1., 0., 0., 0., 0.])
That’s reasonable – the elements that were not included in the slice
have zero gradient.
We now backpropagate the gradient of the slicing operation through
the application of the Linear L. Simplifying things a bit, you
can think of L(x) as L.weight @ x. The gradient (more precisely, the
Jacobian) of this expression (with respect to
L.weight) is, in essence, x. Letting
g be the gradient of the slicing operation, the chain rule
gives us, more or less, x @ g.
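To see this end to end, here is a small sketch (the shapes, the nan
position, and the slice are illustrative, not taken from your code):
a Linear applied to an input containing a nan, with part of the
output sliced away before calling backward().

```python
import torch

# Illustrative setup: x has a nan in position 1, and we keep only
# the first output element, so row 1 of L.weight is "ignored".
x = torch.tensor([1.0, float("nan")])
L = torch.nn.Linear(2, 2, bias=False)

y = L(x)
y[:1].sum().backward()   # backpropagate through the slice

# The weight gradient is, in essence, the outer product of the slice
# gradient g = [1., 0.] with x.  The 0.0 in g still gets multiplied
# against the nan in x, so the nan shows up even in the "ignored" row.
print(L.weight.grad)
```

The column corresponding to the nan input is nan in both rows of
L.weight.grad, including the row whose output was sliced away.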
The problem is that by the time
g gets to this step in the backpropagation,
autograd no longer knows that some elements were ignored by the
slicing operation – all it knows is that some elements of
g are 0.0. (Perhaps those zeros were calculated numerically, e.g.,
2.0 - 6.0 / 3.0, rather than coming from an “ignore” operation.)
According to the highly desirable rules of floating-point arithmetic,
nan * 0.0 = nan. Autograd doesn’t know that it’s supposed to
ignore the nans in x; rather, it thinks it’s supposed to multiply
them by 0.0, and thus will get
nans as the results.
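The floating-point rule in question is easy to check directly in
plain Python, no autograd involved:

```python
import math

g_elem = 0.0              # an "ignored" gradient element from the slice
x_elem = float("nan")     # the nan that was supposed to be ignored

# The backward pass multiplies them numerically, so the zero does
# not suppress the nan:
product = g_elem * x_elem
print(product)            # nan
print(math.isnan(product))
```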
In a hypothetical world, floating-point numbers could have a special
ignore value, in addition to things like
nan and inf. In such a
world, the gradient of the slicing operation could have ignore as
the gradient of the elements not included in the slice. Then, perhaps,
autograd could say that
nan * ignore = 0.0 (or something), and
you would get the result you were hoping for. But we don’t live in
that hypothetical world.
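Purely as an illustration – the mul_ignore helper below is
hypothetical; nothing like it exists in IEEE arithmetic or in
autograd – the rule described above might look like this, with an
exact 0.0 gradient standing in for the ignore value:

```python
import math

def mul_ignore(grad: float, val: float) -> float:
    # Hypothetical multiply that treats an exact 0.0 gradient as
    # "ignore": the result is 0.0 even when val is nan.  Real IEEE
    # arithmetic (and hence real autograd) does NOT work this way.
    return 0.0 if grad == 0.0 else grad * val

print(mul_ignore(0.0, float("nan")))   # 0.0 under the hypothetical rule
print(0.0 * float("nan"))              # nan under the real rules
```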