This seems related to something I noted a while ago: Bug or feature? NaNs influence other variables in backprop
In that case, a multiplication by zero inside PyTorch's code computed 0 * NaN and got NaN, which is the IEEE 754 behavior. My workaround has been to substitute zeros for the NaNs before they reach the loss, as in the sketch below.
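For reference, here is a minimal sketch of that kind of workaround (the tensor values are just illustrative): masking NaNs out with `torch.where` before the loss keeps a zero, rather than a NaN, on the backward path.

```python
import torch

x = torch.tensor([1.0, float('nan'), 3.0], requires_grad=True)

# 0 * NaN is NaN under IEEE 754, so a single NaN in the graph
# can leak into every gradient it touches:
print(0.0 * float('nan'))  # nan

# Workaround: replace NaN entries with zeros *before* the loss,
# so the backward pass never multiplies a gradient by NaN.
x_clean = torch.where(torch.isnan(x), torch.zeros_like(x), x)

loss = (x_clean ** 2).sum()
loss.backward()
print(x.grad)  # tensor([2., 0., 6.]) -- the NaN slot gets grad 0, not NaN
```

In more recent PyTorch versions `torch.nan_to_num(x)` can do the same replacement in the forward pass, though I haven't checked whether its gradient behaves identically here.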
Anyway, even if this is expected behavior, I don't think the PyTorch implementation is the most reasonable one.