Bug or feature? NaNs influence other variables in backprop

I am trying to handle variable-length input by masking and padding with NaNs, so that any masking errors become visible immediately if they happen.

But backpropagation propagates the NaNs backward even though they are masked out.

Example:

import torch
from torch.autograd import Variable

vec = torch.cuda.DoubleTensor([1, 2, float('nan')])
mask = torch.cuda.ByteTensor([1, 1, 0])  # keep the first two elements, drop the nan
vec_var = Variable(vec)
scalar_var = Variable(torch.cuda.DoubleTensor([4]), requires_grad=True)
res_var = (scalar_var * vec_var)[mask].sum()
print(res_var)
res_var.backward()
print(scalar_var.grad)

Gives

Variable containing:
12
[torch.cuda.DoubleTensor of size 1 (GPU 0)]

Variable containing:
nan
[torch.cuda.DoubleTensor of size 1 (GPU 0)]

Why is that? In my view the NaN at the end is not part of any calculation, so it should not influence backpropagation at all.

Is it a bug or expected behavior?

Hi,

As far as I can tell, this is the expected behaviour.
The problem arises when you perform this operation: res_var = scalar_var * vec_var.
If the gradient w.r.t. res_var is g_res_var, then g_scalar_var = g_res_var * d(res_var)/d(scalar_var) = sum(g_res_var .* vec_var). In this case vec_var contains a nan, and since nan * 0 = nan, the result of the sum will be nan as well.
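To see this concretely, here is that chain-rule step done by hand. This is only an illustrative sketch on CPU tensors; the names vec and g_res are mine, not part of the autograd API:

import torch

vec = torch.DoubleTensor([1, 2, float('nan')])
# Gradient flowing back into scalar_var * vec_var: 1 for the kept
# elements, 0 for the masked-out one.
g_res = torch.DoubleTensor([1, 1, 0])
# g_scalar = sum(g_res .* vec); the last term is 0 * nan = nan,
# which poisons the whole sum.
print((g_res * vec).sum())  # nan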

I don’t understand the last bit.
Where is the 0 that gets multiplied by the NaN?

In your case, since you mask out the last element of res_var (the product), the gradient corresponding to it is 0.
But that value could be anything: multiplying it with nan results in nan anyway.
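This is just the standard IEEE-754 rule; plain Python floats behave the same way, no matter what the other factor is:

print(0.0 * float('nan'))    # nan
print(123.0 * float('nan'))  # nan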

OK,
my feeling is that in this specific case 0 * nan = nan is “philosophically” wrong, since a masking operation should be stronger than anything else.

I feel that I am going against an implementation decision, so I will just use a workaround, but I still do not think it is correct… The conclusion is that NaNs should not be used as padding.

If you use anything other than nan or inf as padding, you will get the behaviour you expect, because the padded values will be masked out the way you expect.
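For example, padding with zeros instead of NaNs gives the expected gradient. A minimal sketch of the same computation as above, on CPU tensors purely for illustration:

import torch
from torch.autograd import Variable

vec_var = Variable(torch.DoubleTensor([1, 2, 0]))  # 0 instead of nan as padding
mask = torch.ByteTensor([1, 1, 0])
scalar_var = Variable(torch.DoubleTensor([4]), requires_grad=True)
res_var = (scalar_var * vec_var)[mask].sum()
res_var.backward()
print(scalar_var.grad)  # 3 (= 1 + 2); the padded element contributes 0, not nan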