As far as I can tell, this is the expected behaviour.
The problem arises when you perform this operation: res_var = scalar_var*vec_var.
If the gradient wrt res_var is g_res_var, then the formula for g_scalar_var is g_res_var * d(res_var)/d(scalar_var) = sum(g_res_var .* vec_var). In this case, vec_var contains a nan, and since nan*0 = nan, the corresponding term of the sum is nan and so the whole sum is nan as well.
In your case, since you mask out the last element of res_var, the gradient corresponding to it is going to be 0.
But it could be anything: multiplying it by nan will result in a nan anyway.
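The arithmetic above can be checked by hand in plain Python, without any autograd library. This is only a sketch: the values in vec_var and g_res_var are made up for illustration, and grad_scalar is a hypothetical helper that applies the formula sum(g_res_var .* vec_var) directly rather than what the framework actually runs internally.

```python
import math


def grad_scalar(g_res_var, vec_var):
    """Hypothetical helper: gradient wrt scalar_var for res_var = scalar_var * vec_var."""
    # Elementwise product of the incoming gradient with vec_var, then summed.
    return sum(g * v for g, v in zip(g_res_var, vec_var))


# Illustrative values (assumed): the last element of vec_var is nan padding.
vec_var = [1.0, 2.0, float("nan")]
# The mask zeroes the gradient that flows back to the last element of res_var.
g_res_var = [1.0, 1.0, 0.0]

g_scalar_var = grad_scalar(g_res_var, vec_var)

print(math.isnan(0.0 * float("nan")))  # True: a zero gradient does not cancel nan
print(math.isnan(g_scalar_var))        # True: one nan term poisons the whole sum
```

The mask does its job on the incoming gradient (it is exactly 0), but the product with the nan entry of vec_var is still nan under IEEE 754 arithmetic.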
ok,
My feeling is that in this specific case 0*nan = nan is “philosophically” wrong, since a masking operation should be stronger than anything else.
I feel that I am going against some implementation decision, so I will just use a workaround, but I still do not think that it is correct. The conclusion is that nans should not be used as padding.
If you use anything other than nan or inf as padding, you will get the behaviour that you expect, because those values will be masked out: any finite value multiplied by a 0 gradient gives 0.
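A small plain-Python comparison of padding values, as a sketch of the point above. The lists and the grad_scalar helper are hypothetical and only mimic the formula sum(g_res_var .* vec_var); they are not taken from the original code.

```python
import math


def grad_scalar(g_res_var, vec_var):
    """Hypothetical helper: gradient wrt scalar_var for res_var = scalar_var * vec_var."""
    return sum(g * v for g, v in zip(g_res_var, vec_var))


# Gradient wrt res_var after masking out the last (padded) element.
g_res_var = [1.0, 1.0, 0.0]

# Same data, different padding values in the last slot (illustrative).
zero_padded = [1.0, 2.0, 0.0]
large_padded = [1.0, 2.0, 1e30]
nan_padded = [1.0, 2.0, float("nan")]
inf_padded = [1.0, 2.0, float("inf")]

print(grad_scalar(g_res_var, zero_padded))               # 3.0
print(grad_scalar(g_res_var, large_padded))              # 3.0
print(math.isnan(grad_scalar(g_res_var, nan_padded)))    # True
print(math.isnan(grad_scalar(g_res_var, inf_padded)))    # True, since 0*inf is nan
```

Any finite padding value, even a very large one, contributes exactly 0 once multiplied by the masked gradient, so only nan and inf leak through the mask.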