I multiply the loss by a mask in which some values are zero. I would expect the gradient at those positions to be zero as well, but I don't understand why I am getting -4 as the gradient at that position.

I'm not sure what the objective of the code was, but the problem is in your loss function:

loss = (A - B.expand_as(A)) + 0.2

Imagine you want to minimize this: the loss is linear in A and has no lower bound, so minimizing it corresponds to making the elements of A as negative as possible, irrespective of the values of B.

A loss value of zero therefore does not mean that the gradient is zero in this case: zero is not a minimum of this loss, since the loss can go below zero.
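As a rough illustration, you can check that the gradient of this linear loss is a constant and does not depend on the value of the loss at all. The shapes and values of A and B below are made up, since the original code isn't shown, and the names grad_A_linear and grad_B_linear are only for this sketch:

```python
import torch

# Hypothetical shapes and values, chosen only for illustration.
A = torch.tensor([[1.0, -2.0], [0.5, 3.0]], requires_grad=True)
B = torch.tensor([[1.0, 1.0]], requires_grad=True)  # one row, expanded to A's shape

loss = (A - B.expand_as(A)) + 0.2
loss.sum().backward()

# d(loss)/dA is 1 for every element, whatever the loss value is.
grad_A_linear = A.grad.clone()
# d(loss)/dB is -1 per expanded row, summed over the rows B was expanded to.
grad_B_linear = B.grad.clone()

print(grad_A_linear)
print(grad_B_linear)
```

Changing the entries of A or B in this sketch should leave both gradients unchanged, which is why a loss value of zero tells you nothing about the gradient here.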

You will see that if you use:

loss = (A - B.expand_as(A))**2 + 0.2

then the gradient will be zero wherever A equals the expanded B, which is where this loss reaches its minimum. You can also omit the 0.2, since a constant offset does not influence the gradients in any way.
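Here is a small sketch of the squared version, again with made-up values for A and B and a hypothetical helper grad_wrt_A that exists only for this example. It compares the gradient with and without the 0.2 offset:

```python
import torch

def grad_wrt_A(offset):
    # Hypothetical helper: gradient of the summed squared loss with respect to A.
    A = torch.tensor([[1.0, 4.0], [1.0, 0.0]], requires_grad=True)
    B = torch.tensor([[1.0, 2.0]])  # one row, expanded to A's shape
    loss = (A - B.expand_as(A)) ** 2 + offset
    loss.sum().backward()
    return A.grad

with_offset = grad_wrt_A(0.2)
without_offset = grad_wrt_A(0.0)

# The gradient is 2 * (A - B): zero in the first column, where A equals B.
print(with_offset)
print(torch.equal(with_offset, without_offset))
```

The gradient here is 2 * (A - B), so it vanishes only at the positions where A already matches B, and the constant offset drops out entirely.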