Different behavior of x * h and x.mul_(h)

In my code, I have one operation where I need to multiply x (an intermediate activation) with a scalar parameter h, so in my forward pass I have some code like this:
    def forward(x):
        x = x * h

Alternatively, I tried with
    def forward(x):
        x.mul_(h)

So when I tested for one iteration, I found that both of them produce the correct output and the correct gradient. But when I train with them, I found that the first one is always better than the second one. Is there any difference between the two operations? (I used fp16, but I found the same behavior for fp32.)

There is a small difference in how the two methods work.

x * h does an out-of-place multiplication, i.e. it creates a new tensor that is the product of x and h.
x.mul_(h) is in-place, i.e. it overwrites the original x tensor with the product.
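
A minimal sketch of the difference, using throw-away tensors (the names here are just illustrative, not the code from the post):

    import torch

    x = torch.ones(3)
    h = torch.tensor(2.0)

    y = x * h                            # out-of-place: allocates a new tensor
    print(y.data_ptr() == x.data_ptr())  # False -> y lives in its own memory
    print(x)                             # tensor([1., 1., 1.]) -> x is untouched

    x.mul_(h)                            # in-place: overwrites x's own storage
    print(x)                             # tensor([2., 2., 2.])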

So I think that x.mul_(h), since it overwrites x, removes part of the gradient history of the model.
So the gradient calculation is not reliable in the second method.
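
To make the autograd consequence concrete, here is a small self-contained sketch (sigmoid, w, and h are stand-ins I picked, not the code from the post). When the in-place multiply overwrites a tensor that autograd saved for the backward pass, the version-counter check raises an error; whether that happens, or the in-place version runs without complaint, depends on which tensors the surrounding ops save for backward, which is why the two variants can behave differently during training.

    import torch

    # Hypothetical stand-ins: h is the learnable scalar from the question,
    # w produces the intermediate activation x.
    h = torch.tensor(2.0, requires_grad=True)
    w = torch.randn(3, requires_grad=True)

    # Out-of-place: the sigmoid output that autograd saved stays intact.
    x = torch.sigmoid(w)
    (x * h).sum().backward()
    print(h.grad)        # gradient w.r.t. the scalar, as expected

    h.grad = None
    w.grad = None

    # In-place: mul_ overwrites the very tensor that sigmoid (and mul_ itself)
    # saved for backward, so autograd's version check complains.
    x = torch.sigmoid(w)
    x.mul_(h)
    try:
        x.sum().backward()
    except RuntimeError as e:
        print(e)         # "... has been modified by an inplace operation"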
