In my code, I have an operation where I need to multiply x (an intermediate activation) by a scalar parameter h, so in my forward pass I have some code like this:
def forward(self, x):
    x = x * self.h  # out-of-place: writes the product into a new tensor
    return x
Alternatively, I tried this in-place version:
def forward(self, x):
    x.mul_(self.h)  # in-place: overwrites x's own storage with the product
    return x
When I tested for one iteration, I found that both of them produce the correct output and the correct gradient. But when I train with them, the first one is always better than the second one. Is there any difference between the two operations? (I used fp16, but I observed the same behavior with fp32.)
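For reference, here is a minimal sketch of the kind of one-step comparison I ran. The linear layers, shapes, and the fixed float value of h are illustrative, not my actual model (in my real code h is a parameter, but it is treated as a plain Python constant here):

import torch
import torch.nn as nn

h = 0.5  # illustrative constant; my real h is a scalar parameter

torch.manual_seed(0)
lin_a = nn.Linear(4, 4)
lin_b = nn.Linear(4, 4)
lin_b.load_state_dict(lin_a.state_dict())  # give both variants identical weights
inp = torch.randn(8, 4)

# Variant 1: out-of-place scaling
out_a = lin_a(inp) * h
out_a.sum().backward()

# Variant 2: in-place scaling
out_b = lin_b(inp)
out_b.mul_(h)
out_b.sum().backward()

print(torch.equal(out_a, out_b))                           # outputs match
print(torch.equal(lin_a.weight.grad, lin_b.weight.grad))   # gradients match

In this simplified setup the two variants match exactly for a single step, which is why I am confused that they diverge over a full training run.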