Change the gradient inplace problem in loss.backward()

Besides what @AlphaBetaGamma96 explained, you might also be running into this issue, where stale forward activations are used to compute gradients together with parameters that were already updated inplace. Could you check if your use case is similar or the same?
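
A minimal sketch of that failure mode (using a hypothetical `nn.Linear` model and SGD, not your actual code) could look like this, assuming the `optimizer.step()` is called between two backward passes on the same graph:

```python
import torch
import torch.nn as nn

# Hypothetical minimal setup, not your actual training code.
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
criterion = nn.MSELoss()

x = torch.randn(1, 1)
target = torch.randn(1, 1)

# The forward pass saves intermediate activations and the current
# parameter versions for the backward pass.
output = model(x)
loss = criterion(output, target)

loss.backward(retain_graph=True)
optimizer.step()  # updates the parameters inplace

# The second backward reuses the stale forward activations, but the
# parameters were already updated inplace, so autograd raises:
# RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation
loss.backward()
```

If this matches your use case, moving the `optimizer.step()` after the last `backward()` call on that graph, or recomputing the forward pass before the second backward, should avoid the error.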