Gradient difference between autograd and manual backprop computation


I implemented a custom RNN in two ways:

  1. Implement the forward pass, and rely on autograd for backprop
  2. Use the same forward pass, but write explicit, manually derived backprop

The code for 2) can be found below; the forward implementations for 1) and 2) are identical:

When I trained both versions on a language modeling task with the same seeds, however, the losses for 1) and 2) were different. Furthermore, 1) converges to a lower loss than 2). I also confirmed that running 1) repeatedly produces identical losses, so the seeding does make the runs deterministic.

I checked the hidden states returned by the forward passes of 1) and 2); they are identical, so I know the forward pass works the same way in both versions.

When I checked the gradients after a single loss.backward(), however, there is a small difference between the weight gradient tensors: the maximum absolute difference across all elements is 4.7683716e-07. My best guess is that this small difference, accumulated over many backward passes, causes the difference in loss convergence.
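For reference, the comparison I ran looks roughly like this (the two tensors here are placeholders standing in for the weight gradients of versions 1) and 2); in practice they come from `model.weight.grad`):

```python
import torch

torch.manual_seed(0)

# Placeholder tensors standing in for the weight gradients from the
# two versions; the small additive term simulates float32-level noise.
grad_autograd = torch.randn(10, 10)
grad_manual = grad_autograd + 1e-7 * torch.randn(10, 10)

max_abs_diff = (grad_autograd - grad_manual).abs().max().item()
print(f"max abs diff: {max_abs_diff:.3e}")  # on the order of float32 eps
```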

However, when I checked the gradients of my custom backpropagation with torch.autograd.gradcheck, it reports that they are OK.
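As a sanity check on what gradcheck actually verifies, here is a minimal sketch, using a plain `nn.Linear` as a stand-in for the custom RNN cell:

```python
import torch

# gradcheck compares the analytical gradient against a finite-difference
# estimate within rtol/atol, so it expects double-precision inputs;
# the Linear layer here is just a stand-in for the custom RNN cell.
torch.manual_seed(0)
lin = torch.nn.Linear(3, 2).double()
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)

ok = torch.autograd.gradcheck(lin, (x,))
print(ok)  # True
```

Because the comparison is done against a numerical estimate with finite tolerances, a manual backward that differs from autograd only at the ~1e-7 level will still pass.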

The ONLY other difference between the two versions is that my custom backward returns None for the input gradient, since I don't think it's needed for my purpose anyway.

What do you think is the root cause of the difference in loss convergence?
If you think it's related to the small gradient difference, how can I fix it?


The difference seems to come from limited floating point precision.
Note that gradcheck also uses tolerance values (rtol/atol) when comparing the gradients, so it won't flag differences at this scale.
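To see why two mathematically equivalent computations can disagree at the ~1e-7 level in float32: floating point addition is not associative, so autograd and a manually derived backward that accumulate terms in a different order will round differently. A minimal illustration in double precision:

```python
# Floating point addition is not associative: two algebraically
# identical expressions can round differently depending on grouping.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False
print(a - b)   # a tiny nonzero difference (~1e-16 in float64)
```

In float32 the corresponding rounding differences sit around machine epsilon (~1.19e-7), which matches the magnitude you are seeing.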

How reproducible is the convergence issue using these two approaches, i.e. are you able to get a lower loss using the first approach every time you run the code?