Gradients computed by PyTorch code are not the same as TensorFlow's

I have converted a TensorFlow model into a PyTorch model. The problem is that my PyTorch model does not perform as well as the original TensorFlow model. So I did some inspection: I fixed both models' weights, biases, and input batches to be identical.
After the same number of epochs, the loss of the TensorFlow model was very low (close to zero), but my PyTorch model plateaued near 0.4. So I started inspecting the loss and noticed that PyTorch's loss starts to deviate from the TensorFlow loss after some epochs. That got me thinking that maybe the weight update at each epoch is not the same as in TensorFlow, because every intermediate output is identical at the first batch, and over the next few batches the intermediate outputs start to deviate very slowly. Then I inspected the gradients and noticed that they are not the same from the very first batch.

Then I ran both models for just the first batch. Both models gave the same loss, but when I calculated the gradients, the gradients of the two models were not the same, which is strange.

My question is: what might cause this discrepancy? Can anybody give any clue?

Here is a short, reproducible example for a single batch:

Important things to notice:

loss = -wd_loss + 10 * gradient_penalty

Notice in the output of both implementations that wd_loss and gradient_penalty are the same, but the gradients of the weights are different.

Please run it and let me know if there is any problem.
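Since the post's snippet isn't reproduced above, here is a minimal PyTorch sketch of a loss with this shape (-wd_loss + 10 * gradient_penalty, WGAN-GP style). The critic network and the random source/target data are placeholders, not the actual model from the post:

```python
import torch

torch.manual_seed(0)

# Placeholder critic and data (NOT the original model from the post).
critic = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)
source = torch.randn(16, 4)
target = torch.randn(16, 4)

# Wasserstein-distance estimate between the two batches.
wd_loss = critic(source).mean() - critic(target).mean()

# Gradient penalty: push the critic's gradient norm on interpolated
# inputs toward 1.
alpha = torch.rand(16, 1)
interp = (alpha * source + (1 - alpha) * target).requires_grad_(True)
grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
gradient_penalty = ((grads.norm(2, dim=1) - 1) ** 2).mean()

loss = -wd_loss + 10 * gradient_penalty
loss.backward()  # populates .grad on the critic's parameters
```

Comparing the values of `wd_loss`, `gradient_penalty`, and then each parameter's `.grad` at this point is exactly the kind of check described above.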

This is the original paper:

Original code:



Could it be because of the way you configure your optimizer? In particular, weight decay in PyTorch is not part of the .grad field.
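To illustrate: in PyTorch, the weight_decay term is applied inside optimizer.step() and never shows up in .grad. A minimal sketch with a single scalar parameter:

```python
import torch

p = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.SGD([p], lr=0.1, weight_decay=0.5)

loss = (p * 2.0).sum()  # dloss/dp = 2
loss.backward()
print(p.grad)  # tensor([2.]) -- no weight-decay contribution here

opt.step()     # update uses grad + weight_decay * p = 2 + 0.5 * 1 = 2.5
print(p.data)  # 1 - 0.1 * 2.5 = 0.75
```

So if the TensorFlow side folds its regularization into the reported gradients, the two frameworks' gradients will differ even when the effective update is the same.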

I used the same settings without any weight decay.
But isn't the gradient calculation independent of the optimizer?
The problem appears before the optimizer is even used.


I don’t know how TensorFlow does this, but weight decay can be considered part of the gradients or not. So comparing the gradients alone won’t give you an accurate picture of the step that will be taken.

Are there any methods to overcome this issue? Like changing the optimizer, the learning rate, or something else?
I am stuck on a similar problem.

Yes, you “just” need to implement the same thing.
In practice: if you can write down exactly on paper what the model, loss, regularization, etc. are, then you can make sure you apply them all the same way, and measure the difference only after all of them have been applied.
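For example, two copies of the same model can report identical .grad values and still end up with different weights, because regularization is folded in at step time. A sketch with a hypothetical linear model, where only one optimizer uses weight decay:

```python
import torch

torch.manual_seed(0)

model_a = torch.nn.Linear(3, 1)
model_b = torch.nn.Linear(3, 1)
model_b.load_state_dict(model_a.state_dict())  # identical weights

x = torch.randn(8, 3)
for m in (model_a, model_b):
    m.zero_grad()
    m(x).pow(2).mean().backward()

# .grad is identical at this point; the difference appears at step time.
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1, weight_decay=0.01)
opt_a.step()
opt_b.step()

# The updated weights now differ, even though the raw gradients matched.
print((model_a.weight - model_b.weight).abs().max())
```

This is why the comparison should be made on the fully applied update, not on .grad in isolation.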

Also note that due to floating-point imprecision, it is expected that you will get very small differences between the two frameworks (or even across versions of a single framework). These differences will be amplified by gradient descent, leading to different models.
If your model is stable and converges properly for different random initializations (with different random seeds), it will still converge to a similar solution.
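A small illustration of why exact equality is the wrong test: merely reordering a float32 sum can change the last bits of the result, so cross-framework comparisons should use a tolerance (e.g. torch.allclose) rather than bitwise equality:

```python
import torch

torch.manual_seed(0)
x = torch.rand(10000, dtype=torch.float32)

s1 = x.sum()          # one summation order
s2 = x.flip(0).sum()  # same values, reversed order

# Tolerance-based comparison passes; bitwise equality is not guaranteed.
print(torch.allclose(s1, s2))
print(bool(s1 == s2))
```

Different frameworks order and fuse operations differently, so the same effect appears even when the math on paper is identical.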