I have converted a TensorFlow model into a PyTorch model. The problem is that my PyTorch model does not perform as well as the original TensorFlow model, so I did some inspection. I fixed both models' weights, biases, and input batches to the same values.

After the same number of epochs, the loss of the TensorFlow model was very low (close to zero), but my PyTorch model plateaued at around 0.4. So I started inspecting the loss and noticed that the PyTorch loss starts to deviate from the TensorFlow loss after some epochs. That got me thinking that maybe the weight update at each epoch is not the same as in TensorFlow: every intermediate output is identical for the first batch, and over the next few batches the intermediate outputs start to deviate very slowly. I then inspected the gradients and noticed that they differ from the very first batch.

Then I ran both models for just the first batch. Both gave the same loss, but when I calculated the gradients, they were not the same, which is strange.
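For reference, this is roughly how I compare the two sides: after `loss.backward()` I dump every PyTorch parameter gradient by name, so each tensor can be checked element-wise against the values produced by TensorFlow. This is a minimal sketch with a made-up toy model and a hypothetical `dump_grads` helper, not the actual code from the link:

```python
import torch

# Hypothetical helper: after loss.backward(), collect each parameter's
# gradient by name so it can be compared element-wise with the values
# saved on the TensorFlow side (e.g. from tf.GradientTape).
def dump_grads(model):
    return {name: p.grad.detach().clone()
            for name, p in model.named_parameters()
            if p.grad is not None}

# Toy example with a fixed seed and a fixed input batch, so the
# comparison is deterministic:
torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
x = torch.ones(4, 3)          # fixed input batch
loss = model(x).mean()
loss.backward()
grads = dump_grads(model)

# Each tensor can now be compared with its TensorFlow counterpart, e.g.
# torch.allclose(grads["weight"], tf_grad_as_tensor, atol=1e-6)
```

With gradients keyed by parameter name, it is easy to see exactly which layer's gradient diverges first.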

My question is: what might cause this discrepancy? Can anybody give me a clue?

Here is a short, reproducible script covering only one batch:

https://drive.google.com/file/d/1igufiDJIG8Q1dQ0g8em0ec9GMXU8MUb9/view?usp=sharing

**important things to notice**:

`loss = -wd_loss + 10 * gradient_penalty`

Notice in the output of both implementations that `wd_loss` and `gradient_penalty` are the same, but the gradients of the weights are different.
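One thing I am unsure about: the `gradient_penalty` term needs second-order gradients, and in PyTorch that only works if the inner `torch.autograd.grad` call is made with `create_graph=True`. Forgetting it produces exactly this symptom: the penalty *value* is unchanged, but no gradient from the penalty flows back into the weights. A minimal sketch of the penalty (my own toy critic, not the linked code) looks like this:

```python
import torch

# Minimal WGAN-GP-style gradient penalty sketch (toy code, not the
# author's implementation). The key detail is create_graph=True on the
# inner autograd.grad call: without it, the penalty's value is the
# same, but d(penalty)/d(weights) is silently missing from backward().
def gradient_penalty(critic, source, target):
    alpha = torch.rand(source.size(0), 1)                  # random interpolation weights
    interp = (alpha * source + (1 - alpha) * target).requires_grad_(True)
    out = critic(interp)
    grads, = torch.autograd.grad(
        outputs=out, inputs=interp,
        grad_outputs=torch.ones_like(out),
        create_graph=True,  # keep the graph so second-order grads exist
    )
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

torch.manual_seed(0)
critic = torch.nn.Linear(2, 1)
s, t = torch.randn(8, 2), torch.randn(8, 2)
wd_loss = critic(s).mean() - critic(t).mean()
loss = -wd_loss + 10 * gradient_penalty(critic, s, t)      # loss from the question
loss.backward()                                            # weights now get penalty gradients too
```

If the TensorFlow side builds the penalty gradient inside the graph (as `tf.gradients` does by default) while the PyTorch side does not, the losses match but the weight gradients will not.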

Please run it and let me know if you run into any problems.

This is the original paper:

https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/17155

Original code: https://github.com/RockySJ/WDGRL/blob/master/toy/wd.py