Poor Convergence on PyTorch compared to TensorFlow using Adam Optimizer

I am providing a reproducible example in which a PyTorch model converges noticeably worse than the equivalent TensorFlow model. Both use the Adam optimizer with learning rate = 0.001, betas = (0.9, 0.999), and epsilon = 1e-8.

The model is:

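import torch
import torch.nn as nn
convLayer = nn.Conv1d(in_channels=24, out_channels=128, kernel_size=15, stride=1, padding=7)
convLayerGate = nn.Conv1d(in_channels=24, out_channels=128, kernel_size=15, stride=1, padding=7)
# Gated linear unit; x is the input tensor of shape (batch, 24, length)
GLU = convLayer(x) * torch.sigmoid(convLayerGate(x))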

Input and target for both PyTorch and TensorFlow (for comparison and reproducibility):

input = np.random.randn(1, 24, 128)
output = np.random.randn(1, 128, 128)
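For anyone who wants to run the PyTorch side end to end, here is a minimal sketch of the setup described above. The post does not say which loss was used or for how many steps, so `nn.MSELoss` and the 200-step loop are assumptions, as are the fixed seeds and the short variable names.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
np.random.seed(0)

# Same shapes as in the post. randn returns float64, so cast to float32
# to match the default dtype of the Conv1d parameters.
x = torch.from_numpy(np.random.randn(1, 24, 128)).float()
y = torch.from_numpy(np.random.randn(1, 128, 128)).float()

conv = nn.Conv1d(24, 128, kernel_size=15, stride=1, padding=7)
gate = nn.Conv1d(24, 128, kernel_size=15, stride=1, padding=7)

params = list(conv.parameters()) + list(gate.parameters())
optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8)
loss_fn = nn.MSELoss()  # assumption: the post does not name the loss

losses = []
for step in range(200):  # assumption: the step count is arbitrary
    optimizer.zero_grad()
    pred = conv(x) * torch.sigmoid(gate(x))  # gated linear unit
    loss = loss_fn(pred, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.6f}, last loss {losses[-1]:.6f}")
```

The absolute numbers from this sketch are not expected to match the ones reported in the thread, since the loss, seed, and training length are guesses.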

Loss in TensorFlow:

Loss in PyTorch:

TensorFlow converges to a loss of 0.0030888948
PyTorch converges to a loss of 0.012638479471

Implementing the same model in TensorFlow makes the difference easy to see.

Did you make sure that you initialized the layers identically in PyTorch and TensorFlow?
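As a rough sketch of why this is worth checking: to my knowledge, `nn.Conv1d` by default draws its weights from a uniform range of plus or minus 1/sqrt(fan_in) and gives the bias a small random value, while the TensorFlow/Keras convolution layers default to Glorot uniform weights and a zero bias. These defaults can change between versions, so treat the formulas below as assumptions to verify against your installed versions.

```python
import math

in_channels, out_channels, kernel_size = 24, 128, 15
fan_in = in_channels * kernel_size    # inputs feeding one output value
fan_out = out_channels * kernel_size  # outputs fed by one input value

# Assumed PyTorch default for Conv1d weights: U(-1/sqrt(fan_in), 1/sqrt(fan_in))
pytorch_weight_bound = 1.0 / math.sqrt(fan_in)

# Glorot (Xavier) uniform: U(-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out)))
glorot_bound = math.sqrt(6.0 / (fan_in + fan_out))

print(f"PyTorch default weight bound: {pytorch_weight_bound:.6f}")
print(f"Glorot uniform weight bound:  {glorot_bound:.6f}")
```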

I used the default initialization for both PyTorch and TensorFlow. Even if the initialization schemes are not the same, it seems unreasonable that TensorFlow converges to a loss roughly 4 times lower than PyTorch's for such a simple model.

Initialization matters a lot… please try with the same initial values.
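One way to get literally the same initial values in both frameworks is to generate the weights once with NumPy and load them on each side. The sketch below only covers the layout conversion. It assumes the TensorFlow model uses a Keras `Conv1D` layer, which stores its kernel as (kernel_size, in_channels, filters) and by default expects channels-last input; the thread does not say which TensorFlow API was used.

```python
import numpy as np

rng = np.random.default_rng(0)
out_channels, in_channels, kernel_size = 128, 24, 15

# One shared set of initial values in PyTorch's Conv1d weight layout:
# (out_channels, in_channels, kernel_size)
w_torch = rng.standard_normal((out_channels, in_channels, kernel_size)).astype(np.float32)

# Keras Conv1D kernel layout: (kernel_size, in_channels, filters)
w_tf = np.transpose(w_torch, (2, 1, 0))

# PyTorch input is (batch, channels, length); Keras default is (batch, length, channels)
x_torch = rng.standard_normal((1, in_channels, 128)).astype(np.float32)
x_tf = np.transpose(x_torch, (0, 2, 1))

# Loading (not executed here):
#   PyTorch: with torch.no_grad(): conv.weight.copy_(torch.from_numpy(w_torch))
#   Keras:   layer.set_weights([w_tf])  # for a built layer with use_bias=False
print(w_torch.shape, w_tf.shape, x_torch.shape, x_tf.shape)
```

With identical weights and inputs, the first forward pass should produce matching outputs (up to the channel transpose), which is a quick check to run before comparing optimizers.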

To level the playing field between TensorFlow and PyTorch:

  1. Both now use the “Glorot Uniform” initializer.
  2. Bias is turned off for both models.
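On the PyTorch side, the two changes above could look roughly like this. `make_conv` is a hypothetical helper name used only for illustration; `nn.init.xavier_uniform_` is PyTorch's name for Glorot uniform initialization.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_conv():
    # bias=False turns the bias off; xavier_uniform_ is Glorot uniform
    conv = nn.Conv1d(24, 128, kernel_size=15, stride=1, padding=7, bias=False)
    nn.init.xavier_uniform_(conv.weight)
    return conv

conv_layer = make_conv()
conv_gate = make_conv()

# Glorot uniform samples from U(-bound, bound) with this bound
fan_in, fan_out = 24 * 15, 128 * 15
bound = math.sqrt(6.0 / (fan_in + fan_out))
max_abs_weight = conv_layer.weight.abs().max().item()
print(f"bound {bound:.6f}, largest |weight| {max_abs_weight:.6f}")
```

Note that this matches the initialization scheme, not the actual initial values, so the two frameworks still start from different random draws.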

New TensorFlow model loss function:

New PyTorch model loss function:

TensorFlow minimum loss: 0.00354829
PyTorch minimum loss: 0.011800370