PyTorch Adam performs worse than Tensorflow Adam

Hey guys. I have checked similar posts on this matter and tried to simplify the problem as much as possible. I've spent a few days on it but still can't figure it out, so I would appreciate your help.

I was porting a simple Sequential neural net from TensorFlow to PyTorch for binary text sentiment classification.

I reduced it to 5 samples of encoded and padded text.

Init variables (shared by both versions)

batch_size = 5
num_epochs = 20
n_embeddings = 3000
embedding_dim = 16

X = [[2, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 8, 20, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [22, 23, 24, 8, 25, 26, 27, 28, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [29, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 8, 42, 43, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
y = [[1], [1], [1], [1], [1]]

TensorFlow code

import numpy as np
import tensorflow as tf

#----------------MODEL----------------
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(n_embeddings, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
#----------------MODEL----------------


#----------------OPTIMIZER & LOSS----------------
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam()
             )
#----------------OPTIMIZER & LOSS----------------


#----------------DATA----------------
# Prepare the data
train_x_prepared = np.array(X)
train_y_prepared = np.array(y)

print('The data is prepared for training!\n')
#----------------DATA----------------


#----------------TRAINING----------------
print('Training:')
history = model.fit(train_x_prepared, train_y_prepared, batch_size=batch_size, epochs=num_epochs)
#----------------TRAINING----------------

PyTorch code

import torch
from torch import nn, optim

#----------------MODEL----------------
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(n_embeddings, embedding_dim)
        self.pooling = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(embedding_dim, 1)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(0, 2, 1)
        x = self.pooling(x)
        x = x.squeeze(2)
        x = self.fc(x)
        x = self.activation(x)
        return x

torch_model = Net()
#----------------MODEL----------------


#----------------OPTIMIZER & LOSS----------------
criterion = nn.BCELoss()
optimizer = optim.Adam(torch_model.parameters(), eps=1e-07)  # eps=1e-07 matches the Keras Adam default (PyTorch's default is 1e-08)
#----------------OPTIMIZER & LOSS----------------


#----------------DATA----------------
torch_train_x_prepared = torch.tensor(X).long()
torch_train_y_prepared = torch.tensor(y).float()

print('The data is prepared for training!\n')
#----------------DATA----------------


#----------------TRAINING----------------
print('Training:')

for epoch in range(num_epochs):
    running_loss = 0.0

    for i in range(0, len(torch_train_x_prepared), batch_size):
        batch_x = torch_train_x_prepared[i:i+batch_size]
        batch_y = torch_train_y_prepared[i:i+batch_size]
        
        optimizer.zero_grad()
        outputs = torch_model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch: {epoch+1}/{num_epochs}, loss: {running_loss / (len(torch_train_x_prepared) / batch_size)}")

print("Training is finished")
#----------------TRAINING----------------

TensorFlow results

Training:
Epoch 1/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 787ms/step - loss: 0.6921
Epoch 2/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - loss: 0.6723
Epoch 3/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - loss: 0.6530
Epoch 4/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step - loss: 0.6340
Epoch 5/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - loss: 0.6153
Epoch 6/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - loss: 0.5970
Epoch 7/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - loss: 0.5789
Epoch 8/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - loss: 0.5612
Epoch 9/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - loss: 0.5437
Epoch 10/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - loss: 0.5266
Epoch 11/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - loss: 0.5097
Epoch 12/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - loss: 0.4932
Epoch 13/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - loss: 0.4770
Epoch 14/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - loss: 0.4612
Epoch 15/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - loss: 0.4457
Epoch 16/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - loss: 0.4305
Epoch 17/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - loss: 0.4157
Epoch 18/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - loss: 0.4013
Epoch 19/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - loss: 0.3872
Epoch 20/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - loss: 0.3735

PyTorch results

Training:
Epoch: 1/20, loss: 0.7581332325935364
Epoch: 2/20, loss: 0.7515153884887695
Epoch: 3/20, loss: 0.7449377775192261
Epoch: 4/20, loss: 0.738400936126709
Epoch: 5/20, loss: 0.7319058179855347
Epoch: 6/20, loss: 0.7254530191421509
Epoch: 7/20, loss: 0.719042956829071
Epoch: 8/20, loss: 0.712676465511322
Epoch: 9/20, loss: 0.706354022026062
Epoch: 10/20, loss: 0.7000761032104492
Epoch: 11/20, loss: 0.6938431859016418
Epoch: 12/20, loss: 0.6876559257507324
Epoch: 13/20, loss: 0.6815144419670105
Epoch: 14/20, loss: 0.6754195690155029
Epoch: 15/20, loss: 0.6693712472915649
Epoch: 16/20, loss: 0.6633699536323547
Epoch: 17/20, loss: 0.6574161648750305
Epoch: 18/20, loss: 0.6515097618103027
Epoch: 19/20, loss: 0.6456514596939087
Epoch: 20/20, loss: 0.6398409605026245

P.S. I understand that the epoch 1 loss can vary due to random weight initialization, but look at the convergence: TensorFlow Adam converges much faster for some reason. And this is only a test on 5 samples; on a realistic dataset the difference is huge.


Could you set the weights to be the same to rule out weight initialization effects?

Here:

TensorFlow

Training:
Epoch 1/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 881ms/step - loss: 0.1064
Epoch 2/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 0.1006
Epoch 3/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 0.0951
Epoch 4/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - loss: 0.0898
Epoch 5/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step - loss: 0.0848
Epoch 6/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step - loss: 0.0802
Epoch 7/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 0.0757
Epoch 8/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - loss: 0.0716
Epoch 9/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - loss: 0.0676
Epoch 10/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - loss: 0.0639
Epoch 11/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - loss: 0.0605
Epoch 12/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - loss: 0.0572
Epoch 13/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 0.0542
Epoch 14/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 0.0513
Epoch 15/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 0.0486
Epoch 16/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 0.0461
Epoch 17/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 0.0437
Epoch 18/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - loss: 0.0415
Epoch 19/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 0.0395
Epoch 20/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step - loss: 0.0375

PyTorch

Training:
Epoch: 1/20, loss: 0.10643015801906586
Epoch: 2/20, loss: 0.1055297702550888
Epoch: 3/20, loss: 0.10463515669107437
Epoch: 4/20, loss: 0.10374639183282852
Epoch: 5/20, loss: 0.10286366939544678
Epoch: 6/20, loss: 0.10198686271905899
Epoch: 7/20, loss: 0.10111621767282486
Epoch: 8/20, loss: 0.1002516895532608
Epoch: 9/20, loss: 0.09939347207546234
Epoch: 10/20, loss: 0.09854154288768768
Epoch: 11/20, loss: 0.09769599139690399
Epoch: 12/20, loss: 0.09685685485601425
Epoch: 13/20, loss: 0.09602418541908264
Epoch: 14/20, loss: 0.09519802033901215
Epoch: 15/20, loss: 0.09437845647335052
Epoch: 16/20, loss: 0.09356546401977539
Epoch: 17/20, loss: 0.09275911748409271
Epoch: 18/20, loss: 0.09195945411920547
Epoch: 19/20, loss: 0.09116645902395248
Epoch: 20/20, loss: 0.09038019925355911

Again, the optimizers are both Adam; nothing is changed there. You can still see the difference in convergence.

The Dense layer in TensorFlow and nn.Linear in PyTorch are defined differently. How do you account for this in your network initialization? They're defined as follows:

TF: x @ A + b
PyTorch: x @ A^T + b

Yes, I transpose my weights back:

# Set the embedding weights
pt_model.embedding.weight.data = torch.from_numpy(embedding_weights)

# Set the linear layer weights and bias
pt_model.fc.weight.data = torch.from_numpy(kernel_weights.T)
pt_model.fc.bias.data = torch.from_numpy(kernel_biases)
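(For reference, extracting those NumPy arrays from the Keras model would look roughly like this; this is a sketch reusing `model` from the TF code above, and the variable names are assumed from my snippet:)

# Sketch: pull the Keras weights out as NumPy arrays (assumed extraction, names match the snippet above)
embedding_weights = model.layers[0].get_weights()[0]           # Embedding matrix, shape (n_embeddings, embedding_dim)
kernel_weights, kernel_biases = model.layers[2].get_weights()  # Dense kernel (in, out) and bias (out,)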

Does anyone have any idea why this is happening?

I'm curious if this could be due to the BCE loss defaults being different. The default reduction for TF is sum_over_batch_size (tf.keras.losses.BinaryCrossentropy, TensorFlow v2.16.1 docs), while the default for PyTorch is mean (BCELoss, PyTorch 2.4 docs).

My guess is this would make the effective TF per-example learning rate much higher, which may explain why it looks like it converges faster in epoch 1.

Hey Jane, I thought about this before posting here. I set the reduction of both BCE losses to "sum" and it doesn't fix the difference.
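Concretely, the change I tried was along these lines (a sketch reusing the names from my code above; the rest of the training code stays the same):

# Force the same explicit reduction in both frameworks
tf_loss = tf.keras.losses.BinaryCrossentropy(reduction='sum')
model.compile(loss=tf_loss, optimizer=tf.keras.optimizers.Adam())

criterion = nn.BCELoss(reduction='sum')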

I've put your code in a Colab to repro. The loss formulation in the PyTorch repro looks off, but after I switched to sum, the per-epoch loss changes got larger than before. See Google Colab

Either way, it is hard to reason about whether this discrepancy is in Adam or due to any of the previous layers in this large e2e repro. If the discrepancy were in the optimizer step, one way to confirm is to feed the same weights and grads into the TF and PT Adam and see if there's a significant difference.
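For example, something like this minimal sketch (toy values of my own, not taken from the repro): run a single optimizer step on one identical parameter/gradient pair in each framework and compare the updated weights.

import numpy as np
import tensorflow as tf
import torch

# One identical parameter and gradient for both frameworks
w0 = np.array([0.5, -0.3, 0.8], dtype=np.float32)
g0 = np.array([0.1, -0.2, 0.05], dtype=np.float32)

# TensorFlow: one Adam step
tf_var = tf.Variable(w0)
tf_opt = tf.keras.optimizers.Adam()
tf_opt.apply_gradients([(tf.constant(g0), tf_var)])

# PyTorch: one Adam step (eps set to the Keras default)
pt_var = torch.nn.Parameter(torch.tensor(w0))
pt_opt = torch.optim.Adam([pt_var], eps=1e-07)
pt_var.grad = torch.tensor(g0)
pt_opt.step()

print(tf_var.numpy(), pt_var.detach().numpy())  # compare the updated weights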

Just came back to this and decided to try to figure this out one more time.

Looking back at it, the default reduction param is in fact the same in both versions.
TensorFlow's "sum_over_batch_size" means "sum all of the elements and divide by the batch size", so it is basically the same as PyTorch's default "mean".
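A quick sanity check of that (a toy sketch with made-up probabilities, not from my actual code):

import numpy as np
import tensorflow as tf
import torch

preds = np.array([[0.9], [0.8], [0.7], [0.6], [0.95]], dtype=np.float32)
targets = np.ones((5, 1), dtype=np.float32)

# Default reductions in each framework
tf_loss = tf.keras.losses.BinaryCrossentropy()(targets, preds)             # sum_over_batch_size
pt_loss = torch.nn.BCELoss()(torch.tensor(preds), torch.tensor(targets))   # mean

print(float(tf_loss), float(pt_loss))  # should agree up to float precision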

That's why I said that when I tried changing both reduction parameters in TF and PT to "sum", the relative differences were pretty much the same. You see a different picture because you change only PT's reduction to "sum" but leave the TF reduction at "sum_over_batch_size", which is the "mean".

Anyway, I will keep trying to figure this out, and if I do, I will post about it.