Inconsistency in training between PyTorch and Keras under the same setting

I implemented a simple CNN architecture in Keras and PyTorch and trained both models with exactly the same hyperparameters on the same CIFAR10 data. However, the training behaved very differently. One epoch took about 1 second for the Keras model, but about 15 seconds for the PyTorch model. Overfitting happened in the Keras case but not in the PyTorch case (although I did not use weight decay). Also, PyTorch used only a small fraction of the GPU memory, while Keras occupied all of it during training. All experiments were run on the same machine with a single GeForce RTX 2080 Ti.

Here is part of the scripts for the PyTorch and Keras experiments. The full scripts can be found at https://gist.github.com/Xiuyu-Li/cd99c7d75e9b705c599d25b412593fed
PyTorch

def train(trainloader, model, criterion, optimizer, epoch, device):
    model.train()
    train_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        # accumulate the summed per-sample loss and count correct predictions
        train_loss += loss.item() * targets.size(0)
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

    # average loss and accuracy (%) over the epoch
    return train_loss/total, 100.*correct/total

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = small_cnn(num_classes, num_conv)
model = model.to(device)

optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
criterion = nn.CrossEntropyLoss()

Keras

input_shape = x_train.shape[1:]

model = small_cnn(input_shape, num_classes, num_conv=num_conv)

optimizer = tf.keras.optimizers.SGD(lr=lr, momentum=momentum)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
model.summary()
model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(x_test, y_test),
    shuffle=True)

I checked the correctness of the implemented model architectures, and they appear to be the same:
PyTorch

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 32, 30, 30]             896
         MaxPool2d-2           [-1, 32, 15, 15]               0
            Conv2d-3           [-1, 32, 13, 13]           9,248
         MaxPool2d-4             [-1, 32, 6, 6]               0
            Conv2d-5             [-1, 32, 4, 4]           9,248
         MaxPool2d-6             [-1, 32, 2, 2]               0
            Linear-7                   [-1, 64]           8,256
            Linear-8                   [-1, 10]             650
================================================================
Total params: 28,298
Trainable params: 28,298
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 0.33
Params size (MB): 0.11
Estimated Total Size (MB): 0.45
----------------------------------------------------------------

Keras

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 30, 30, 32)        896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 13, 13, 32)        9248      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 32)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 4, 4, 32)          9248      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 2, 2, 32)          0         
_________________________________________________________________
flatten (Flatten)            (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650       
=================================================================
Total params: 28,298
Trainable params: 28,298
Non-trainable params: 0
_________________________________________________________________

What is causing these huge differences in training? Could it be related to potentially different implementations of the optimizer and the loss function in PyTorch and Keras?

Regarding the GPU memory usage: that’s the default behavior of TensorFlow, which pre-allocates the device memory, and it doesn’t mean that the complete GPU memory will actually be used.
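If you want the reported memory usage to be comparable, you can opt out of the pre-allocation. A minimal sketch for TF 2.x (it has to run before the GPU is used for the first time):

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving the whole device up front.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)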

By skimming through your code I cannot find any obvious issues, so you could load the Keras parameters into the PyTorch model and compare the outputs for a static input to make sure the architectures are equal.
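A rough sketch of that check (hypothetical names: keras_model and torch_model stand for the two already-built networks, and the layer order is assumed to match one-to-one):

import numpy as np
import torch

# Copy the Keras weights into the PyTorch model.
# Keras stores conv kernels as (H, W, in, out) and dense kernels as (in, out),
# while PyTorch expects (out, in, H, W) and (out, in), so transpose accordingly.
keras_weights = [w for layer in keras_model.layers for w in layer.get_weights()]
with torch.no_grad():
    for p, w in zip(torch_model.parameters(), keras_weights):
        if w.ndim == 4:
            w = w.transpose(3, 2, 0, 1)
        elif w.ndim == 2:
            w = w.transpose(1, 0)
        p.copy_(torch.from_numpy(np.ascontiguousarray(w)))

# Compare the outputs on one static CIFAR10-sized input (NHWC for Keras, NCHW for PyTorch).
x = np.random.rand(1, 32, 32, 3).astype(np.float32)
torch_model.eval()
out_keras = keras_model(x).numpy()
out_torch = torch_model(torch.from_numpy(x.transpose(0, 3, 1, 2))).detach().numpy()
print(np.abs(out_keras - out_torch).max())

Note that the dense layer right after the flatten may still disagree even with identical weights, because Keras flattens in NHWC order while PyTorch flattens in NCHW order, so its input features would have to be re-permuted as well.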

Regarding the different optimizer implementations: since you’re using SGD in both cases, that should not be the cause.
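If you also want to rule out the loss functions, a quick sanity check (made-up logits and labels, nothing from your script) is to feed the same values to both criteria; CategoricalCrossentropy(from_logits=True) on one-hot targets and nn.CrossEntropyLoss on integer class indices should return the same mean loss:

import numpy as np
import tensorflow as tf
import torch
import torch.nn as nn

logits = np.random.randn(8, 10).astype(np.float32)          # fake batch of logits
labels = np.random.randint(0, 10, size=8).astype(np.int64)  # fake class indices

# Keras: one-hot targets, logits passed in directly
keras_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(
    tf.one_hot(labels, 10), logits).numpy()

# PyTorch: integer class indices
torch_loss = nn.CrossEntropyLoss()(torch.from_numpy(logits),
                                   torch.from_numpy(labels)).item()

print(keras_loss, torch_loss)  # should agree up to float precision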

  1. Are the loss values aligned in both cases, i.e. is the difference only in the training time and not in the loss values?
  2. What does your model look like? Make sure that the model parameters are initialized to the same values (e.g. constant values) for a fair comparison; see the sketch below.
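A minimal sketch of point 2 (again with the hypothetical names keras_model / torch_model for the two already-built networks; constant initialization is only meant for this debugging comparison, not for real training):

import numpy as np
import torch.nn as nn

# PyTorch: set every conv/linear weight to the same constant and zero the biases
def init_const(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.constant_(m.weight, 0.05)
        nn.init.constant_(m.bias, 0.0)

torch_model.apply(init_const)

# Keras: overwrite the already-built weights with the same constants
for layer in keras_model.layers:
    ws = layer.get_weights()
    if ws:  # Conv2D / Dense layers have [kernel, bias]
        layer.set_weights([np.full_like(ws[0], 0.05), np.zeros_like(ws[1])])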