I implemented a simple CNN architecture in both Keras and PyTorch and trained the two models with exactly the same hyperparameters on the same CIFAR10 data. However, training behaved very differently. The Keras model took only about 1 second per epoch, while the PyTorch model took about 15 seconds. Overfitting happened in the Keras case but not in the PyTorch case (I did not use weight decay in either). Also, PyTorch used only a small fraction of the GPU memory, while Keras occupied all of the GPU memory during training. All experiments were run on the same machine with a single GeForce RTX 2080 Ti.
Here are parts of the scripts for the PyTorch and Keras experiments. The full scripts can be found at https://gist.github.com/Xiuyu-Li/cd99c7d75e9b705c599d25b412593fed
PyTorch
def train(trainloader, model, criterion, optimizer, epoch, device):
    model.train()
    train_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * targets.size(0)
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
    return train_loss/total, 100.*correct/total

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = small_cnn(num_classes, num_conv)
model = model.to(device)
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
criterion = nn.CrossEntropyLoss()
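For reference, the loss bookkeeping in `train` can be sketched in plain Python. This assumes `criterion` returns the per-batch mean (the default reduction of `nn.CrossEntropyLoss`), so multiplying by the batch size and dividing by `total` gives a sample-weighted epoch mean. The `batches` values below are made up for illustration and `epoch_mean_loss` is a hypothetical helper, not part of the scripts.

```python
def epoch_mean_loss(batches):
    """Sample-weighted mean loss, mirroring
    train_loss += loss.item() * targets.size(0) followed by train_loss / total."""
    weighted_sum = 0.0
    total = 0
    for mean_loss, batch_size in batches:
        weighted_sum += mean_loss * batch_size
        total += batch_size
    return weighted_sum / total


# Hypothetical (batch mean loss, batch size) pairs; the last batch is smaller.
batches = [(2.0, 128), (1.0, 128), (4.0, 64)]

weighted = epoch_mean_loss(batches)
# A plain average of batch means over-weights the small final batch.
unweighted = sum(loss for loss, _ in batches) / len(batches)

print(weighted, unweighted)
```

The two numbers differ only when batch sizes differ, so this bookkeeping matters mainly for the last, partial batch of an epoch.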
Keras
input_shape = x_train.shape[1:]
model = small_cnn(input_shape, num_classes, num_conv=num_conv)
optimizer = tf.keras.optimizers.SGD(lr=lr, momentum=momentum)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
model.summary()
model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(x_test, y_test),
    shuffle=True)
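One difference between the two snippets is the target format: `nn.CrossEntropyLoss` takes integer class indices, while `CategoricalCrossentropy(from_logits=True)` takes one-hot targets (assuming `y_train` is one-hot encoded, which the excerpt does not show). A minimal pure-Python sketch, assuming both compute the batch mean of the softmax cross-entropy, suggests the two forms give the same value. The helper names and the example logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def ce_from_index(batch_logits, class_indices):
    """Mean cross-entropy with integer class targets."""
    losses = [-log_softmax(l)[t] for l, t in zip(batch_logits, class_indices)]
    return sum(losses) / len(losses)


def ce_from_one_hot(batch_logits, one_hot_targets):
    """Mean cross-entropy with one-hot targets."""
    losses = [
        -sum(y * lp for y, lp in zip(t, log_softmax(l)))
        for l, t in zip(batch_logits, one_hot_targets)
    ]
    return sum(losses) / len(losses)


# Made-up logits for a batch of two samples and three classes.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
indices = [0, 2]
one_hot = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

loss_index = ce_from_index(logits, indices)
loss_one_hot = ce_from_one_hot(logits, one_hot)
print(loss_index, loss_one_hot)
```

If that assumption holds, the target encoding alone would not explain a difference in training behaviour.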
I checked the correctness of the implemented model architectures, and it seems like they are the same:
PyTorch
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 32, 30, 30]             896
         MaxPool2d-2           [-1, 32, 15, 15]               0
            Conv2d-3           [-1, 32, 13, 13]           9,248
         MaxPool2d-4             [-1, 32, 6, 6]               0
            Conv2d-5             [-1, 32, 4, 4]           9,248
         MaxPool2d-6             [-1, 32, 2, 2]               0
            Linear-7                   [-1, 64]           8,256
            Linear-8                   [-1, 10]             650
================================================================
Total params: 28,298
Trainable params: 28,298
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 0.33
Params size (MB): 0.11
Estimated Total Size (MB): 0.45
----------------------------------------------------------------
Keras
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 30, 30, 32)        896
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 15, 15, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 13, 13, 32)        9248
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 32)          0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 4, 4, 32)          9248
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 2, 2, 32)          0
_________________________________________________________________
flatten (Flatten)            (None, 128)               0
_________________________________________________________________
dense (Dense)                (None, 64)                8256
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650
=================================================================
Total params: 28,298
Trainable params: 28,298
Non-trainable params: 0
_________________________________________________________________
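The shapes and parameter counts in both summaries can be cross-checked by hand. The sketch below assumes 3x3 convolutions with stride 1 and no padding, followed by 2x2 max pooling, on 32x32x3 inputs. Those settings are inferred from the output shapes above rather than taken from the scripts, and the helper functions are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output channel."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return (in_features + 1) * out_features


def conv_out(size, kernel):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


size, channels = 32, 3  # CIFAR10 images are 32x32 RGB
total = 0
shapes = []
for _ in range(3):  # three conv + pool stages, 32 filters each
    total += conv_params(channels, 32, 3)
    channels = 32
    size = conv_out(size, 3)
    shapes.append(size)
    size = pool_out(size)
    shapes.append(size)

flat = size * size * channels  # input features of the first dense layer
total += dense_params(flat, 64) + dense_params(64, 10)

print(shapes, flat, total)
```

Under these assumptions the spatial sizes follow 30, 15, 13, 6, 4, 2, the flattened feature count is 128, and the total matches the 28,298 parameters reported by both frameworks.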
What is causing these huge differences in training? Could they be related to potentially different implementations of the optimizer and loss function in PyTorch and Keras?