Huge performance difference between PyTorch and Keras

I am working on a Kaggle dataset. In one of the kernels, someone implemented a CNN in Keras that reaches 93% validation accuracy. I tried to reproduce the architecture in PyTorch, but my PyTorch version only gets 70% accuracy. Is there something I missed from the Keras version?

Here is the original keras code:

model = Models.Sequential()



SVG(model_to_dot(model).create(prog='dot', format='svg'))

Here is my Pytorch code:

import torch.nn as nn
import torch.nn.functional as F

# define the CNN architecture from this kaggle example
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # first convolutional layer (input is a 3x150x150 image tensor)
        self.conv1 = nn.Conv2d(3, 200, 3)
        self.conv1 = nn.DataParallel(self.conv1)
        # second convolutional layer (200 -> 180 channels)
        self.conv2 = nn.Conv2d(200, 180, 3)
        self.conv2 = nn.DataParallel(self.conv2)

        # remaining convolutional layers (180 -> 180 -> 140 -> 100 -> 50 channels)
        self.conv3 = nn.Conv2d(180, 180, 3)
        self.conv3 = nn.DataParallel(self.conv3)
        self.conv4 = nn.Conv2d(180, 140, 3)
        self.conv4 = nn.DataParallel(self.conv4)
        self.conv5 = nn.Conv2d(140, 100, 3)
        self.conv5 = nn.DataParallel(self.conv5)
        self.conv6 = nn.Conv2d(100, 50, 3)
        self.conv6 = nn.DataParallel(self.conv6)

        # max pooling layer
        self.pool = nn.MaxPool2d(5, 5)
        # linear layer (50 * 4 * 4 = 800 -> 180)
        self.fc1 = nn.Linear(800, 180)
        self.fc1 = nn.DataParallel(self.fc1)
        # linear layers (180 -> 100 -> 50 -> 6)
        self.fc2 = nn.Linear(180, 100)
        self.fc2 = nn.DataParallel(self.fc2)
        self.fc3 = nn.Linear(100, 50)
        self.fc3 = nn.DataParallel(self.fc3)
        self.fc4 = nn.Linear(50, 6)
        # dropout layer (p=0.5)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # add sequence of convolutional and max pooling layers
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = F.relu(self.conv5(x))
        x = self.pool(F.relu(self.conv6(x)))

        # flatten image input
        x = x.view(-1, 800)
        # add 1st hidden layer, with relu activation function
        x = F.relu(self.fc1(x))

        # add 2nd hidden layer, with relu activation function
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        # add dropout layer
        x = self.dropout(x)
        x = self.fc4(x)
        return x

# create a complete CNN
model = Net()

# move tensors to GPU if CUDA is available

import torch.optim as optim

# specify loss function (categorical cross-entropy)
criterion = nn.CrossEntropyLoss()

# specify optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-4)

May I ask why you wrap each layer in DataParallel? In my experience this results in many scatter/gather operations, which slows down execution a lot.

You can just wrap the whole model after instantiating it.

E.g. after model = Net() you can add a line
model = nn.DataParallel(model)
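
A minimal sketch of that pattern, in case it helps. TinyNet is a hypothetical stand-in for your Net class, and the device handling is my assumption about your setup rather than something taken from your code:

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the Net class from the question."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.fc = nn.Linear(8, 6)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = x.mean(dim=(2, 3))  # global average pool -> (N, 8)
        return self.fc(x)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyNet()

# Wrap the whole model once instead of wrapping every layer.
# With zero or one GPU the plain model is used unchanged.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to(device)

out = model(torch.randn(4, 3, 32, 32, device=device))
print(out.shape)
```

The layers inside the model stay plain nn.Conv2d / nn.Linear modules, so the batch is scattered once at the input and gathered once at the output.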

Now regarding the different results: there are several possible reasons.

  • Different weight initialization
  • Different learning rate schedule
  • Different preprocessing / data augmentation

Thanks for pointing that out. I didn’t know I could wrap the whole Net in DataParallel.

I tried a couple of runs, so I am sure it’s not about the weight initialization. The Keras version always ended above 85% validation accuracy, but the PyTorch version only reaches around 70%.

What do you mean by a different learning rate schedule?

I think I will look at the differences in data preprocessing as the next step.

I’m not too much into keras anymore. But I think it used to do some automated magic in terms of reducing learning rate if reaching a plateau.
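
For reference, in PyTorch a reduce-on-plateau schedule has to be set up explicitly. The sketch below uses torch.optim.lr_scheduler.ReduceLROnPlateau; the toy model, the factor, the patience and the validation losses are all made-up values for illustration, and I have not checked whether the Keras kernel actually uses such a schedule:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 6)  # hypothetical stand-in for the real network
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# Halve the learning rate once the validation loss has stopped improving
# for more than `patience` consecutive epochs (values are assumptions).
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

fake_val_losses = [0.50, 0.48, 0.48, 0.48, 0.48, 0.48]  # made-up plateau
lrs = []
for val_loss in fake_val_losses:
    # ... one epoch of training and validation would run here ...
    scheduler.step(val_loss)
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

If one framework lowers the learning rate on a plateau and the other keeps it fixed, the final accuracies can differ even with identical architectures.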

But 15% in absolute difference is quite a lot.

And you can simplify your network code like this (there’s no need to handle all the layers manually during the forward pass):

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        features = []
        # convolutional layers (input is a 3x150x150 image tensor)
        features += [nn.Conv2d(3, 200, 3), nn.ReLU()]
        features += [nn.Conv2d(200, 180, 3), nn.ReLU()]
        # max pooling layer (146x146 -> 29x29)
        features += [nn.MaxPool2d(5, 5)]
        features += [nn.Conv2d(180, 180, 3), nn.ReLU()]
        features += [nn.Conv2d(180, 140, 3), nn.ReLU()]
        features += [nn.Conv2d(140, 100, 3), nn.ReLU()]
        features += [nn.Conv2d(100, 50, 3), nn.ReLU()]
        # max pooling layer (21x21 -> 4x4)
        features += [nn.MaxPool2d(5, 5)]

        classification = []
        # linear layers (50 * 4 * 4 = 800 -> 180 -> 100 -> 50)
        classification += [nn.Linear(800, 180), nn.ReLU()]
        classification += [nn.Linear(180, 100), nn.ReLU()]
        classification += [nn.Linear(100, 50), nn.ReLU()]
        # dropout layer (p=0.5) before the output layer, as in the original forward pass
        classification += [nn.Dropout(0.5)]
        classification += [nn.Linear(50, 6)]

        self.features = nn.Sequential(*features)
        self.classification = nn.Sequential(*classification)

    def forward(self, x):
        # run the convolutional and max pooling layers
        x = self.features(x)
        # flatten to (batch, 800)
        x = x.view(-1, 800)
        x = self.classification(x)
        return x


It’s hard to see exactly what the differences are without seeing ‘everything’.


  • ensure image pre-processing is exactly the same (aside from color channel transpose)
  • ensure batch sizes are the same
  • ensure test/val splits are the same
  • ensure model mode is being switched between train() and eval() and back, otherwise dropout remains active during eval
  • ensure any learning rate schedules are the same
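
To illustrate the train()/eval() point, here is a small sketch with a hypothetical toy model (not the network from this thread). It only shows that dropout is random in train mode and deterministic in eval mode:

```python
import torch
import torch.nn as nn

# Hypothetical toy model that contains a dropout layer.
model = nn.Sequential(nn.Linear(20, 20), nn.Dropout(0.5))
x = torch.ones(1, 20)

model.train()  # dropout is active: repeated calls give different outputs
train_a, train_b = model(x), model(x)

model.eval()  # dropout is disabled: repeated calls give identical outputs
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

print(torch.equal(eval_a, eval_b))

model.train()  # switch back before the next training epoch
```

Forgetting model.eval() before validation usually lowers the reported accuracy, because part of the activations are zeroed at random.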

Good to learn a simpler way to define the model. Thanks.

Maybe it is because I normalized the data?
Here is his way of importing the data:

import os
import cv2
import numpy as np
from sklearn.utils import shuffle

def get_images(directory):
    Images = []
    Labels = []  # 0 for Building , 1 for forest, 2 for glacier, 3 for mountain, 4 for Sea , 5 for Street
    label = 0
    for labels in os.listdir(directory): #Main Directory where each class label is present as folder name.
        if labels == 'glacier': #Folder contain Glacier Images get the '2' class label.
            label = 2
        elif labels == 'sea':
            label = 4
        elif labels == 'buildings':
            label = 0
        elif labels == 'forest':
            label = 1
        elif labels == 'street':
            label = 5
        elif labels == 'mountain':
            label = 3
        for image_file in os.listdir(directory+labels): #Extracting the file name of the image from Class Label folder
            image = cv2.imread(directory+labels+r'/'+image_file) #Reading the image (OpenCV)
            image = cv2.resize(image,(150,150)) #Resize the image, Some images are different sizes. (Resizing is very Important)
            Images.append(image) #Collect the resized image
            Labels.append(label) #Collect its class label
    return shuffle(Images,Labels,random_state=817328462) #Shuffle the dataset you just prepared.

def get_classlabel(class_code):
    labels = {2:'glacier', 4:'sea', 0:'buildings', 1:'forest', 5:'street', 3:'mountain'}
    return labels[class_code]

Images, Labels = get_images('../input/seg_train/seg_train/') #Extract the training images from the folders.

Images = np.array(Images) #converting the list of images to numpy array.
Labels = np.array(Labels)

He didn’t do any data preprocessing. However, I normalized my data and applied a RandomResizedCrop:

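import numpy as np
import torch
from torchvision import datasets, transforms
from torch.utils.data.sampler import SubsetRandomSampler

# number of subprocesses to use for data loading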
num_workers = 0
# how many samples per batch to load
batch_size = 32
# percentage of training set to use as validation
valid_size = 0.2

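# random crop, then convert data to a normalized torch.FloatTensor
transform = transforms.Compose([
    transforms.RandomResizedCrop(150),  # size inferred from the 150x150 network input
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])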

# choose the training and test datasets

train_dir = 'seg_train'
test_dir = 'seg_test'

train_set = datasets.ImageFolder(train_dir, transform=transform)
test_set = datasets.ImageFolder(test_dir, transform=transform)

# obtain training indices that will be used for validation
num_train = len(train_set)
indices = list(range(num_train))
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

# prepare data loaders (combine dataset and sampler)
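train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size,
    sampler=train_sampler, num_workers=num_workers)
valid_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size,
    sampler=valid_sampler, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size,
    num_workers=num_workers)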

# specify the image classes
classes = train_set.classes
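
One thing worth checking in the split above: ImageFolder lists its samples sorted by class folder, so taking the first 20% of unshuffled indices gives a validation set drawn from only the first classes. The sketch below uses made-up labels (6 classes, 100 images each) rather than the real dataset, and the fixed seed is my own assumption:

```python
import numpy as np

# Hypothetical labels in the order ImageFolder would produce: sorted by class.
labels = np.repeat(np.arange(6), 100)  # 6 classes, 100 images each
valid_size = 0.2
split = int(np.floor(valid_size * len(labels)))

# Unshuffled split: validation takes the first `split` indices.
indices = np.arange(len(labels))
valid_unshuffled = labels[indices[:split]]

# Shuffled split with a fixed seed, so runs stay comparable.
rng = np.random.default_rng(0)
shuffled = rng.permutation(len(labels))
valid_shuffled = labels[shuffled[:split]]
train_shuffled = labels[shuffled[split:]]

print(np.unique(valid_unshuffled))     # classes present without shuffling
print(len(np.unique(valid_shuffled)))  # classes present after shuffling
```

With the shuffled indices, train_idx and valid_idx can be passed to SubsetRandomSampler exactly as before.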

Here is a link to the original Keras kernel, if anyone wants to try this out:

Ok, from looking at the Kaggle kernel, quite a lot is different:

  • Keras seems to use batch size of 1, your PyTorch code uses 32
  • Keras is not normalizing, PyTorch is using normalization
  • PyTorch is using RandomResizedCrop…

And here might be the issue, you don’t specify any parameters for RandomResizedCrop.
By default you end up using the following:

torchvision.transforms.RandomResizedCrop(size, scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=2)

From doc:
A crop of random size (default: of 0.08 to 1.0) of the original size and a random aspect ratio (default: of 3/4 to 4/3) of the original aspect ratio is made. This crop is finally resized to given size. This is popularly used to train the Inception networks.

The default parameters are too aggressive: a crop covering only 8% of the image may contain too little information for your CNN. I would just use the same resize method as in Keras and drop the normalization.

transform = transforms.Compose([
    transforms.Resize((150, 150)),
    transforms.ToTensor(),
])

I took out the RandomResizedCrop yesterday while keeping the normalization, and got 84% accuracy on the test set after 80 epochs last night. This morning I tried training without the normalization, and the model converged really slowly.

Epoch: 1 	Training Loss: 1.214028 	Validation Loss: 0.496487 	Validation Acc: 0.263420
Validation loss decreased (inf --> 0.496487).  Saving model ...
Epoch: 2 	Training Loss: 1.185056 	Validation Loss: 0.475918 	Validation Acc: 0.335629
Validation loss decreased (0.496487 --> 0.475918).  Saving model ...
Epoch: 3 	Training Loss: 1.185590 	Validation Loss: 0.479345 	Validation Acc: 0.276010
Epoch: 4 	Training Loss: 1.176353 	Validation Loss: 0.481123 	Validation Acc: 0.280760
Epoch: 5 	Training Loss: 1.180632 	Validation Loss: 0.478681 	Validation Acc: 0.288836
Epoch: 6 	Training Loss: 1.174871 	Validation Loss: 0.472889 	Validation Acc: 0.280998
Validation loss decreased (0.475918 --> 0.472889).  Saving model ...
Epoch: 7 	Training Loss: 1.174910 	Validation Loss: 0.485266 	Validation Acc: 0.279810
Epoch: 8 	Training Loss: 1.178336 	Validation Loss: 0.476574 	Validation Acc: 0.291449
Epoch: 9 	Training Loss: 1.174524 	Validation Loss: 0.477813 	Validation Acc: 0.294299
Epoch: 10 	Training Loss: 1.171263 	Validation Loss: 0.477734 	Validation Acc: 0.293587
Epoch: 11 	Training Loss: 1.168009 	Validation Loss: 0.476796 	Validation Acc: 0.286698
Epoch: 12 	Training Loss: 1.172514 	Validation Loss: 0.476565 	Validation Acc: 0.294062
Epoch: 13 	Training Loss: 1.173913 	Validation Loss: 0.473833 	Validation Acc: 0.291449
Epoch: 14 	Training Loss: 1.166061 	Validation Loss: 0.474874 	Validation Acc: 0.294299
Epoch: 15 	Training Loss: 1.171629 	Validation Loss: 0.481715 	Validation Acc: 0.292637
Epoch: 16 	Training Loss: 1.150386 	Validation Loss: 0.439620 	Validation Acc: 0.410451

It ended up with 70% test accuracy. I think the RandomResizedCrop was the real problem at the beginning.

However, the original Keras achieved 93% test accuracy. I tried the exact Keras code but only got 87% test acc. I guess that’s due to random initialization?


Great to hear that you found the problem!

The remaining difference can be due to different initialization, but also to a different train/test split or learning rate schedule…

Have you gone through the list of @rwightman?

  • ensure image pre-processing is exactly the same (aside from color channel transpose)
  • ensure batch sizes are the same
  • ensure test/val splits are the same
  • ensure model mode is being switched between train() and eval() and back, otherwise dropout remains active during eval
  • ensure any learning rate schedules are the same

Yeah, I did go through the list:

I did the minimum pre-processing.
Keras has a default batch size of 32, which is the size I am using right now.
I am using a 30% validation split, as in the Keras model.
I switch the model mode between train() and eval().
The learning rate is 0.0001 (1e-4).

I will do more tests on the Keras code to see if I can get 93% test accuracy.

Also make sure the networks are being initialized the same way. Identical weights would be best, but similar parameter distributions should be enough.
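
As a rough sketch of what "similar parameter distributions" could look like: Keras Conv2D and Dense layers default to Glorot (Xavier) uniform kernels and zero biases, while PyTorch uses a different default. The helper name keras_like_init and the toy model are hypothetical, and this only approximates the Keras defaults rather than reproducing the exact same weights:

```python
import torch.nn as nn


def keras_like_init(module):
    """Hypothetical helper: Glorot-uniform weights and zero biases."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


# Toy model for illustration only, not the network from this thread.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 6),
)
model.apply(keras_like_init)  # apply() visits every submodule

conv = model[0]
# Glorot limit for this conv: sqrt(6 / (fan_in + fan_out)) with fan_in=27, fan_out=72
limit = (6.0 / (27 + 72)) ** 0.5
print(conv.weight.abs().max().item() <= limit)
```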


PyTorch offers far fewer conveniences for building "everyday" architectures.

Initialization is more refined in Keras. This is huge. E.g. initialize in Keras and copy the weights and biases to PyTorch via a small wrapper function.
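
A minimal sketch of such a copy, assuming the usual layouts: Keras stores Conv2D kernels as (height, width, in, out) and Dense kernels as (in, out), while PyTorch expects (out, in, height, width) and (out, in). The NumPy arrays below are random stand-ins for what layer.get_weights() would return, so no Keras install is needed to run it:

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)

# Random stand-ins for Keras layer.get_weights() results (hypothetical values).
keras_conv_kernel = rng.random((3, 3, 3, 8), dtype=np.float32)  # (kh, kw, in, out)
keras_conv_bias = rng.random(8, dtype=np.float32)
keras_dense_kernel = rng.random((50, 6), dtype=np.float32)      # (in, out)
keras_dense_bias = rng.random(6, dtype=np.float32)

conv = nn.Conv2d(3, 8, 3)
fc = nn.Linear(50, 6)

with torch.no_grad():
    # (kh, kw, in, out) -> (out, in, kh, kw)
    conv.weight.copy_(torch.from_numpy(keras_conv_kernel).permute(3, 2, 0, 1))
    conv.bias.copy_(torch.from_numpy(keras_conv_bias))
    # (in, out) -> (out, in)
    fc.weight.copy_(torch.from_numpy(keras_dense_kernel).t())
    fc.bias.copy_(torch.from_numpy(keras_dense_bias))

# Sanity check: the PyTorch layer reproduces x @ kernel + bias.
x = rng.random((2, 50), dtype=np.float32)
expected = x @ keras_dense_kernel + keras_dense_bias
actual = fc(torch.from_numpy(x)).detach().numpy()
print(np.allclose(actual, expected, atol=1e-5))
```

One caveat: the first Dense layer after a Flatten needs extra care, because Keras flattens channels-last feature maps (H, W, C) while PyTorch flattens (C, H, W), so its input rows would have to be reordered as well.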

The optimizer may work slightly differently.

Use the same code to measure results, e.g. sklearn scores/measures. GPU metrics code can be funky. If you have label imbalances etc., make sure to use the right measure: micro, macro, ROC, etc.

Normalization: if you don't normalize and get better performance, it may be that unnormalized features with large values are important features. Or your network is underparameterized, in which case small-valued features get ignored, which can be a good thing. This happens quite often in practice.

Check that the original data split is not contaminated with train samples in the test set. This also happens a lot, e.g. with various versions of sklearn, and gets fixed over time.

Lastly, PyTorch (and others) has its bugs. E.g. MSE was broken for a year, and other losses like multilabelsoftmargin are still not easy to use and behave differently than expected.

So always check that the submodules work as expected.

Love PyTorch, but it is younger.

Thank you. Quite a simple way to define a network