Why do I get much worse results with a PyTorch model than with Keras?

I’m trying to move from Keras to PyTorch. However, with the same model and the same optimizer in both frameworks, I always get much worse results with PyTorch. Any idea why I get these strange results? Thanks a lot for your help.

The Keras model I’m using is:

import keras
from keras.layers import Input, Conv1D, AveragePooling1D, Flatten, Dense

def cnn_best(input_shape, classes):
    # From VGG16 design
    input_shape = (700, 1)
    img_input = Input(shape=input_shape)
    # Block 1
    x = Conv1D(64, 11, activation='relu', padding='same', name='block1_conv1')(img_input)
    x = AveragePooling1D(2, strides=2, name='block1_pool')(x)
    # Block 2
    x = Conv1D(128, 11, activation='relu', padding='same', name='block2_conv1')(x)
    x = AveragePooling1D(2, strides=2, name='block2_pool')(x)
    # Block 3
    x = Conv1D(256, 11, activation='relu', padding='same', name='block3_conv1')(x)
    x = AveragePooling1D(2, strides=2, name='block3_pool')(x)
    # Block 4
    x = Conv1D(512, 11, activation='relu', padding='same', name='block4_conv1')(x)
    x = AveragePooling1D(2, strides=2, name='block4_pool')(x)
    # Block 5
    x = Conv1D(512, 11, activation='relu', padding='same', name='block5_conv1')(x)
    x = AveragePooling1D(2, strides=2, name='block5_pool')(x)
    # Classification block
    x = Flatten(name='flatten')(x)
    x = Dense(4096, activation='relu', name='fc1')(x)
    x = Dense(4096, activation='relu', name='fc2')(x)
    out = Dense(classes, activation='softmax', name='predictions')(x)

    inputs = img_input
    # # Create model.
    # model = Model(inputs, x, name='cnn_best')
    # optimizer = RMSprop(lr=0.00001)
    # model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    # return model
    print('        -- model was built.')
    return inputs, out

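# Build the model from the tensors returned above (256 classes, as in the summary further down).
inputs, out = cnn_best((700, 1), 256)
model = keras.models.Model(inputs, out, name='cnn_best')
optimizer = keras.optimizers.RMSprop(lr=1e-5)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])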

The PyTorch model is:

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class ascadCNNbest(nn.Module):

    def __init__(self, num_classes):
        """ Constructor
        Args:
            num_classes: number of classes
        """
        super(ascadCNNbest, self).__init__()

        self.num_classes = num_classes
        self.traceLen = traceLen

        self.conv1 = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=5)
        self.conv2 = nn.Conv1d(64, 128, kernel_size=11, stride=1, padding=5)
        self.conv3 = nn.Conv1d(128, 256, kernel_size=11, stride=1, padding=5)
        self.conv4 = nn.Conv1d(256, 512, kernel_size=11, stride=1, padding=5)
        self.conv5 = nn.Conv1d(512, 512, kernel_size=11, stride=1, padding=5)

        self.fc1 = nn.Linear(10752, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_classes)


    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.conv2(out))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.conv3(out))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.conv4(out))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.conv5(out))
        out = F.avg_pool1d(out, 2)

        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)

        return out

criterion = nn.CrossEntropyLoss()
optimizer = optim.RMSprop(net.parameters(),lr=1e-5)

For both models I’m using the same batch size and the same optimizer (RMSprop with a learning rate of 1e-5; I’ve also tried different learning rates for PyTorch because of the much worse results, but the results were similarly bad). Thanks again for any help.
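One thing worth checking here is whether "the same optimizer" also means the same hyperparameters beyond the learning rate. As far as I know, Keras' RMSprop defaults to rho=0.9 with an epsilon of 1e-7 (the backend default), while torch.optim.RMSprop defaults to alpha=0.99 and eps=1e-8. A minimal sketch of aligning them, where the small `net` below is only a hypothetical stand-in for the real model:

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in for the real network; only the optimizer arguments matter here.
net = nn.Linear(10, 2)

# PyTorch defaults: alpha=0.99, eps=1e-8.
# Assumed Keras defaults: rho=0.9, epsilon=1e-7. Set them explicitly to match.
optimizer = optim.RMSprop(net.parameters(), lr=1e-5, alpha=0.9, eps=1e-7)

print(optimizer.defaults["alpha"], optimizer.defaults["eps"])
```

Whether this closes the gap depends on the data and model; it only removes one known difference between the two setups.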

The only difference I see between the 2 models is that you use a ReLU after the last fully connected layer in PyTorch instead of a SoftMax in Keras.

Padding modes are also different. Can’t see your loss fn and optimizer so I’m not sure if there are differences there.

Thanks a lot. Indeed, I made a mistake there. I’ve already changed it, as you can see in my code: I now use no activation instead of ReLU after the last fully connected layer in PyTorch. Since I’m using the CrossEntropyLoss loss function, I assume I don’t need a softmax activation there, right?

Thank you very much. I’ve added the loss function and optimizer for both the PyTorch and Keras cases. Would you please take a look? Regarding the padding mode in PyTorch, how can I get the “same” mode that I used in the Keras case? I thought “padding=5” did the same thing as in the Keras model, or am I wrong?

This is the summary I get for the PyTorch model:
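On the softmax question: nn.CrossEntropyLoss applies log-softmax internally, so the network should return raw logits. A small sketch that checks this equivalence on random data (the batch size and target values are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 256)              # raw fc3 outputs for a batch of 4
targets = torch.tensor([3, 17, 200, 5])   # class indices, not one-hot vectors

# CrossEntropyLoss on raw logits ...
loss_ce = nn.CrossEntropyLoss()(logits, targets)
# ... matches log-softmax followed by negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_ce, loss_manual))
```

Note that Keras' categorical_crossentropy takes one-hot labels, while CrossEntropyLoss here takes integer class indices, so labels reused from a Keras pipeline would need something like `one_hot.argmax(dim=1)` first.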

ascadCNNbest (
(conv1): Conv1d (1, 64, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((64, 1, 11), (64,)), parameters=768
(conv2): Conv1d (64, 128, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((128, 64, 11), (128,)), parameters=90240
(conv3): Conv1d (128, 256, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((256, 128, 11), (256,)), parameters=360704
(conv4): Conv1d (256, 512, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((512, 256, 11), (512,)), parameters=1442304
(conv5): Conv1d (512, 512, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((512, 512, 11), (512,)), parameters=2884096
(fc1): Linear(in_features=10752, out_features=4096), weights=((4096, 10752), (4096,)), parameters=44044288
(fc2): Linear(in_features=4096, out_features=4096), weights=((4096, 4096), (4096,)), parameters=16781312
(fc3): Linear(in_features=4096, out_features=256), weights=((256, 4096), (256,)), parameters=1048832
)

The Keras one is as below:

Layer (type)                 Output Shape              Param #
input_1 (InputLayer)         (None, 700, 1)            0
block1_conv1 (Conv1D)        (None, 700, 64)           768
block1_pool (AveragePooling1 (None, 350, 64)           0
block2_conv1 (Conv1D)        (None, 350, 128)          90240
block2_pool (AveragePooling1 (None, 175, 128)          0
block3_conv1 (Conv1D)        (None, 175, 256)          360704
block3_pool (AveragePooling1 (None, 87, 256)           0
block4_conv1 (Conv1D)        (None, 87, 512)           1442304
block4_pool (AveragePooling1 (None, 43, 512)           0
block5_conv1 (Conv1D)        (None, 43, 512)           2884096
block5_pool (AveragePooling1 (None, 21, 512)           0
flatten (Flatten)            (None, 10752)             0
fc1 (Dense)                  (None, 4096)              44044288
fc2 (Dense)                  (None, 4096)              16781312
predictions (Dense)          (None, 256)               1048832
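The two summaries agree on parameter counts, but the input layouts differ: Keras uses (batch, length, channels), while nn.Conv1d expects (batch, channels, length). A rough sketch that traces the sequence length through five conv/pool blocks in PyTorch, using freshly created layers rather than the actual model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# PyTorch layout is (batch, channels, length); a Keras-style (batch, 700, 1)
# tensor would need .permute(0, 2, 1) before being fed to Conv1d.
x = torch.zeros(2, 1, 700)

channels = [1, 64, 128, 256, 512, 512]
lengths = []
for c_in, c_out in zip(channels[:-1], channels[1:]):
    conv = nn.Conv1d(c_in, c_out, kernel_size=11, stride=1, padding=5)
    x = F.avg_pool1d(F.relu(conv(x)), 2)
    lengths.append(x.size(2))

flat = x.view(x.size(0), -1)
print(lengths, flat.size(1))  # [350, 175, 87, 43, 21] 10752
```

The lengths match the Keras output shapes, and 512 * 21 = 10752 matches the input size of fc1.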

torch.nn.ReplicationPad1d is what you should use.

Thanks a lot, dude. You mean something like this:

class ascadCNNbest(nn.Module):

    def __init__(self, num_classes):
        """ Constructor
        Args:
            num_classes: number of classes
        """
        super(ascadCNNbest, self).__init__()

        self.num_classes = num_classes
        self.traceLen = traceLen

        self.conv1 = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=0)
        self.pad1 = nn.ReflectionPad1d(5)
        self.conv2 = nn.Conv1d(64, 128, kernel_size=11, stride=1, padding=0)
        self.pad2 = nn.ReflectionPad1d(5)
        self.conv3 = nn.Conv1d(128, 256, kernel_size=11, stride=1, padding=0)
        self.pad3 = nn.ReflectionPad1d(5)
        self.conv4 = nn.Conv1d(256, 512, kernel_size=11, stride=1, padding=0)
        self.pad4 = nn.ReflectionPad1d(5)
        self.conv5 = nn.Conv1d(512, 512, kernel_size=11, stride=1, padding=0)
        self.pad5 = nn.ReflectionPad1d(5)

        self.fc1 = nn.Linear(10752, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_classes)


    def forward(self, x):
        out = F.relu(self.pad1(self.conv1(x)))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.pad2(self.conv2(out)))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.pad3(self.conv3(out)))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.pad4(self.conv4(out)))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.pad5(self.conv5(out)))
        out = F.avg_pool1d(out, 2)

        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)

        return out

Yes, that is correct :)

Thank you very much. In this case, the padding happens after the convolution. Is that a problem? (I think that is what Keras does.) How can I do the padding before the convolution?
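To pad before the convolution, the padding module is applied to the input and the convolution to the padded result, i.e. conv(pad(x)) rather than pad(conv(x)). A minimal sketch with a single layer and random input, only to illustrate the call order:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=0)
pad = nn.ReflectionPad1d(5)

x = torch.randn(2, 1, 700)

out_pad_first = conv(pad(x))   # length 700 -> 710 -> 700
out_pad_after = pad(conv(x))   # length 700 -> 690 -> 700

# Both orders give the same shape, but the values differ.
print(out_pad_first.shape, out_pad_after.shape)
print(torch.allclose(out_pad_first, out_pad_after))
```

Because the output shapes are identical, the wrong order does not raise an error; in forward() the call would be written as `self.conv1(self.pad1(x))`.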

I’m not sure the Keras argument padding='same' means reflection padding.
It seems to be zero padding, where the shape of the output stays the same as the input’s shape.
Here are the docs.

Apparently, Keras does not support padding values other than zeros? At least, I cannot find anything.

Thanks for pointing this out. Indeed, the related posts I found say that Keras pads with zeros rather than using reflection padding. So I should go back to self.conv1 = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=5), right?

Yes, this looks right to me. Could you try that and report whether the results are still worse?
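For an odd kernel size k with stride 1, padding=(k - 1) / 2 zeros on each side keeps the length unchanged, which is 5 for k=11. A quick sketch that checks the built-in padding=5 against explicit zero padding with the same weights (random input, for illustration only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=5)
x = torch.randn(2, 1, 700)

out_builtin = conv(x)

# Same weights, but pad 5 zeros on each side by hand and convolve without padding.
x_padded = F.pad(x, (5, 5))  # default mode is constant zero padding
out_manual = F.conv1d(x_padded, conv.weight, conv.bias, stride=1, padding=0)

print(out_builtin.shape)
print(torch.allclose(out_builtin, out_manual, atol=1e-5))
```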


Oh yeah you are correct. My bad!

I somehow thought padding='same' meant replication padding.

Yeah, from the naming it’s very likely. ;)

Due to the randomness of the seeds in both the Keras and PyTorch cases, the results are still slightly different, but they are at least comparable. Thanks again for your help.
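To reduce run-to-run variation on the PyTorch side, the random seeds can be fixed before building the model. A small sketch, where `set_seed` is a hypothetical helper name and not part of PyTorch:

```python
import random

import numpy as np
import torch

def set_seed(seed):
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(0)
a = torch.randn(3)
set_seed(0)
b = torch.randn(3)

print(torch.equal(a, b))
```

This makes repeated PyTorch runs comparable with each other. It does not make Keras and PyTorch draw the same initial weights, since the two frameworks use different generators.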

Keras and PyTorch also have different default kernel initializers.
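As far as I know, Keras Conv1D and Dense layers default to Glorot (Xavier) uniform weights with zero biases, while PyTorch's Conv1d and Linear default to a Kaiming-uniform variant with small uniform biases. A sketch of applying Keras-like initialization in PyTorch; `keras_like_init` and the tiny `net` are hypothetical names used only for illustration:

```python
import math

import torch
import torch.nn as nn

def keras_like_init(module):
    """Glorot-uniform weights and zero biases for conv and linear layers."""
    if isinstance(module, (nn.Conv1d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Tiny stand-in network; the real model would be passed to .apply() the same way.
net = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=11, padding=5),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 20, 16),
)
net.apply(keras_like_init)

# Glorot bound for the conv layer: sqrt(6 / (fan_in + fan_out)).
bound = math.sqrt(6.0 / (1 * 11 + 64 * 11))
print(net[0].weight.abs().max().item() <= bound, net[0].bias.abs().sum().item())
```

With the model from this thread, the call would be `net.apply(keras_like_init)` right after constructing `net` and before creating the optimizer.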

Hi ryan! I have the same problem converting from Keras to PyTorch. May I ask how you fixed the problem in the end? I followed the thread, but it seems you did not change your code in the end. Thanks a lot!