Why do I get much worse results with a PyTorch model than with Keras?

ryan · April 9, 2018, 9:09pm

I’m trying to move from Keras to PyTorch. With the same model and the same optimizer in both frameworks, I always get much worse results in PyTorch. Any idea why I get these strange results? Thanks a lot for your help.

The Keras model I’m using is:

from keras.layers import Input, Conv1D, AveragePooling1D, Flatten, Dense
from keras.models import Model
from keras.optimizers import RMSprop

def cnn_best(input_shape, classes):
    # From VGG16 design
    img_input = Input(shape=input_shape)
    # Block 1
    x = Conv1D(64, 11, activation='relu', padding='same', name='block1_conv1')(img_input)
    x = AveragePooling1D(2, strides=2, name='block1_pool')(x)
    # Block 2
    x = Conv1D(128, 11, activation='relu', padding='same', name='block2_conv1')(x)
    x = AveragePooling1D(2, strides=2, name='block2_pool')(x)
    # Block 3
    x = Conv1D(256, 11, activation='relu', padding='same', name='block3_conv1')(x)
    x = AveragePooling1D(2, strides=2, name='block3_pool')(x)
    # Block 4
    x = Conv1D(512, 11, activation='relu', padding='same', name='block4_conv1')(x)
    x = AveragePooling1D(2, strides=2, name='block4_pool')(x)
    # Block 5
    x = Conv1D(512, 11, activation='relu', padding='same', name='block5_conv1')(x)
    x = AveragePooling1D(2, strides=2, name='block5_pool')(x)
    # Classification block
    x = Flatten(name='flatten')(x)
    x = Dense(4096, activation='relu', name='fc1')(x)
    x = Dense(4096, activation='relu', name='fc2')(x)
    out = Dense(classes, activation='softmax', name='predictions')(x)

    print('        -- model was built.')
    return img_input, out

# Inputs are 700 samples long with one channel; 256 classes (matches the model summary posted later).
inputs, out = cnn_best(input_shape=(700, 1), classes=256)
model = Model(inputs, out, name='cnn_best')
optimizer = RMSprop(lr=1e-5)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

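The PyTorch model is:

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

traceLen = 700  # length of one input trace, matching the Keras input shape (700, 1)

class ascadCNNbest(nn.Module):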

    def __init__(self, num_classes):
        """ Constructor
        Args:
            num_classes: number of classes
        """
        super(ascadCNNbest, self).__init__()

        self.num_classes = num_classes
        self.traceLen = traceLen

        self.conv1 = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=5)
        self.conv2 = nn.Conv1d(64, 128, kernel_size=11, stride=1, padding=5)
        self.conv3 = nn.Conv1d(128, 256, kernel_size=11, stride=1, padding=5)
        self.conv4 = nn.Conv1d(256, 512, kernel_size=11, stride=1, padding=5)
        self.conv5 = nn.Conv1d(512, 512, kernel_size=11, stride=1, padding=5)

        self.fc1 = nn.Linear(10752, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_classes)


    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.avg_pool1d(out, 2)

        out  = F.relu(self.conv2(out))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.conv3(out))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.conv4(out))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.conv5(out))
        out = F.avg_pool1d(out, 2)

        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)

        return out

net = ascadCNNbest(num_classes=256)
criterion = nn.CrossEntropyLoss()
optimizer = optim.RMSprop(net.parameters(), lr=1e-5)

For both models I’m using the same batch_size and the same optimizer (RMSprop with a learning rate of 1e-5). Because the PyTorch results were so much worse, I also tried different learning rates there, but the results stayed similarly bad. Thanks again for any help.
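
One thing worth checking alongside the learning rate: the two libraries do not share RMSprop defaults. Keras uses a decay rate of `rho=0.9`, while `torch.optim.RMSprop` defaults to `alpha=0.99` and `eps=1e-8`. Below is a minimal sketch of passing matching values explicitly. It assumes the Keras side ran with its defaults; `eps=1e-7` is an assumption about the Keras backend epsilon, and `net` is a stand-in module rather than the model above.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in module so the snippet runs on its own (hypothetical, not the thread's model).
net = nn.Linear(4, 2)

# PyTorch defaults are alpha=0.99 and eps=1e-8; Keras defaults to rho=0.9.
# eps=1e-7 is an assumption about the Keras backend epsilon; verify it for your version.
optimizer = optim.RMSprop(net.parameters(), lr=1e-5, alpha=0.9, eps=1e-7)

print(optimizer.defaults["alpha"], optimizer.defaults["eps"])
```

Even with matching hyperparameters, the two update rules may not be numerically identical, so this narrows the gap rather than guaranteeing equal training curves.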

Latope2-150 · April 9, 2018, 9:14pm

The only difference I see between the 2 models is that you use a ReLU after the last fully connected layer in PyTorch instead of a SoftMax in Keras.

SimonW · April 9, 2018, 9:28pm

Padding modes are also different. Can’t see your loss fn and optimizer so I’m not sure if there are differences there.

ryan · April 10, 2018, 4:46am

Thanks a lot. Indeed, I made a mistake there. I’ve already changed it, as you can see in my code: there is now no activation after the last fully connected layer in PyTorch instead of the ReLU. Since I’m using the CrossEntropyLoss loss function, I assume I don’t need a Softmax activation there, right?
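
A quick way to check that assumption: `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, so it should agree with an explicit `log_softmax` followed by the negative log-likelihood loss. The sketch below uses random logits and targets as placeholders for the real `fc3` outputs and labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 256)            # placeholder for raw fc3 outputs, batch of 8
targets = torch.randint(0, 256, (8,))   # integer class indices, not one-hot vectors

# CrossEntropyLoss applies log_softmax internally...
loss_ce = nn.CrossEntropyLoss()(logits, targets)
# ...so it matches log_softmax followed by the negative log-likelihood loss.
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_ce, loss_nll))
```

Note that the targets here are class indices, whereas Keras `categorical_crossentropy` takes one-hot labels, so the label format also has to change when porting.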

ryan · April 10, 2018, 4:51am

Thank you very much. I’ve added the loss function and optimizer for both the PyTorch and Keras cases. Would you please take a look? Regarding the padding mode in PyTorch, how can I get the “same” mode that I used in Keras? I thought “padding=5” did the same thing as in the Keras model. Or am I wrong?

This is what I get for the PyTorch model:

ascadCNNbest (
(conv1): Conv1d (1, 64, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((64, 1, 11), (64,)), parameters=768
(conv2): Conv1d (64, 128, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((128, 64, 11), (128,)), parameters=90240
(conv3): Conv1d (128, 256, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((256, 128, 11), (256,)), parameters=360704
(conv4): Conv1d (256, 512, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((512, 256, 11), (512,)), parameters=1442304
(conv5): Conv1d (512, 512, kernel_size=(11,), stride=(1,), padding=(5,)), weights=((512, 512, 11), (512,)), parameters=2884096
(fc1): Linear(in_features=10752, out_features=4096), weights=((4096, 10752), (4096,)), parameters=44044288
(fc2): Linear(in_features=4096, out_features=4096), weights=((4096, 4096), (4096,)), parameters=16781312
(fc3): Linear(in_features=4096, out_features=256), weights=((256, 4096), (256,)), parameters=1048832
)

The Keras summary is below:

Layer (type)                   Output Shape        Param #
input_1 (InputLayer)           (None, 700, 1)      0
block1_conv1 (Conv1D)          (None, 700, 64)     768
block1_pool (AveragePooling1   (None, 350, 64)     0
block2_conv1 (Conv1D)          (None, 350, 128)    90240
block2_pool (AveragePooling1   (None, 175, 128)    0
block3_conv1 (Conv1D)          (None, 175, 256)    360704
block3_pool (AveragePooling1   (None, 87, 256)     0
block4_conv1 (Conv1D)          (None, 87, 512)     1442304
block4_pool (AveragePooling1   (None, 43, 512)     0
block5_conv1 (Conv1D)          (None, 43, 512)     2884096
block5_pool (AveragePooling1   (None, 21, 512)     0
flatten (Flatten)              (None, 10752)       0
fc1 (Dense)                    (None, 4096)        44044288
fc2 (Dense)                    (None, 4096)        16781312
predictions (Dense)            (None, 256)         1048832
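
The shapes and parameter counts in both printouts can be reproduced by hand, which is a cheap way to confirm that the two architectures line up. The helper names below are made up for this sketch; the arithmetic assumes stride-1 convolutions that preserve length and pooling with window 2 and stride 2.

```python
def pool_len(length, pool=2, stride=2):
    # Average pooling without padding floors the result.
    return (length - pool) // stride + 1

def conv_params(c_in, c_out, k=11):
    # One k-wide kernel per (input channel, output channel) pair, plus one bias per output.
    return c_in * c_out * k + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

length = 700
channels = [1, 64, 128, 256, 512, 512]
conv_counts = []
for c_in, c_out in zip(channels[:-1], channels[1:]):
    conv_counts.append(conv_params(c_in, c_out))  # the convolution keeps the length
    length = pool_len(length)                     # the pooling roughly halves it

flat = length * channels[-1]                      # input size of fc1
print(length, flat, conv_counts, dense_params(flat, 4096))
```

The flattened size of 21 * 512 = 10752 is where `nn.Linear(10752, 4096)` comes from, and the per-layer counts match the numbers in both summaries.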

SimonW · April 10, 2018, 5:36am

torch.nn.ReplicationPad1d is what you should use.

ryan · April 10, 2018, 6:14am

Thanks a lot, dude. You mean something like this:

class ascadCNNbest(nn.Module):

    def __init__(self, num_classes):
        """ Constructor
        Args:
            num_classes: number of classes
        """
        super(ascadCNNbest, self).__init__()

        self.num_classes = num_classes
        self.traceLen = traceLen

        self.conv1 = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=0)
        self.pad1 = nn.ReflectionPad1d(5)
        self.conv2 = nn.Conv1d(64, 128, kernel_size=11, stride=1, padding=0)
        self.pad2 = nn.ReflectionPad1d(5)
        self.conv3 = nn.Conv1d(128, 256, kernel_size=11, stride=1, padding=0)
        self.pad3 = nn.ReflectionPad1d(5)
        self.conv4 = nn.Conv1d(256, 512, kernel_size=11, stride=1, padding=0)
        self.pad4 = nn.ReflectionPad1d(5)
        self.conv5 = nn.Conv1d(512, 512, kernel_size=11, stride=1, padding=0)
        self.pad5 = nn.ReflectionPad1d(5)

        self.fc1 = nn.Linear(10752, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_classes)


    def forward(self, x):
        out = F.relu(self.pad1(self.conv1(x)))
        out = F.avg_pool1d(out, 2)

        out  = F.relu(self.pad2(self.conv2(out)))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.pad3(self.conv3(out)))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.pad4(self.conv4(out)))
        out = F.avg_pool1d(out, 2)

        out = F.relu(self.pad5(self.conv5(out)))
        out = F.avg_pool1d(out, 2)

        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)

        return out

SimonW · April 10, 2018, 2:39pm

Yes that is correct

ryan · April 10, 2018, 2:56pm

Thank you very much. In this case the padding happens after the convolution. Is that a problem? (I think Keras pads before the convolution.) How can I do the padding before the convolution?
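
On the ordering question alone: padding before the convolution just means calling the padding layer on the input and passing the result to the convolution. A minimal sketch with one layer, using `nn.ReplicationPad1d` as suggested earlier in the thread; whether replication is the right padding mode at all is a separate question.

```python
import torch
import torch.nn as nn

pad = nn.ReplicationPad1d(5)
conv = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=0)

x = torch.randn(2, 1, 700)   # (batch, channels, length)
out = conv(pad(x))           # pad the input first: 700 -> 710, then conv: 710 -> 700
print(out.shape)
```

In a `forward` method this would read `F.relu(self.conv1(self.pad1(x)))`, with the pad call inside the conv call rather than around it.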

ptrblck · April 10, 2018, 2:56pm

I’m not sure that the Keras argument padding='same' means reflection padding.
It seems to be zero padding, sized so that the output’s shape is the same as the input’s shape.
Here are the docs.

Apparently Keras does not support padding values other than zero? At least, I cannot find anything.

ryan · April 10, 2018, 3:04pm

Thanks for pointing this out. Indeed, the related posts I found say that Keras pads with zeros and does not do reflection padding. So I should go back to self.conv1 = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=5), right?
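
To see why `padding=5` plays the role of `'same'` here: with kernel size 11 and stride 1, adding (11 - 1) / 2 = 5 zeros on each side keeps the length at 700. The sketch below checks that the built-in `padding=5` agrees with padding zeros by hand and then convolving without padding; the input is random placeholder data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=5)
x = torch.randn(2, 1, 700)   # (batch, channels, length)

out_builtin = conv(x)

# Pad 5 zeros on each side by hand, then convolve with no padding.
x_padded = F.pad(x, (5, 5))  # default mode is constant padding with value 0
out_manual = F.conv1d(x_padded, conv.weight, conv.bias, stride=1, padding=0)

print(out_builtin.shape, torch.allclose(out_builtin, out_manual, atol=1e-5))
```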

ptrblck · April 10, 2018, 3:06pm

Yes, this looks right to me. Could you try that and report whether the results are still worse?

SimonW · April 10, 2018, 3:24pm

Oh yeah you are correct. My bad!

SimonW · April 10, 2018, 3:25pm

I somehow thought padding='same' meant replication padding.

ptrblck · April 10, 2018, 3:38pm

Yeah, from the naming it’s very likely.

ryan · April 10, 2018, 6:00pm

Because of the random seeds in both the Keras and PyTorch cases, the results are still slightly different, but they are at least comparable now. Thanks again for your help.
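
For the PyTorch side, run-to-run variation from initialization can be reduced by fixing the seed before building the model. A minimal sketch on CPU; the helper `make_layer` is hypothetical, and fixing seeds only makes each framework repeatable on its own, it does not make Keras and PyTorch draw the same numbers.

```python
import random
import torch
import torch.nn as nn

def make_layer(seed):
    # Seed Python's and PyTorch's generators before any weights are created.
    random.seed(seed)
    torch.manual_seed(seed)
    return nn.Linear(4096, 256)

a = make_layer(0)
b = make_layer(0)
print(torch.equal(a.weight, b.weight))
```

Data shuffling and GPU kernels can add further nondeterminism that this sketch does not address.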

jiqiujia · April 13, 2018, 11:21am

They have different default kernel initializers.
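
Keras initializes `Conv1D` and `Dense` kernels with `glorot_uniform` and biases with zeros by default, while PyTorch layers use their own default scheme. A rough sketch of re-initializing PyTorch layers the Keras way, assuming the Keras model kept its defaults; `keras_like_init` is a name made up for this example and the two layers are stand-ins for the full network.

```python
import math
import torch.nn as nn

def keras_like_init(module):
    # Keras Conv1D/Dense defaults: glorot_uniform kernel, zero bias.
    if isinstance(module, (nn.Conv1d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Stand-in layers with the same shapes as conv1 and fc3 in the thread's model.
conv = nn.Conv1d(1, 64, kernel_size=11, stride=1, padding=5)
fc = nn.Linear(4096, 256)
layers = nn.ModuleList([conv, fc])
layers.apply(keras_like_init)   # apply() visits every submodule

# Glorot bound for conv: sqrt(6 / (fan_in + fan_out)) with fan_in=1*11, fan_out=64*11.
bound = math.sqrt(6.0 / (11 + 64 * 11))
print(float(conv.weight.abs().max()), bound, float(conv.bias.abs().sum()))
```

With the real model, the same call would be `net.apply(keras_like_init)` right after constructing it.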

chen1177 · March 24, 2020, 9:39am

Hi ryan! I have the same problem converting from Keras to PyTorch. May I ask how you fixed the problem in the end? I followed the thread, but it seems you did not change your code in the end. Thanks a lot!