Keras model performs vastly better than the PyTorch model

I am trying to train a 1-D ConvNet for time series classification as described in this paper (see the FCN in Fig. 1b).

The Keras implementation is giving me vastly superior performance. Could someone explain why that is the case?

The PyTorch code is as follows:

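import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # x_train has shape (Batch_Size, Channels, Length)
        self.conv1 = nn.Conv1d(x_train.shape[1], 128, 8)
        self.bnorm1 = nn.BatchNorm1d(128)
        self.conv2 = nn.Conv1d(128, 256, 5)
        self.bnorm2 = nn.BatchNorm1d(256)
        self.conv3 = nn.Conv1d(256, 128, 3)
        self.bnorm3 = nn.BatchNorm1d(128)
        self.dense = nn.Linear(128, nb_classes)

    def forward(self, x):
        c1 = F.relu(self.conv1(x))
        b1 = F.relu(self.bnorm1(c1))
        c2 = F.relu(self.conv2(b1))
        b2 = F.relu(self.bnorm2(c2))
        c3 = F.relu(self.conv3(b2))
        b3 = F.relu(self.bnorm3(c3))
        # average over the length dimension
        output = torch.mean(b3, 2)
        dense1 = self.dense(output)
        return F.softmax(dense1, dim=1)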

model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.99)
for t in range(1000):
    # full-batch training: the whole training set is one batch
    y_pred_1 = model(x_train.float())
    loss_1 = criterion(y_pred_1, y_train.long())
    print(t, loss_1.item())
    optimizer.zero_grad()
    loss_1.backward()
    optimizer.step()

For comparison, I use the following code for Keras:

x = keras.layers.Input(x_train.shape[1:])
conv1 = keras.layers.Conv1D(128, 8, padding='valid')(x)
conv1 = keras.layers.BatchNormalization()(conv1)
conv1 = keras.layers.Activation('relu')(conv1)
conv2 = keras.layers.Conv1D(256, 5, padding='valid')(conv1)
conv2 = keras.layers.BatchNormalization()(conv2)
conv2 = keras.layers.Activation('relu')(conv2)
conv3 = keras.layers.Conv1D(128, 3, padding='valid')(conv2)
conv3 = keras.layers.BatchNormalization()(conv3)
conv3 = keras.layers.Activation('relu')(conv3)
full = keras.layers.GlobalAveragePooling1D()(conv3)
out = keras.layers.Dense(nb_classes, activation='softmax')(full)

model = keras.models.Model(inputs=x, outputs=out) 
optimizer = keras.optimizers.SGD(lr=0.5, decay=0.0, momentum=0.99)
model.compile(loss='categorical_crossentropy', optimizer=optimizer) 
hist = model.fit(x_train, Y_train, batch_size=x_train.shape[0], nb_epoch=2000)

The only difference I see between the two is the initialization, yet the results are vastly different. For reference, I use the same preprocessing for both models, with a subtle difference in input shapes: (Batch_Size, Channels, Length) for PyTorch and (Batch_Size, Length, Channels) for Keras.
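
The layout difference between the two frameworks can be sketched as follows. This is a minimal example, assuming a hypothetical batch of 4 series with 100 time steps and 3 channels stored in the Keras layout; x_keras and x_torch are illustrative names, not from the original code.

```python
import torch

# Hypothetical data in the Keras layout: (Batch_Size, Length, Channels).
x_keras = torch.randn(4, 100, 3)

# nn.Conv1d expects (Batch_Size, Channels, Length), so swap the last two axes.
# permute reorders the axes; a plain reshape would scramble the values instead.
x_torch = x_keras.permute(0, 2, 1).contiguous()

print(x_keras.shape)  # torch.Size([4, 100, 3])
print(x_torch.shape)  # torch.Size([4, 3, 100])
```

After the permute, x_torch[n, c, t] holds the same value as x_keras[n, t, c].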

Here are a few differences skimming through the code:

  • Keras uses Conv1D-BN-ReLU, while your PyTorch model uses Conv1d-ReLU-BN1d-ReLU
  • I’m not sure what GlobalAveragePooling1D is exactly doing, but I assume you made sure torch.mean(..., 2) is doing the same
  • nn.CrossEntropyLoss expects raw logits, so you should remove the F.softmax call

Could you fix these issues and try it again? :)
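
On the pooling point: with the default channels-last layout, Keras' GlobalAveragePooling1D averages over the time-step axis, which is dimension 2 in the PyTorch layout (Batch_Size, Channels, Length). A small check, assuming hypothetical activations of shape (4, 128, 50), that torch.mean(..., 2) matches PyTorch's own global average pooling:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical activations after the last conv block: (Batch_Size, Channels, Length).
b3 = torch.randn(4, 128, 50)

# Average over the length dimension, as in the question.
pooled_mean = torch.mean(b3, 2)

# The same operation expressed as adaptive average pooling to length 1.
pooled_gap = F.adaptive_avg_pool1d(b3, 1).squeeze(-1)

print(pooled_mean.shape)  # torch.Size([4, 128])
print(torch.allclose(pooled_mean, pooled_gap, atol=1e-6))
```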

It makes no sense to use Conv1d-ReLU-BN1d-ReLU.
Did you mean to say Conv1d-ReLU-BN1d?

No, I’m pointing out the incorrect implementation in the user's PyTorch model. ;)

b1 = F.relu(self.bnorm1(c1))

Aha, but there is not much need to use

b1 = F.relu(self.bnorm1(c1))

in my understanding. If the paper claims so, I would say this makes no sense. :)

Maybe there is a misunderstanding, but I’m trying to say the same.
The PyTorch model should be fixed and adapted to the Keras one, which uses Conv-BN-ReLU.
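
A minimal sketch of what the adapted PyTorch model could look like, with Conv-BN-ReLU ordering to match the Keras code and raw logits returned for nn.CrossEntropyLoss. The class name FCN, the constructor arguments, and the sizes used below (3 channels, 5 classes, length 100) are assumptions for illustration, not the original poster's final code.

```python
import torch
import torch.nn as nn

class FCN(nn.Module):
    def __init__(self, in_channels, nb_classes):
        super().__init__()
        # Three Conv-BN-ReLU blocks, mirroring the Keras model.
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 128, 8), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 5), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 128, 3), nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.dense = nn.Linear(128, nb_classes)

    def forward(self, x):
        h = self.features(x)   # (Batch_Size, 128, Length')
        h = h.mean(dim=2)      # global average pooling over length
        return self.dense(h)   # raw logits, no softmax

# Hypothetical sizes: 3 input channels, 5 classes, series of length 100.
model = FCN(in_channels=3, nb_classes=5)
logits = model(torch.randn(4, 3, 100))
print(logits.shape)  # torch.Size([4, 5])
```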

No problemo, the Keras model uses Conv-BN-ReLU

conv1 = keras.layers.Conv1D(128, 8, padding='valid')(x)
conv1 = keras.layers.BatchNormalization()(conv1)
conv1 = keras.layers.Activation('relu')(conv1)

which is so-so, and I would use what you suggested, Conv-ReLU-BN, for all models. This is a good suggestion.

Hi guys! Thank you for pointing out the mistakes. I corrected them and the model seems to actually learn something now, but I don’t get why CrossEntropyLoss requires raw logits. Why should I not have a softmax there? As a side note, when I changed it to “Log softmax” the model started performing much better.

If you look into CrossEntropyLoss, you will find that it combines LogSoftmax and NLLLoss internally.
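
This can be checked numerically. The sketch below uses hypothetical logits and targets (4 samples, 5 classes). It also illustrates why LogSoftmax outputs happen to work: applying log-softmax to values that are already log-probabilities leaves them unchanged, whereas feeding softmax probabilities applies softmax twice and changes the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)            # hypothetical raw outputs of the last dense layer
targets = torch.tensor([0, 2, 1, 4])  # hypothetical class indices

criterion = nn.CrossEntropyLoss()

# CrossEntropyLoss on raw logits equals NLLLoss on log-softmax outputs.
ce = criterion(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))

# Log-softmax outputs give the same loss, since log_softmax is idempotent.
ce_on_logp = criterion(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, ce_on_logp, atol=1e-6))

# Softmax outputs do not: softmax is applied a second time inside the loss.
ce_on_probs = criterion(F.softmax(logits, dim=1), targets)
print(ce.item(), ce_on_probs.item())
```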

But now, do I give it LogSoftmax or just the output of the last dense layer without any activation? What would make sense for PyTorch’s CrossEntropyLoss?

Just give it the last dense layer output, without any activation. That output serves as the logits, with nb_classes values per sample. What is your nb_classes value, and what does it correspond to? You may need to use argmax to get the predicted class.
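
A short sketch of the argmax step, assuming hypothetical logits for two samples and nb_classes = 3. Softmax is only needed if you want probabilities; it does not change which class has the largest score.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw outputs of the last dense layer for nb_classes = 3.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

# Predicted class index per sample, taken directly from the logits.
pred_classes = logits.argmax(dim=1)

# Optional: convert to probabilities for reporting.
probs = F.softmax(logits, dim=1)

print(pred_classes)         # tensor([0, 2])
print(probs.argmax(dim=1))  # tensor([0, 2])
```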