Keras model is performing vastly better than the PyTorch model

I am trying to train a 1-D ConvNet for time series classification, as shown in this paper (refer to FCN in Fig. 1b): https://arxiv.org/pdf/1611.06455.pdf

The Keras implementation is giving me vastly superior performance. Could someone explain why that is the case?

The code for PyTorch is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv1d(x_train.shape[1], 128, 8)
        self.bnorm1 = nn.BatchNorm1d(128)
        self.conv2 = nn.Conv1d(128, 256, 5)
        self.bnorm2 = nn.BatchNorm1d(256)
        self.conv3 = nn.Conv1d(256, 128, 3)
        self.bnorm3 = nn.BatchNorm1d(128)
        self.dense = nn.Linear(128, nb_classes)

    def forward(self, x):
        c1 = F.relu(self.conv1(x))
        b1 = F.relu(self.bnorm1(c1))
        c2 = F.relu(self.conv2(b1))
        b2 = F.relu(self.bnorm2(c2))
        c3 = F.relu(self.conv3(b2))
        b3 = F.relu(self.bnorm3(c3))
        output = torch.mean(b3, 2)
        dense1 = self.dense(output)
        return F.softmax(dense1)


model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.99)
losses = []
for t in range(1000):
    y_pred_1 = model(x_train.float())
    loss_1 = criterion(y_pred_1, y_train.long())
    print(t, loss_1.item())
    optimizer.zero_grad()
    loss_1.backward()
    optimizer.step()

For comparison, I use the following code for Keras:

import keras

x = keras.layers.Input(x_train.shape[1:])
conv1 = keras.layers.Conv1D(128, 8, padding='valid')(x)
conv1 = keras.layers.BatchNormalization()(conv1)
conv1 = keras.layers.Activation('relu')(conv1)
conv2 = keras.layers.Conv1D(256, 5, padding='valid')(conv1)
conv2 = keras.layers.BatchNormalization()(conv2)
conv2 = keras.layers.Activation('relu')(conv2)
conv3 = keras.layers.Conv1D(128, 3, padding='valid')(conv2)
conv3 = keras.layers.BatchNormalization()(conv3)
conv3 = keras.layers.Activation('relu')(conv3)
full = keras.layers.GlobalAveragePooling1D()(conv3)
out = keras.layers.Dense(nb_classes, activation='softmax')(full)

model = keras.models.Model(inputs=x, outputs=out) 
optimizer = keras.optimizers.SGD(lr=0.5, decay=0.0, momentum=0.99)
model.compile(loss='categorical_crossentropy', optimizer=optimizer) 
hist = model.fit(x_train, Y_train, batch_size=x_train.shape[0], nb_epoch=2000)      

The only difference I see between the two is the initialization, but the results are just vastly different. For reference, I use the same preprocessing for both, with one subtle difference in input shapes: PyTorch expects (Batch_Size, Channels, Length), while Keras expects (Batch_Size, Length, Channels).
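As an aside, converting between the two layouts is just a swap of the last two axes; a minimal sketch (assuming x_keras is a NumPy array in the Keras layout):

import torch

# Keras layout: (batch, length, channels); PyTorch's Conv1d expects
# (batch, channels, length), so permute the last two dimensions.
x_torch = torch.from_numpy(x_keras).permute(0, 2, 1).contiguous()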

Here are a few differences I noticed while skimming through the code:

  • Keras uses Conv1D-BN-ReLU, while your PyTorch model uses Conv1d-ReLU-BN1d-ReLU
  • I’m not sure what GlobalAveragePooling1D is doing exactly, but I assume you made sure torch.mean(..., 2) does the same (both should average over the temporal dimension)
  • nn.CrossEntropyLoss expects raw logits, so you should remove the F.softmax call

Could you fix these issues and try it again? :slight_smile:
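Putting those points together, here is a minimal sketch of the corrected model (keeping your layer sizes, and assuming the Conv-BN-ReLU order of the Keras version; in_channels and nb_classes come from your data):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, in_channels, nb_classes):
        super(Net, self).__init__()
        self.conv1 = nn.Conv1d(in_channels, 128, 8)
        self.bnorm1 = nn.BatchNorm1d(128)
        self.conv2 = nn.Conv1d(128, 256, 5)
        self.bnorm2 = nn.BatchNorm1d(256)
        self.conv3 = nn.Conv1d(256, 128, 3)
        self.bnorm3 = nn.BatchNorm1d(128)
        self.dense = nn.Linear(128, nb_classes)

    def forward(self, x):
        x = F.relu(self.bnorm1(self.conv1(x)))  # Conv -> BN -> ReLU, one ReLU per block
        x = F.relu(self.bnorm2(self.conv2(x)))
        x = F.relu(self.bnorm3(self.conv3(x)))
        x = torch.mean(x, 2)   # global average pooling over the temporal dimension
        return self.dense(x)   # raw logits; nn.CrossEntropyLoss applies log-softmax itself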

It makes no sense to use Conv1d-ReLU-BN1d-ReLU.
Did you mean to say Conv1d-ReLU-BN1d?

No, I’m pointing out the incorrect implementation in the user’s PyTorch model. :wink:

c1=F.relu(self.conv1(x))
b1 = F.relu(self.bnorm1(c1))

Aha, but there is not much need to use

b1 = F.relu(self.bnorm1(c1))

in my understanding. If the paper claims so, I would say it makes no sense. :slight_smile:

Maybe there is a misunderstanding, but I’m trying to say the same thing.
The PyTorch model should be fixed and adapted to the Keras one, which uses Conv-BN-ReLU.

No problemo. The Keras model uses Conv-BN-ReLU:

conv1 = keras.layers.Conv1D(128, 8, padding='valid')(x)
conv1 = keras.layers.BatchNormalization()(conv1)
conv1 = keras.layers.Activation('relu')(conv1)

which is so-so; I would instead use Conv-ReLU-BN, as you suggested, for all models. That is a good suggestion.

Hi guys! Thank you for pointing out the mistakes. I corrected them, and the model seems to actually learn something now. But I don’t get why CrossEntropyLoss requires raw logits. Why should I not have a softmax there? As a side note, when I changed it to log softmax, the model started performing much better.

If you look into CrossEntropyLoss, you will find that it actually combines LogSoftmax and NLLLoss in one single class.
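A small sketch illustrating that equivalence (the values are random; only the equality matters):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(8, 5)              # 8 samples, 5 classes, raw scores
target = torch.randint(0, 5, (8,))      # integer class labels

ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
print(torch.allclose(ce, nll))          # True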

But now, do I give it log softmax, or just the last dense layer output without any activation? What would make sense for PyTorch’s CrossEntropyLoss?

Just give it the last dense layer output, without any activation; CrossEntropyLoss handles the rest. It will predict scores for nb_classes classes. What is your nb_classes value, and what does it correspond to? You may need to use argmax to grab the predicted class.
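For example, a sketch of getting predicted classes at evaluation time (assuming the model and x_train from your code above):

with torch.no_grad():
    logits = model(x_train.float())      # shape: (batch, nb_classes), raw scores
    preds = torch.argmax(logits, dim=1)  # integer class index per sample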