Different results for same implementation in Pytorch and Keras

Hello everyone! I’m trying to implement my Keras/TF model in PyTorch. The model is a simple Conv3D layer, and both versions run on CPU. What am I doing wrong? The losses at every stage (training, validation, test) are completely different between PyTorch and Keras.

[Keras]

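import h5py
from keras.models import Sequential  # or tensorflow.keras, depending on the install
from keras.layers import Conv3D

epochs = 10
used_samples = 100
batch_size = 10
validation_split = 0.2

with h5py.File(DATASET_FILE, 'r') as f: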
    real_tese = f['real'][...]
    pred_tese = f['pred'][...]

pred = pred_tese[:used_samples]
real = real_tese[:used_samples]

if (validation_split):
    split = int(pred.shape[0] * (1. - validation_split))

    X_val = pred[split:]
    y_val = real[split:]
    X_train = pred[:split]
    y_train = real[:split]

X_test = pred_tese[used_samples:-1]
y_test = real_tese[used_samples:-1]

seq = Sequential()
seq.add(Conv3D(filters=1, kernel_size=(3,3,3), padding='same',
       data_format='channels_last'))

seq.compile(loss='mae', optimizer='rmsprop')
seq.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, 
       validation_data=(X_val, y_val), shuffle=False)
scores = seq.evaluate(X_test, y_test)

[PyTorch]

import h5py
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

epochs = 10
used_samples = 100
batch_size=10
validation_split=0.2

class H5Dataset(Dataset):

    def __init__(self, file_path, samples, validation_split=0, isValidation=False, isTest=False):
        super(H5Dataset, self).__init__()
        # Read both arrays fully into memory, then close the file
        with h5py.File(file_path, 'r') as h5_file:
            pred = h5_file['pred'][...]
            real = h5_file['real'][...]
        numpy_pred = pred[:samples]
        numpy_real = real[:samples]
        
        if (validation_split):
            split = int(numpy_pred.shape[0] * (1. - validation_split))
        
            if(isValidation):
                numpy_pred = numpy_pred[split:]
                numpy_real = numpy_real[split:]
            else:
                numpy_pred = numpy_pred[:split]
                numpy_real = numpy_real[:split]
        
        if (isTest):
            numpy_pred = pred[samples:-1]
            numpy_real = real[samples:-1]
        
        self.X = torch.from_numpy(numpy_pred).float().permute(0, 4, 1, 2, 3)
        self.y = torch.from_numpy(numpy_real).float().permute(0, 4, 1, 2, 3)
        del pred; del real; del numpy_pred; del numpy_real

    def __getitem__(self, index):
        return (self.X[index,:,:,:,:], self.y[index,:,:,:,:])

    def __len__(self):
        return self.X.shape[0]
		
(...)

params = {'shuffle': False, 'num_workers': 1}

train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, **params)
val_loader = DataLoader(dataset=val_dataset, batch_size=batch_size, **params)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, **params)

class Conv(nn.Module):
    def __init__(self):
        super(Conv, self).__init__()
        self.conv = nn.Conv3d(in_channels=5, out_channels=1, kernel_size=(3,3,3), 
            padding=(1,1,1))
        
    def forward(self, x):
        out = self.conv(x)
        return out
		
model = Conv().to(device)
criterion = nn.L1Loss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9, eps=1e-6)

def train(model, loss_fn, dataloader, device):
    model.train()
    epoch_loss = 0.0
    for i, (feature, target) in enumerate(dataloader):
        feature, target = feature.to(device), target.to(device) 
        optimizer.zero_grad()
        output = model(feature)
        loss = loss_fn(output, target)
        epoch_loss += loss.item()       
        loss.backward()
        optimizer.step()
    return epoch_loss/len(dataloader)
	
def evaluate(model, loss_fn, dataloader, device):
    model.eval()
    epoch_loss = 0.0
    with torch.no_grad(): 
        for feature, target in dataloader:
            feature, target = feature.to(device), target.to(device)
            output = model(feature)
            loss = loss_fn(output, target)
            epoch_loss += loss.item()
    return epoch_loss/len(dataloader)
	
(...)
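
One detail in the loops above that can matter when comparing against Keras: they divide the summed batch losses by the number of batches, while Keras, as far as I know, reports a mean weighted by batch size. The two agree when every batch has the same size (80 or 20 samples with batch_size=10), but they can drift apart when the last batch is smaller, for example on the test set. A minimal sketch of a sample-weighted variant, assuming the same (feature, target) loader interface; `evaluate_weighted` is a hypothetical name, not part of the original code:

```python
import torch
import torch.nn as nn

def evaluate_weighted(model, loss_fn, dataloader, device):
    """Mean loss per sample rather than per batch (hypothetical helper)."""
    model.eval()
    total_loss, total_samples = 0.0, 0
    with torch.no_grad():
        for feature, target in dataloader:
            feature, target = feature.to(device), target.to(device)
            loss = loss_fn(model(feature), target)
            # loss_fn returns the batch mean, so scale it back up by the batch size
            total_loss += loss.item() * feature.size(0)
            total_samples += feature.size(0)
    return total_loss / total_samples

# Toy check with two unequal batches (made-up data): batch means are 1.0 and 5.0
batches = [(torch.zeros(3, 1), torch.ones(3, 1)),
           (torch.zeros(1, 1), torch.full((1, 1), 5.0))]
weighted = evaluate_weighted(nn.Identity(), nn.L1Loss(), batches, "cpu")
print(weighted)  # per-sample mean: (3 * 1.0 + 1 * 5.0) / 4
```

Averaging the two batch means directly would give 3.0 here instead, which is the kind of gap an uneven last batch can introduce.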

Results example:

[Keras]

Train on 80 samples, validate on 20 samples
Epoch 1/10 - loss: 32.1959 - val_loss: 9.7006
Epoch 2/10 - loss: 4.9858 - val_loss: 3.6706
Epoch 3/10 - loss: 3.8655 - val_loss: 3.6865
(...)
Epoch 8/10 - loss: 3.8235 - val_loss: 3.5489
Epoch 9/10 - loss: 3.7683 - val_loss: 3.4872
Epoch 10/10 - loss: 3.7115 - val_loss: 3.4237

Test_loss: 3.65

[PyTorch]

Train on 80 samples, validate on 20 samples
Epoch: 1/10 - loss: 16.0479 - val_loss: 3.7342
Epoch: 2/10 - loss: 3.7276 - val_loss: 4.0231
Epoch: 3/10 - loss: 3.8023 - val_loss: 4.0526
(...)
Epoch: 8/10 - loss: 3.5709 - val_loss: 3.8442
Epoch: 9/10 - loss: 3.5239 - val_loss: 3.7991
Epoch: 10/10 - loss: 3.4762 - val_loss: 3.7512

Test_loss: 3.39

I know the PyTorch data format is “channel first” and Keras is “channel last”, and that’s the reason I used permute().
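
The layout change done by permute() can be checked on a small dummy array. A minimal sketch with made-up dimensions (2 samples, depth 3, 4x4 frames; only the 5 channels come from the model above), not the actual dataset:

```python
import numpy as np
import torch

# Keras "channels_last" layout: (batch, depth, height, width, channels)
x_keras = np.arange(2 * 3 * 4 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 4, 5)

# PyTorch Conv3d expects (batch, channels, depth, height, width)
x_torch = torch.from_numpy(x_keras).permute(0, 4, 1, 2, 3)

print(x_torch.shape)  # torch.Size([2, 5, 3, 4, 4])

# The same element is addressed with the channel index moved to position 1
assert x_torch[1, 2, 0, 3, 1].item() == x_keras[1, 0, 3, 1, 2]
```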

Thanks for the help!

How many times did you run this?

Skimming through your examples, the only difference I could see is the parameter initialization. Keras seems to use glorot/xavier_uniform for the weights and zeros for the bias, while PyTorch uses kaiming_uniform for the weights and a different uniform init for the bias.

Could you try to initialize both the same way and run your code again?
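
A minimal sketch of what matching the Keras defaults on the PyTorch side could look like, assuming the Conv3d layer from the first post. `nn.init.xavier_uniform_` and `nn.init.zeros_` are existing PyTorch functions; note that this only matches the distribution, not the actual random values drawn by Keras:

```python
import math
import torch
import torch.nn as nn

conv = nn.Conv3d(in_channels=5, out_channels=1, kernel_size=(3, 3, 3), padding=(1, 1, 1))

# Keras Conv3D defaults: glorot_uniform kernel, zero bias
nn.init.xavier_uniform_(conv.weight)
nn.init.zeros_(conv.bias)

# Glorot/Xavier uniform samples from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))
receptive_field = 3 * 3 * 3
fan_in, fan_out = 5 * receptive_field, 1 * receptive_field
limit = math.sqrt(6.0 / (fan_in + fan_out))

# Small tolerance for float32 rounding at the interval boundary
assert conv.weight.abs().max().item() <= limit + 1e-6
```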

A couple of times, and I got the same results each time. The results are reproducible in both PyTorch and Keras. I forgot to include this part of the code:

[PyTorch]

seed = 0
torch.manual_seed(seed)
np.random.seed(seed)
rd.seed(seed)

Ohh I didn’t know that. I will try to initialize both the same way, as you said, and I will post the results here.
Thanks.

I have changed the initialization of Pytorch and Keras to “zeros” for weights and bias and it worked. Thank you very much for your help @ptrblck !!!

[Keras]

seq.add(Conv3D(filters=1, kernel_size=(3,3,3), padding='same',
               data_format='channels_last',
               kernel_initializer='zeros',
               bias_initializer='zeros'))

[PyTorch]

        self.conv = nn.Conv3d(in_channels=5, out_channels=1, kernel_size=(3,3,3),
                              padding=(1,1,1))
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

Results

[PyTorch] and [Keras]

Train on 80 samples, validate on 20 samples
Epoch: 1/10 - loss: 8.7088 - val_loss: 4.0080
Epoch: 2/10 - loss: 3.6342 - val_loss: 3.9557
Epoch: 3/10 - loss: 3.5624 - val_loss: 3.8861
(...)
Epoch: 8/10 - loss: 3.2063 - val_loss: 3.5333
Epoch: 9/10 - loss: 3.1367 - val_loss: 3.4623
Epoch: 10/10 - loss: 3.0679 - val_loss: 3.3903

Test_loss: 3.08

Just for the record, I first tried glorot/xavier_uniform for the weights, but the PyTorch and Keras implementations seem to differ slightly: in Keras you can set a seed, and if you don’t, the default is seed = np.random.randint(10e6). With this initialization method, PyTorch and Keras gave different results, which is why I switched to zeros.

Good to hear it’s working!
It was a good idea to compare both models with a constant initialization scheme, as both frameworks will most likely return different “random” values even for the same seed.
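
If a random initialization has to be compared across frameworks, one option is to draw the weights once (e.g. with NumPy) and load the same values into both models. A minimal sketch of the PyTorch side, assuming the usual kernel layouts: Keras Conv3D stores its kernel as (kd, kh, kw, in_channels, filters), PyTorch Conv3d as (out_channels, in_channels, kd, kh, kw). The `keras_kernel` array here is a stand-in generated with NumPy, not weights read from a real Keras model:

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
# Stand-in for a Keras Conv3D kernel: (kd, kh, kw, in_channels, filters)
keras_kernel = rng.uniform(-0.1, 0.1, size=(3, 3, 3, 5, 1)).astype(np.float32)
keras_bias = np.zeros(1, dtype=np.float32)

conv = nn.Conv3d(in_channels=5, out_channels=1, kernel_size=(3, 3, 3), padding=(1, 1, 1))

with torch.no_grad():
    # Reorder axes to PyTorch's (out_channels, in_channels, kd, kh, kw)
    conv.weight.copy_(torch.from_numpy(keras_kernel.transpose(4, 3, 0, 1, 2)))
    conv.bias.copy_(torch.from_numpy(keras_bias))

print(conv.weight.shape)  # torch.Size([1, 5, 3, 3, 3])
```

On the Keras side, the same arrays could presumably be loaded with the layer's set_weights([keras_kernel, keras_bias]) once the layer has been built.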

@rafaela00castro, @ptrblck …thank you for the discussion, very helpful.

I am trying to convert a Keras model to PyTorch, but even after setting the initializers (weights and bias to 0) as suggested here, the loss remains slightly different between the 2 frameworks. I turned off all callbacks (LearningRate scheduler) in the Keras implementation. Also, each batch (size=8) is made of the same image repeated 8 times, so that at every step the Keras and PyTorch models see the same example (this prevents differences due to data shuffling). Here are the results:

PyTorch

Epoch 1/10: loss=0.589
Epoch 2/10: loss=0.589
Epoch 3/10: loss=0.589
Epoch 4/10: loss=0.589
Epoch 5/10: loss=0.589
Epoch 6/10: loss=0.589
Epoch 7/10: loss=0.589
Epoch 8/10: loss=0.588
Epoch 9/10: loss=0.588
Epoch 10/10: loss=0.588

Keras

Epoch 1/10: loss=0.5892
Epoch 2/10: loss=0.5892
Epoch 3/10: loss=0.5892
Epoch 4/10: loss=0.5892
Epoch 5/10: loss=0.5892
Epoch 6/10: loss=0.5892
Epoch 7/10: loss=0.5891
Epoch 8/10: loss=0.5891
Epoch 9/10: loss=0.5891
Epoch 10/10: loss=0.5891

The 2 implementations start at about the same loss, but the PyTorch model's loss decreases faster. I wonder whether such differences between the outputs of the 2 models are marginal (and expected when converting models between frameworks), or whether the loss values should match more closely. Thank you.

I converted the Inception (InceptionTime) model from Keras to PyTorch.
When I print the model summaries, both models appear to have the same architecture and the same number of parameters. I compared the performance of both models on 18 datasets. On 15 of them they perform roughly the same, but on 3 datasets the accuracies differ (a gap of about 5%).

I think the difference comes from the optimization setup, so I am sharing the optimizer setup of both models.
Could you please tell me whether these two setups do the same thing?

Keras Code:

model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(),
                      metrics=['accuracy'])
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.5, patience=50,
                                              min_lr=0.0001)

model_checkpoint = keras.callbacks.ModelCheckpoint(filepath=file_path, monitor='loss', save_best_only=True)


Pytorch Code:

    criterion_CE = nn.CrossEntropyLoss()
    optimizer = optim.Adam(teacher.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=50, min_lr=0.0001)
    use_cuda = torch.cuda.is_available()
    best_model_wts = copy.deepcopy(teacher.state_dict())
    min_train_loss = np.inf

    for epoch in range(epochs):
        train_loss = train_alone_model(teacher, epoch)
        test(teacher)
        if min_train_loss > train_loss:
            min_train_loss = train_loss
            best_model_wts = copy.deepcopy(teacher.state_dict())
        scheduler.step(train_loss)
    

I also used ReduceLROnPlateau in both models. In Keras it works as expected, i.e. the learning rate gradually decreases until it reaches its minimum value (min_lr), but in PyTorch the learning rate rarely decreases, unlike in Keras.

Here is the plot of training losses of the both models.

Batch size, number of epochs and initial learning rate are the same in both models.

Thanks in advance!
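
One possible cause of the scheduler behaviour described above (an assumption, not verified against this model): the two ReduceLROnPlateau implementations define "improvement" differently by default. Keras uses an absolute threshold (min_delta=1e-4), while PyTorch defaults to a relative one (threshold=1e-4, threshold_mode='rel'). For losses well below 1 the relative threshold is much easier to beat, so PyTorch resets its patience counter more often and reduces the learning rate less often. A minimal sketch with a dummy parameter and a made-up loss sequence, with patience shortened to 2 for illustration:

```python
import torch

def final_lr(threshold_mode, losses, patience=2):
    """Run ReduceLROnPlateau over a loss sequence and return the final learning rate."""
    param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter, never trained
    optimizer = torch.optim.Adam([param], lr=0.001)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=patience,
        threshold=1e-4, threshold_mode=threshold_mode, min_lr=0.0001)
    for loss in losses:
        scheduler.step(loss)
    return optimizer.param_groups[0]['lr']

# Made-up losses that keep improving, but by less than 1e-4 per epoch
losses = [0.01000, 0.00999, 0.00998, 0.00997, 0.00996]

lr_rel = final_lr('rel', losses)  # relative threshold (PyTorch default)
lr_abs = final_lr('abs', losses)  # absolute threshold, like Keras's min_delta
print(lr_rel, lr_abs)
```

With these numbers the 'rel' run should leave the learning rate at 0.001, while the 'abs' run halves it once. If I read both implementations correctly, there is also an off-by-one in patience (Keras reduces once the wait counter reaches patience, PyTorch only after more than patience bad epochs). Separately, it may be worth checking the loss: Keras's categorical_crossentropy expects probabilities by default, whereas nn.CrossEntropyLoss expects raw logits, so a softmax kept at the end of the converted PyTorch model would be applied twice.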