Model copied from Keras doesn't converge

Good day,
I am trying to switch from Keras + TensorFlow to PyTorch.
To begin, I am trying to reimplement the most basic model from my project, a small densely connected network.
However, despite looking the same and using the same optimizer (default Adam), the Keras model converges in about 15 epochs with an MSE of around 0.3, while the PyTorch variant reaches an MSE of around 8 billion.
I am almost certain that there is some basic mistake on my part, so I ask you to help me find it.

Here’s the summary of the Keras model:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_4 (InputLayer)         (None, 44)                0         
_________________________________________________________________
dense_10 (Dense)             (None, 44)                1980      
_________________________________________________________________
activation_10 (Activation)   (None, 44)                0         
_________________________________________________________________
dense_11 (Dense)             (None, 32)                1440      
_________________________________________________________________
activation_11 (Activation)   (None, 32)                0         
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 33        
_________________________________________________________________
activation_12 (Activation)   (None, 1)                 0         
=================================================================
Total params: 3,453
Trainable params: 3,453
Non-trainable params: 0
_________________________________________________________________

And here’s the one for PyTorch:

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1             [-1, 8962, 44]           1,980
              ReLU-2             [-1, 8962, 44]               0
            Linear-3             [-1, 8962, 32]           1,440
              ReLU-4             [-1, 8962, 32]               0
            Linear-5              [-1, 8962, 1]              33
================================================================
Total params: 3,453
Trainable params: 3,453
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 1.50
Forward/backward pass size (MB): 10.46
Params size (MB): 0.01
Estimated Total Size (MB): 11.98
----------------------------------------------------------------

My input tensor is created from the numpy array with the following command and result:

x_train = torch.from_numpy(X_train_str).float().to(device)
tensor([[ 0.1957,  1.4893,  0.7181,  ..., -0.1391,  0.3075, -0.2688],
        [ 0.3367, -0.3090, -0.6731,  ..., -0.1391,  0.3075, -0.2688],
        [ 0.8814, -0.2180, -0.3253,  ..., -0.1391, -3.2525,  3.7199],
        ...,
        [-0.5159, -1.0219, -0.6731,  ...,  7.1915, -3.2525, -0.2688],
        [ 0.4028,  0.3993,  1.4138,  ..., -0.1391,  0.3075, -0.2688],
        [-0.0705,  0.2153,  0.7181,  ..., -0.1391,  0.3075, -0.2688]],
       device='cuda:0')

And the output (a single vector with the dependent variable values):

y_train = torch.from_numpy(np.array(y_train)).float().to(device)
tensor([ 6.8199, 17.4941, 14.3596,  ..., 11.4096,  6.7144, 14.9953],
       device='cuda:0')

Input and output for Keras are the original arrays:

X_train_str:
array([[ 0.19572722,  1.4892545 ,  0.71813392, ..., -0.13905308,
         0.307455  , -0.26882353],
       [ 0.33670636, -0.30899649, -0.67311472, ..., -0.13905308,
         0.307455  , -0.26882353],
       [ 0.8814075 , -0.21798822, -0.32530256, ..., -0.13905308,
        -3.25250847,  3.71991241],
       ...,
       [-0.51589407, -1.02193099, -0.67311472, ...,  7.19149825,
        -3.25250847, -0.26882353],
       [ 0.4027681 ,  0.39929304,  1.41375824, ..., -0.13905308,
         0.307455  , -0.26882353],
       [-0.07051303,  0.21533594,  0.71813392, ..., -0.13905308,
         0.307455  , -0.26882353]])
y_train:
8556      6.819938
8742     17.494090
11390    14.359563
36       10.112089
4902     10.639344
           ...    
11284    13.418195
5191      7.310966
5390     11.409576
860       6.714400
7270     14.995305
Name: Price Per Meter Squared, Length: 8962, dtype: float64

Complete definition of the Network in PyTorch:

device = torch.device('cuda')

x_train_tensor = torch.from_numpy(X_train_str).float().to(device)
y_train_tensor = torch.from_numpy(np.array(y_train)).float().to(device)

model = torch.nn.Sequential(
    torch.nn.Linear(44, 44),
    torch.nn.ReLU(),
    torch.nn.Linear(44, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
).to(device)

#target = target.view(-1,1)

loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters())
          
summary(model, (x_train_tensor.shape))
    
for t in range(500):
  y_pred = model(x_train_tensor)
  loss = loss_fn(y_pred, y_train_tensor)
  print(t, loss.item())
 
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

and the same in Keras:

inputs_str = Input(shape=X_train_str[0].shape)
x = layers.Dense(44)(inputs_str)
x = layers.Activation('relu')(x)
x = layers.Dense(32)(x)
x = layers.Activation('relu')(x)
x = layers.Dense(1)(x)
output_str = layers.Activation('linear')(x)
model_str = Model(inputs=inputs_str, outputs=output_str)
model_str.compile(optimizer='adam', loss='mse', metrics=['mae'])
model_str.summary()
model_str.fit(X_train_str, y_train, validation_data=(X_val_str, y_val), epochs=500, batch_size=64, callbacks=[model_str_save])

Could you print the shapes of y_pred and y_train_tensor and check whether unwanted broadcasting is happening in the criterion?

Do you know what loss reduction is used in Keras by default? Since you are summing the batch loss, I assume Keras is doing the same?

Also, what does the 8962 stand for in the model summary?

Sure,

y_pred.shape
torch.Size([8962, 1])
y_train_tensor.shape
torch.Size([8962])

It does give the following warning from time to time:

/home/vintodrimmer/.local/lib/python3.6/site-packages/torch/nn/modules/loss.py:431: UserWarning: Using a target size (torch.Size([8962])) that is different to the input size (torch.Size([8962, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)

I tried to solve it with

target = target.view(-1,1)

but I’m not sure whether it’s working, since the warning is not always printed.
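A small sketch of what the warning is about, using made-up tensors rather than the real data (y_pred and y_true are placeholder names). When the prediction has shape (n, 1) and the target has shape (n,), the subtraction inside the loss broadcasts to an (n, n) matrix, so the summed loss covers n*n pairs instead of n. Python also shows a given warning only once per code location by default, which may be why it is not printed on every run.

```python
import warnings

import torch

torch.manual_seed(0)
n = 5
y_pred = torch.randn(n, 1)  # model output, shape (n, 1)
y_true = torch.randn(n)     # target, shape (n,)

# (n, 1) minus (n,) broadcasts to (n, n): every prediction is
# compared against every target, not only its own.
diff = y_pred - y_true
print(diff.shape)  # torch.Size([5, 5])

loss_fn = torch.nn.MSELoss(reduction='sum')

# Mismatched shapes: PyTorch warns and then broadcasts.
# The warning is silenced here only to keep the output short.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    broadcast_loss = loss_fn(y_pred, y_true)

# Matching shapes: reshape the target to (n, 1) first.
correct_loss = loss_fn(y_pred, y_true.view(-1, 1))

print(broadcast_loss.item(), correct_loss.item())
```

The reshape has to be applied to the tensor that is actually passed to the loss (y_train_tensor in the code above), and it can be checked directly by printing the shape instead of waiting for the warning.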

Do you know what loss reduction is used in Keras by default? Since you are summing the batch loss, I assume Keras is doing the same?

It should be SUM_OVER_BATCH_SIZE by default, so I assume them to be the same.
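One caveat worth checking: as far as I know, SUM_OVER_BATCH_SIZE in Keras means the summed loss divided by the number of elements, which is a mean. That would correspond to reduction='mean' in PyTorch, not reduction='sum'. A tiny sketch with made-up numbers shows how far apart the two are:

```python
import torch

# Made-up predictions and targets, both of shape (3, 1).
y_pred = torch.tensor([[1.0], [2.0], [4.0]])
y_true = torch.tensor([[0.0], [2.0], [2.0]])

# Squared errors are 1, 0 and 4.
sum_loss = torch.nn.MSELoss(reduction='sum')(y_pred, y_true)
mean_loss = torch.nn.MSELoss(reduction='mean')(y_pred, y_true)

print(sum_loss.item())   # 5.0
print(mean_loss.item())  # the sum divided by the 3 elements
```

With reduction='sum' over all 8962 training rows, the printed loss is 8962 times the per-sample MSE, so it cannot be compared directly with the value Keras reports.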

what does the 8962 stand for in the model summary?

Number of observations in the training set.

EDIT: after unsqueezing it does work, but still only approaches MSE of 600 after 30 000 epochs.
Is that to be expected?

You are not splitting your data into batches, as you do in Keras by passing batch_size=64.
Instead, you are passing the whole x_train of shape (8962, 44) at each training iteration. Try splitting your training data and passing a batch of x_train with shape (batch_size, 44) at each training iteration. Also shuffle the data after each epoch.

Try using the code below:

train_iterator = torch.utils.data.TensorDataset(x_train_tensor, y_train_tensor)
# make sure y_train_tensor has shape (8962, 1), not (8962,)
train_data = torch.utils.data.DataLoader(train_iterator, batch_size = 64, shuffle = True)

for x, y in train_data:
    break
print(x.shape, y.shape) # will print torch.Size([64, 44]) torch.Size([64, 1])

model = torch.nn.Sequential(
    torch.nn.Linear(44, 44),
    torch.nn.ReLU(),
    torch.nn.Linear(44, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
).to(device)


criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters())

epochs = 20
batches = len(train_data)
# number of batches in your dataset (where each batch contains 64 examples from your training data)

for epoch in range(epochs):
    epoch_loss = 0.0 # initializing epoch_loss to zero 
    for features, labels in train_data:
        outputs = model(features)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item() # here we are accumulating loss for each training batch.
    print(f'Epoch: {epoch} -> Loss: {(epoch_loss/batches):.8f}')
    # dividing epoch_loss by the number of batches in train_data gives the mean loss per batch
    # (with reduction='sum' this is the summed squared error of a batch, not the per-sample MSE)
    # and printing the epoch loss rounded to 8 decimals
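To get a number that is comparable with the MSE Keras prints, the summed loss can be divided by the number of samples instead of the number of batches. Below is a minimal sketch of that variant on synthetic stand-in data (the random x and y, the 256 rows and the 5 epochs are assumptions made only so the snippet runs on its own, on CPU):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the real data (hypothetical): 256 rows, 44 features.
x = torch.randn(256, 44)
y = x.sum(dim=1, keepdim=True)  # target already has shape (256, 1)

loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Linear(44, 44),
    torch.nn.ReLU(),
    torch.nn.Linear(44, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters())

history = []
for epoch in range(5):
    total = 0.0  # summed squared error over the whole epoch
    for features, labels in loader:
        loss = criterion(model(features), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    # Divide by the number of samples, not batches, to get a per-sample MSE.
    mse = total / len(loader.dataset)
    history.append(mse)
    print(f'Epoch: {epoch} -> MSE: {mse:.8f}')
```

Using reduction='mean' and averaging over batches gives nearly the same number; the two differ slightly when the last batch is smaller than the others.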

Another minor error I’ve noticed (not related to your loss convergence): when calling summary to get the summary of your model, you are again passing the whole shape of your x_train, which is (8962, 44). This leads to wrong layer shapes in the output (I’m sure you didn’t realize it). Instead you need to pass the shape of a single example, so running summary(model, x_train_tensor[0].shape) will give the correct summary (just as in Keras):

----------------------------------------------------------------
        Layer (type)               Output Shape         Param # 
================================================================
            Linear-1                   [-1, 44]           1,980 
              ReLU-2                   [-1, 44]               0 
            Linear-3                   [-1, 32]           1,440 
              ReLU-4                   [-1, 32]               0 
            Linear-5                    [-1, 1]              33 
================================================================
Total params: 3,453
Trainable params: 3,453
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.01
Estimated Total Size (MB): 0.01
----------------------------------------------------------------

Thank you very much!

I don’t think it’s necessary to partition the data into batches (it converges even if I use all the dataset at once in Keras), but something is definitely working!
I wouldn’t say that it converges exactly, but it got down to MSE of 7 in about 500 epochs.

Also, output of the summary makes much more sense now.
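On the question of whether batching matters, a rough count of optimizer updates may help. This is only arithmetic on the numbers in this thread, assuming Keras performs one update per batch, including the last partial one:

```python
import math

n_samples = 8962   # training rows, from the thread
batch_size = 64    # batch size passed to Keras fit

# One optimizer update per batch; the last, smaller batch counts too.
updates_per_epoch = math.ceil(n_samples / batch_size)
print(updates_per_epoch)  # 141

# About 15 epochs were enough for the Keras model to converge.
keras_updates = updates_per_epoch * 15
print(keras_updates)  # 2115

# The full-batch PyTorch loop does exactly one update per iteration.
full_batch_updates = 500
print(keras_updates / full_batch_updates)
```

So 500 full-batch iterations are fewer weight updates than the 15 mini-batch epochs in Keras, which may be part of why the full-batch loop looks slower to converge.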