I am trying to switch from Keras + TensorFlow to PyTorch.

To begin, I am trying to reimplement the most basic model from my project: a small densely connected network.

However, despite the two models looking the same and using the same optimizer (default Adam), the Keras one converges in about 15 epochs with an MSE of around 0.3, while the PyTorch variant reaches an MSE of around 8 billion.

I am almost certain that there is some basic mistake on my part, so I would appreciate help finding it.

Here’s the summary of the Keras model:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_4 (InputLayer)         (None, 44)                0
_________________________________________________________________
dense_10 (Dense)             (None, 44)                1980
_________________________________________________________________
activation_10 (Activation)   (None, 44)                0
_________________________________________________________________
dense_11 (Dense)             (None, 32)                1440
_________________________________________________________________
activation_11 (Activation)   (None, 32)                0
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 33
_________________________________________________________________
activation_12 (Activation)   (None, 1)                 0
=================================================================
Total params: 3,453
Trainable params: 3,453
Non-trainable params: 0
_________________________________________________________________
```

And here’s the one for PyTorch:

```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1             [-1, 8962, 44]           1,980
              ReLU-2             [-1, 8962, 44]               0
            Linear-3             [-1, 8962, 32]           1,440
              ReLU-4             [-1, 8962, 32]               0
            Linear-5              [-1, 8962, 1]              33
================================================================
Total params: 3,453
Trainable params: 3,453
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 1.50
Forward/backward pass size (MB): 10.46
Params size (MB): 0.01
Estimated Total Size (MB): 11.98
----------------------------------------------------------------
```
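
The parameter counts in both summaries can be reproduced by hand, which at least suggests the layer sizes match. A small sketch of that arithmetic (the helper `dense_params` is a made-up name, just for illustration):

```python
# Layer widths read off the two summaries: 44 inputs -> 44 -> 32 -> 1.
layer_sizes = [44, 44, 32, 1]


def dense_params(n_in, n_out):
    """Parameters of one dense layer: a weight matrix plus one bias per output."""
    return n_in * n_out + n_out


# Pair each layer width with the next one and count parameters per layer.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [1980, 1440, 33]
print(total)      # 3453
```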

My input tensor is created from the NumPy array with the following command, shown together with the resulting tensor:

```
x_train = torch.from_numpy(X_train_str).float().to(device)
tensor([[ 0.1957,  1.4893,  0.7181,  ..., -0.1391,  0.3075, -0.2688],
        [ 0.3367, -0.3090, -0.6731,  ..., -0.1391,  0.3075, -0.2688],
        [ 0.8814, -0.2180, -0.3253,  ..., -0.1391, -3.2525,  3.7199],
        ...,
        [-0.5159, -1.0219, -0.6731,  ...,  7.1915, -3.2525, -0.2688],
        [ 0.4028,  0.3993,  1.4138,  ..., -0.1391,  0.3075, -0.2688],
        [-0.0705,  0.2153,  0.7181,  ..., -0.1391,  0.3075, -0.2688]],
       device='cuda:0')
```

And the target (a single vector with the dependent-variable values):

```
y_train = torch.from_numpy(np.array(y_train)).float().to(device)
tensor([ 6.8199, 17.4941, 14.3596,  ..., 11.4096,  6.7144, 14.9953],
       device='cuda:0')
```
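
As a quick sanity check on the shapes involved, here is a minimal sketch on random stand-in data (not my real arrays; `n = 8` and the CPU tensors are assumptions for illustration). It prints the shape of the model output next to the shape of a flat target like `y_train`, and compares the summed squared error with and without reshaping the target to a column.

```python
import torch

torch.manual_seed(0)
n = 8  # small stand-in for the 8962 training rows (assumption)

x = torch.randn(n, 44)  # same layout as the input tensor: (rows, features)
y = torch.randn(n)      # same layout as the target: a flat vector, shape (n,)

model = torch.nn.Sequential(
    torch.nn.Linear(44, 44),
    torch.nn.ReLU(),
    torch.nn.Linear(44, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

y_pred = model(x)
print(y_pred.shape, y.shape)  # model output is (n, 1), target is (n,)

# Subtracting a flat (n,) vector from an (n, 1) column broadcasts to (n, n):
# every prediction is compared against every target, not just its own.
diff = y_pred - y
print(diff.shape)

loss_broadcast = (diff ** 2).sum()

# Reshaping the target to a column keeps the comparison element-wise.
loss_fn = torch.nn.MSELoss(reduction='sum')
loss_column = loss_fn(y_pred, y.view(-1, 1))

print(loss_broadcast.item(), loss_column.item())
```

If the shapes really differ as `(n, 1)` versus `(n,)`, the broadcast sum contains the element-wise terms plus all the mismatched pairs, which might be related to the large loss I am seeing. To my understanding `torch.nn.MSELoss` broadcasts the same way and emits a warning about the size mismatch.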

The input and target for Keras are the original NumPy array and pandas Series:

```
X_train_str:
array([[ 0.19572722,  1.4892545 ,  0.71813392, ..., -0.13905308,
         0.307455  , -0.26882353],
       [ 0.33670636, -0.30899649, -0.67311472, ..., -0.13905308,
         0.307455  , -0.26882353],
       [ 0.8814075 , -0.21798822, -0.32530256, ..., -0.13905308,
        -3.25250847,  3.71991241],
       ...,
       [-0.51589407, -1.02193099, -0.67311472, ...,  7.19149825,
        -3.25250847, -0.26882353],
       [ 0.4027681 ,  0.39929304,  1.41375824, ..., -0.13905308,
         0.307455  , -0.26882353],
       [-0.07051303,  0.21533594,  0.71813392, ..., -0.13905308,
         0.307455  , -0.26882353]])
y_train:
8556      6.819938
8742     17.494090
11390    14.359563
36       10.112089
4902     10.639344
           ...
11284    13.418195
5191      7.310966
5390     11.409576
860       6.714400
7270     14.995305
Name: Price Per Meter Squared, Length: 8962, dtype: float64
```

Here is the complete definition and training loop of the network in PyTorch:

```
import numpy as np
import torch
from torchsummary import summary  # assumption: the summary output above matches torchsummary

device = torch.device('cuda')
x_train_tensor = torch.from_numpy(X_train_str).float().to(device)
y_train_tensor = torch.from_numpy(np.array(y_train)).float().to(device)

model = torch.nn.Sequential(
    torch.nn.Linear(44, 44),
    torch.nn.ReLU(),
    torch.nn.Linear(44, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
).to(device)

#target = target.view(-1,1)
loss_fn = torch.nn.MSELoss(reduction='sum')  # sum of squared errors, not the mean
optimizer = torch.optim.Adam(model.parameters())
summary(model, (x_train_tensor.shape))

# Full-batch training: the whole training set is used for every step.
for t in range(500):
    y_pred = model(x_train_tensor)
    loss = loss_fn(y_pred, y_train_tensor)
    print(t, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

And the same in Keras:

```
inputs_str = Input(shape=X_train_str[0].shape)
x = layers.Dense(44)(inputs_str)
x = layers.Activation('relu')(x)
x = layers.Dense(32)(x)
x = layers.Activation('relu')(x)
x = layers.Dense(1)(x)
output_str = layers.Activation('linear')(x)
model_str = Model(inputs=inputs_str, outputs=output_str)
model_str.compile(optimizer='adam', loss='mse', metrics=['mae'])
model_str.summary()
model_str.fit(
    X_train_str,
    y_train,
    validation_data=(X_val_str, y_val),
    epochs=500,
    batch_size=64,
    callbacks=[model_str_save],
)
```
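
For a closer like-for-like comparison with the Keras call, a PyTorch loop would presumably need mini-batches of 64 and a mean-reduced loss, since Keras reports `'mse'` as a mean and trains with `batch_size=64`. Below is a sketch of that on synthetic data. The data, the epoch count, and the helper `full_mse` are assumptions for illustration and not part of my project.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 256 rows, 44 features, linear target.
x = torch.randn(256, 44)
y = x @ torch.randn(44) + 10.0  # flat vector, shape (256,)
y = y.view(-1, 1)               # column vector, same shape as the model output

model = torch.nn.Sequential(
    torch.nn.Linear(44, 44),
    torch.nn.ReLU(),
    torch.nn.Linear(44, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

loss_fn = torch.nn.MSELoss()                      # default reduction='mean'
optimizer = torch.optim.Adam(model.parameters())  # default learning rate
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)


def full_mse():
    """Mean squared error over the whole synthetic set, without gradients."""
    with torch.no_grad():
        return loss_fn(model(x), y).item()


initial_mse = full_mse()

for epoch in range(50):
    for xb, yb in loader:  # one optimizer step per mini-batch of 64 rows
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

final_mse = full_mse()
print(initial_mse, final_mse)
```

With the target reshaped to a column and the loss averaged, the printed values should be on the same scale as the `mse` that Keras reports, which would make the two runs easier to compare.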