Linear regression, PyTorch: MSE ~ Variance, Keras: 0 ~ MSE << Variance

czlowiekrakieta · January 12, 2019, 2:11pm

Hi there,

My problem is a very simple toy one, yet I'm struggling to find a solution. I created a 2D array X of shape N x D with random inputs, then a random vector b of size D (beta in the code). The output is y = X @ b, with no error term, just a matrix multiplication. I wanted my model to learn this linear dependency, but PyTorch refused to cooperate: its MSE doesn't go below the variance of y. I created a completely analogous model in Keras and it worked well, just as expected.

Code is here:

import numpy as np
from keras.layers import Dense, Activation
from keras.models import Sequential
from keras.optimizers import SGD

from torch import nn, from_numpy
from torch.autograd import Variable
from torch import optim
from torch.nn.functional import mse_loss

X = np.random.randn(100, 5).astype(np.float32)
beta = np.random.randn(5).astype(np.float32)

y = X @ beta

tX = from_numpy(X)
ty = from_numpy(y)

keras_model = Sequential(layers=[Dense(input_shape=(5,), units=20, activation='relu'), 
                                 Dense(units=1)])
torch_model = nn.Sequential(nn.Linear(5, 20), nn.ReLU(), nn.Linear(20, 1))

opt = optim.SGD(torch_model.parameters(), lr=1e-3, momentum=0.8)
keras_model.compile(SGD(lr=1e-3, momentum=0.8), loss='mse')

ITERS = 100

for i in range(ITERS):
#     torch_model.zero_grad()
    loss = mse_loss(torch_model(tX), ty)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mse_loss(torch_model(tX), ty))

print(y.var())

keras_model.fit(X, y, batch_size=X.shape[0], epochs=ITERS)

PyTorch version: 1.0.0
Keras version: 2.0.4

Everything was trained on CPU. I didn't add many more print statements to this code, but it turns out that the weights are updating, albeit slightly.

I'm probably overlooking something completely trivial, but, well, maybe I'm blind.

ptrblck · January 12, 2019, 4:16pm

It looks like the shape of your target ty might be wrong.
Currently it has the shape [100], while your model’s output is [100, 1].
This means that internally your target will be broadcast, so the operation
(torch_model(tX) - ty) will yield a tensor of shape [100, 100].
This is most likely wrong.
Try adding dim1 to your target with ty = ty.unsqueeze(1) before passing it to your loss, and your model should work.
Let me know if that helps.
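
To see why the loss stalls near the variance, here is a minimal sketch in NumPy, on the assumption that PyTorch applies the same broadcasting rule in this case. The random `pred` array is only a stand-in for the model output, not the actual network, and all variable names are made up for illustration. Averaging the squared difference over the broadcast [100, 100] grid compares every prediction with every target, so the loss splits into var(pred) + var(y) + (mean(pred) - mean(y))^2, and the best it can reach is var(y), with a constant prediction equal to mean(y).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# Target of shape (100,) and a stand-in "model output" of shape (100, 1).
# pred is random here; it only mimics the shape of torch_model(tX).
y = rng.standard_normal(N).astype(np.float32)
pred = rng.standard_normal((N, 1)).astype(np.float32)

# (100, 1) - (100,) broadcasts to (100, 100): each prediction vs. each target.
diff = pred - y
broadcast_mse = float((diff ** 2).mean())

# The mean over all (i, j) pairs splits into three non-negative terms.
decomposed = float(pred.var() + y.var() + (pred.mean() - y.mean()) ** 2)

# Lowest value the broadcast loss can reach: predict mean(y) everywhere.
const_pred = np.full((N, 1), y.mean(), dtype=np.float32)
floor_mse = float(((const_pred - y) ** 2).mean())

# With matching shapes the difference is element-wise, as intended.
# y[:, None] is the NumPy counterpart of ty.unsqueeze(1).
fixed_diff = pred - y[:, None]

print(diff.shape, fixed_diff.shape)
print(broadcast_mse, decomposed)
print(floor_mse, float(y.var()))
```

The two numbers on each of the last two lines should agree up to float32 rounding, which matches the symptom in the original post: an MSE that never drops below y.var().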

czlowiekrakieta · January 12, 2019, 4:36pm

It did! The MSE now dives significantly below the variance, but it's still worse than Keras (ceteris paribus, just as in the OP). Is that expected?

Nonetheless, thank you!