A linear layer does not converge to the analytical solution

Hello,

As far as I know, a single linear layer with a single output neuron should behave exactly the same as linear regression. I am trying to reproduce this, but I have had no success so far.

I have the following code for initialization and data loading:

import h5py
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
import numpy as np
import torch

torch.set_default_tensor_type('torch.DoubleTensor')

filename = '../kernels/A600-1.h5'

f = h5py.File(filename, 'r')
X_train = f['kernels/train_kernel'][:,:]
y_train = f['vectors/train_vector'][:]
X_test = f['kernels/test_kernel'][:,:]
y_test = f['vectors/test_vector'][:]
f.close()

Then I compute linear regression results with the following code:

lin_model = LinearRegression(fit_intercept=False)
lin_model.fit(X_train, y_train)
y_predicted = lin_model.predict(X_test)
error = mean_absolute_error(y_test, y_predicted)
print(error)

For learning with PyTorch Linear layer I use this code:

X2_train = torch.Tensor(X_train).double()
y2_train = torch.Tensor(y_train.reshape(-1,1)).double()
X2_test = torch.Tensor(X_test).double()
y2_test = torch.Tensor(y_test.reshape(-1,1)).double()
model = torch.nn.Linear(600, 1, bias=False)
model.double()
criterion = torch.nn.MSELoss(reduction='elementwise_mean')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-8)
model.train()        
for epoch in range(int(1e6)):
    optimizer.zero_grad()   # Reset gradients
    model.zero_grad()   # Just to be sure
    y_predicted = model(X2_train)   # Forward pass: predict
    loss = criterion(y_predicted, y2_train)   # Forward pass: calculate the loss
    loss.backward()   # Backpropagation: calculate the gradients
    optimizer.step()   # Update the weights
    if epoch%1000==0:
        print('Epoch: {} - loss: {}'.format(epoch, loss.item()))
model.eval()
model.double()
y_predicted2 = model(X2_test).detach().numpy()   # use the tensor X2_test, not the numpy array
error2 = mean_absolute_error(y_test, y_predicted2)
print(error2)

I have also tried the LBFGS optimizer; the code is similar:

X2_train = torch.Tensor(X_train).double()
y2_train = torch.Tensor(y_train.reshape(-1,1)).double()
X2_test = torch.Tensor(X_test).double()
y2_test = torch.Tensor(y_test.reshape(-1,1)).double()
model = torch.nn.Linear(600, 1, bias=False)
model.double()
criterion = torch.nn.MSELoss(reduction='elementwise_mean')
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=1000, history_size=10000)
model.train()        
for epoch in range(int(3e2)):
    def closure():
        optimizer.zero_grad()   # Reset gradients
        model.zero_grad()   # Just to be sure
        y_predicted = model(X2_train)   # Forward pass: predict
        loss = criterion(y_predicted, y2_train)   # Forward pass: calculate the loss
        loss.backward()   # Backpropagation: calculate the gradients
        print('Epoch: {} - loss: {}'.format(epoch, loss.item()))
        return loss
    optimizer.step(closure)   # Update the weights
model.eval()
model.double()
y_predicted2 = model(X2_test).detach().numpy()   # use the tensor X2_test, not the numpy array
error2 = mean_absolute_error(y_test, y_predicted2)
print(error2)

Both LinearRegression and the Linear layer use 600 weights with no bias (I also tried enabling the bias, with no significant improvement), so they should behave identically. Both learn from the same data, 600 samples.
The analytical solution from LinearRegression has an L1 test error of 0.003666.
The Adam optimizer just runs for ages without converging to the analytical solution. LBFGS converges to a test error of 0.039252, which is more than ten times higher.

It is also possible to check weights with:

print(lin_model.coef_)
for p in model.parameters():
    print(p)  

They are different, as I would expect if they had converged to different minima. However, this is just linear regression, a convex problem; and since the objective is strictly convex for a full-rank X_train, the minimizer is unique, so they should converge to exactly the same weights.
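A direct numeric comparison makes the mismatch explicit (my own sketch; the tolerance is arbitrary):

# Compare sklearn's coefficients with the trained layer's weights
w_sklearn = np.ravel(lin_model.coef_)
w_torch = model.weight.detach().numpy().ravel()
print(np.allclose(w_sklearn, w_torch, atol=1e-6))
print(np.abs(w_sklearn - w_torch).max())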

Can you see where the problem is?

My PyTorch version is 0.4.1.post2.

I can also send you training data with which I achieved these results.

Hi,

Gradient descent is really bad for optimization :slight_smile:
In particular, you have a smooth convex function, for which plain gradient descent has a poor convergence rate (on the order of O(1/t)). The original Adam does not always converge, even in the convex case.
For the gradient descent methods, precise tuning of the learning rate (and the use of momentum) might be necessary to converge fast enough.
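Something like this untested sketch, reusing your model, criterion, and tensors; the lr and momentum values are guesses you would need to tune for your data:

# Hedged sketch: plain SGD with momentum; lr=1e-4 and momentum=0.9 are
# placeholder values, not a recommendation for this dataset.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for epoch in range(int(1e5)):
    optimizer.zero_grad()
    loss = criterion(model(X2_train), y2_train)
    loss.backward()
    optimizer.step()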

I am less familiar with LBFGS so I’m not sure why it converges to a different point.

I don’t know how LinearRegression from sklearn works. But does it really just minimize the average L2 difference? Doesn’t it include some regularization? And why do you optimize the L2 loss but compare the final L1 errors?

Ah, I didn’t know this about Adam.
I also tried a higher learning rate, but anything above 1e-8 for SGD makes the loss explode. With a rate that low, however, training takes ages. I tried momentum as well. At this point I cannot think of anything I haven’t tried yet.

X_train is a kernel matrix, and this code does the same as LinearRegression:

import numpy as np
import numpy.linalg as la
from sklearn.metrics import mean_absolute_error

alpha = np.dot(la.inv(X_train), y_train)
y_predicted3 = np.dot(X_test, alpha)
error3 = mean_absolute_error(y_test, y_predicted3)
print(error3)
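The same alpha can also be computed without forming the inverse explicitly, which is generally more numerically stable:

# Equivalent to np.dot(la.inv(X_train), y_train), but solved directly
alpha = la.solve(X_train, y_train)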

There is no regularization. I also tried optimizing the L1 loss; since the problem is convex, I thought it would converge to the same result. I could have included a commented line with L1Loss; I tried it, but it did not converge to the analytical solution either. I compare L1 errors because their physical meaning is easier to see.

L1Loss and L2Loss are both convex, but they are two different functions, so optimizing one or the other will give different results!
For example, consider a 1D problem without bias, with samples (input, output): (1, 0.5) and (2, 0). Then your loss functions will look like this:

[image: plot of the L1 and L2 losses as a function of the weight, with their minima at different points]

As you can see, the absolute value and the squared norm don’t have their minimum at the same point.
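To make it concrete, here is a quick numeric check of that example (my own sketch, a grid search over the weight w):

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([0.5, 0.0])
w = np.linspace(-0.5, 1.0, 3001)                # candidate weights

l2 = ((w[:, None] * x - y) ** 2).mean(axis=1)   # mean squared error per w
l1 = np.abs(w[:, None] * x - y).mean(axis=1)    # mean absolute error per w

print(w[l2.argmin()])   # ~0.1, matching the closed form (x·y)/(x·x) = 0.5/5
print(w[l1.argmin()])   # ~0.0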

It is surprising that you cannot use SGD with any value bigger than 1e-8. Maybe you want to normalize your dataset to avoid very bad conditioning, which significantly impacts simple gradient-based methods.
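For example, a rough standardization sketch (whether it helps depends on how your kernel matrix is scaled; the epsilon guards against zero-variance columns):

# Standardize with training statistics only, then reuse them for the test set
mean = X2_train.mean(dim=0, keepdim=True)
std = X2_train.std(dim=0, keepdim=True) + 1e-12
X2_train_n = (X2_train - mean) / std
X2_test_n = (X2_test - mean) / std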

Ok, but in any case LinearRegression uses the L2 norm, and I used it for the Linear layer too. So the conditions are the same for both, and they should end up with the same errors.
Even if I normalize the data, LBFGS gives a different result. The maximum usable learning rate for SGD is on the order of 1e-3, and training again takes ages.

I think the problem is in the data: it has a very high condition number. Could that be the problem?
If someone has encountered this problem, could you let me know how you solved it?
I have already tried rescaling the input and output, but it did not help.
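For reference, the condition number can be checked directly:

# Ratio of the largest to the smallest singular value of the kernel matrix
print(np.linalg.cond(X_train))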

I think the problem is that the averaged gradient is small at every step, because you feed the whole dataset to the model and optimizer at once.

I am not sure how large your dataset is, but when the batch size is too large and the data is too noisy, the gradients average out to near zero. The model then converges at an extremely slow rate.

Try feeding your data to the model and optimizer one sample at a time and see what happens.
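For example, a sketch with batch size 1, reusing the model, criterion, and optimizer from your Adam code (the shuffling just randomizes the sample order each epoch):

n = X2_train.shape[0]
for epoch in range(100):                     # fewer epochs, many more updates
    for i in torch.randperm(n).tolist():     # shuffle sample order
        optimizer.zero_grad()
        y_pred = model(X2_train[i:i+1])      # single-sample batch
        loss = criterion(y_pred, y2_train[i:i+1])
        loss.backward()
        optimizer.step()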

Hi, did you make any progress on this problem? I have the same issue: I am trying to use a one-layer linear network to solve a least-squares approximation problem, and the result given by the optimizer is ten times higher than that of the explicit expression. Did you ever find a good solution?

Thanks,
Yao

I had a similar problem on a dataset where the number of samples was only about 10× the number of parameters. The data was noisy, and the optimization would get stuck, producing good results only for the frequent samples. Decreasing the batch size helped, as it allowed the optimizer to better explore parameters for the less frequent samples as well.

I got good results with batches of size 1-2. This significantly increased the duration of a single epoch, but I could reduce the number of iterations and arrive at a much better solution than before.

I was stuck on this problem forever, and finally reducing the batch size to 2 helped to improve the performance. Thank you so much!!!
I expected the model to achieve a lower L2 loss, though; do you have any other suggestions?