Linear regression

I am new to ML. My idea is to use x_data (height, weight) to predict y_data (life) with y = a*x(1) + b*x(2) + c. Whether x(1) is height or weight should not change the regression result, but in my code the column order does affect the result. My questions: 1. If I do not use the SGD optimizer, the regression is bad, the loss is very large, and the result is always the same (a = -0.4912, b = 0.2071), but I think a and b should change. 2. If I use the SGD optimizer, the result is not the same and I get different values on every run, so I am confused.

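import torch
import torch.nn as nn

x_data = [[1.3, 2.7], [2.4, 3.3], [4.7, 5.2], [5.6, 6.1], [6.9, 7.9]]
#x_data = [[2.7, 1.3], [3.3, 2.4], [5.2, 4.7], [6.1, 5.6], [7.9, 6.9]]  # columns swapped
y_data = [[6.], [9.], [15.], [18.], [21.]]
x = torch.tensor(x_data).view(5, -1)
y = torch.tensor(y_data).view(5, -1)
print(x.size(), y.size())
# our hypothesis: xw + b
model = nn.Linear(2, 1, bias=True)
# cost criterion
criterion = nn.MSELoss()
# minimize (question 2 only; the learning rate did not survive in the post, 0.01 is a placeholder)
#optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# train the model
for step in range(200001):
    #optimizer.zero_grad()  # question 2 only
    # our hypothesis
    hypothesis = model(x)
    cost = criterion(hypothesis, y)
    cost.backward()
    #optimizer.step()  # question 2 only
    if step % 10 == 0:
        print(step, 'cost: ', cost.item(), '\nprediction:\n', hypothesis.detach().numpy())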

In case a) (no optimizer), which is the code as posted, nothing is really happening: there is no update to your model’s parameters. That is why the regression is bad, the loss is huge and the result never changes. When you call cost.backward(), all you’re doing is calculating the gradient of the loss with respect to the parameters, and since you’re not zeroing the gradient, you actually accumulate these values, so they grow bigger and bigger. However, you never use this information (the gradient) to improve your parameters.
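
To make this concrete, here is a minimal sketch of what happens when backward() is called repeatedly with no optimizer. The fixed seed and the variable names (weight_before, grad_once, grad_twice) are my own choices for illustration, not part of the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable (assumption for this sketch)

x = torch.tensor([[1.3, 2.7], [2.4, 3.3], [4.7, 5.2], [5.6, 6.1], [6.9, 7.9]])
y = torch.tensor([[6.], [9.], [15.], [18.], [21.]])

model = nn.Linear(2, 1)
criterion = nn.MSELoss()

weight_before = model.weight.detach().clone()

# First backward pass: .grad now holds the gradient of the loss.
criterion(model(x), y).backward()
grad_once = model.weight.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
criterion(model(x), y).backward()
grad_twice = model.weight.grad.clone()

print("gradient doubled:", torch.allclose(grad_twice, 2 * grad_once))
print("weights unchanged:", torch.equal(model.weight.detach(), weight_before))
```

The stored gradient keeps growing, while the weights stay exactly where the random initialization put them.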

In case b) the parameters do change since the optimizer.zero_grad() and optimizer.step() take care of “resetting” the gradients and updating your parameters, respectively.
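
For comparison, a minimal sketch of the case b) loop could look like the following. The learning rate of 0.01, the 1000 steps and the fixed seed are assumptions on my part, chosen only so that the example runs quickly and stays stable on this data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable (assumption for this sketch)

x = torch.tensor([[1.3, 2.7], [2.4, 3.3], [4.7, 5.2], [5.6, 6.1], [6.9, 7.9]])
y = torch.tensor([[6.], [9.], [15.], [18.], [21.]])

model = nn.Linear(2, 1)
criterion = nn.MSELoss()
# lr=0.01 is an assumed value, not taken from the original post
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for step in range(1000):
    optimizer.zero_grad()          # reset the gradients left over from the previous step
    cost = criterion(model(x), y)  # forward pass and loss
    cost.backward()                # compute fresh gradients
    optimizer.step()               # use them to update weight and bias
    losses.append(cost.item())

print("first loss:", losses[0], "last loss:", losses[-1])
```

The loss goes down, but 1000 steps are far too few for the parameters to settle on this problem.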

Also, the “results” change when you alter the order of your input features because the parameters are nowhere near convergence. That means that even in case b), which is the correct one, you would have to run many more iterations for the algorithm to actually find the correct values for your parameters. Once it has converged, swapping the input features no longer alters the result (other than swapping the corresponding weights, of course). Oh, and this assumes that you set a suitable learning rate, but that shouldn’t be difficult for this task.

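However, this task has a simple analytical solution:

$$W = (X^\top X)^{-1} X^\top y$$

where X is a batch_size x (features + 1) matrix (your input features with a column of ones appended for the bias term), y is the vector of targets, and W holds a, b and c.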

If you’re curious the resulting parameter values are a = 2.89, b = -0.21, c = 2.76

This simple script gets the analytical solution in no time. (Gradient descent with the Adam optimizer found these values after ~400,000 iterations. I don’t know about vanilla SGD, but you’re free to wait for it to converge and/or play with the learning rate. Tip: begin with a “large” value and decrease it over time.)

import torch

x_data = [[1.3, 2.7], [2.4, 3.3],[4.7, 5.2], [5.6, 6.1], [6.9, 7.9]]
#x_data = [[2.7, 1.3], [3.3, 2.4],[5.2, 4.7], [6.1, 5.6], [7.9, 6.9]] 
y_data = [[6.], [9.], [15.], [18.], [21.]]

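x = torch.FloatTensor(x_data)
# append a column of ones so the bias c is learned as the last weight
X = torch.cat((x, torch.ones(x.size()[0], 1)), 1)
y = torch.FloatTensor(y_data).squeeze_()
# normal equation: W = (X^T X)^-1 X^T y
A = torch.inverse(torch.mm(X.t(), X))
W = torch.mv(torch.mm(A, X.t()), y)
print(W)  # [a, b, c]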
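
As a quick check of the claim that the feature order does not matter once you have the true solution, here is a small sketch that solves the same problem for both column orders. The helper name fit, the use of float64 and the call to torch.linalg.solve (instead of an explicit inverse) are my own choices for this illustration.

```python
import torch

x = torch.tensor([[1.3, 2.7], [2.4, 3.3], [4.7, 5.2], [5.6, 6.1], [6.9, 7.9]],
                 dtype=torch.float64)
y = torch.tensor([[6.], [9.], [15.], [18.], [21.]], dtype=torch.float64)


def fit(features, targets):
    """Solve the normal equations (X^T X) W = X^T y; returns [weights..., bias]."""
    ones = torch.ones(features.size(0), 1, dtype=features.dtype)
    X = torch.cat((features, ones), dim=1)  # bias column goes last
    return torch.linalg.solve(X.t() @ X, X.t() @ targets).squeeze(1)


w_original = fit(x, y)           # [a, b, c]
w_swapped = fit(x[:, [1, 0]], y)  # same data with the two feature columns exchanged

print("original order:", w_original)
print("swapped order: ", w_swapped)
```

The two weights simply trade places and the bias stays the same, which is exactly what you would expect from a converged gradient descent run as well.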


And one last thing, if you’re up for some (light) math try finding the analytical solution by yourself. All you need is a little algebra and simple differential calculus.



Thank you very much for your patient and detailed answer. It perfectly solved my problem.