PyTorch convergence out of the box

Hi all,
I am following a PyTorch tutorial and found some behavior I did not expect. In particular, the convergence of the minimizer is much slower when using autograd, and slower still when I wrap my (linear) model in a PyTorch class.

For starters, in the code below:

import numpy as np 

# Compute every step manually
# Linear regression
# f = w * x 
# suppose : f = 2 * x
X = np.array([1, 2, 3, 4], dtype=np.float32)
Y = np.array([2, 4, 6, 8], dtype=np.float32)

w = 0.0

# model output
def forward(x):
    return w * x

# loss = MSE
def loss(y, y_pred):
    return ((y_pred - y)**2).mean()

# J = MSE = 1/N * (w*x - y)**2
# dJ/dw = 1/N * 2x(w*x - y)
def gradient(x, y, y_pred):
    return np.dot(2*x, y_pred - y).mean()

print(f'Prediction before training: f(5) = {forward(5):.7f}')

# Training
learning_rate = 0.01
n_iters = 100

for epoch in range(n_iters):
    # predict = forward pass
    y_pred = forward(X)

    # loss
    l = loss(Y, y_pred)
    
    # calculate gradients
    dw = gradient(X, Y, y_pred)

    # update weights
    w -= learning_rate * dw

    if epoch % 2 == 0:
        print(f'epoch {epoch+1}: w = {w:.7f}, loss = {l:.7f}')
     
print(f'Prediction after training: f(5) = {forward(5):.7f}')

This code converges to the correct linear weight in about 20 iterations (taking machine precision for float32 to be about 7 digits), and the loss stops decreasing around iteration 13.

But the next code, where autograd is used, never converges to the correct weight, and it does not give the correct answer for the test element within machine precision either. The loss stops decreasing around iteration 70.

# Here we replace the manually computed gradient with autograd
import torch
# Linear regression
# f = w * x 

# here : f = 2 * x
X = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
Y = torch.tensor([2, 4, 6, 8], dtype=torch.float32)

w = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)

# model output
def forward(x):
    return w * x

# loss = MSE
def loss(y, y_pred):
    return ((y_pred - y)**2).mean()

print(f'Prediction before training: f(5) = {forward(5).item():.7f}')

# Training
learning_rate = 0.01
n_iters = 100

for epoch in range(n_iters):
    # predict = forward pass
    y_pred = forward(X)

    # loss
    l = loss(Y, y_pred)

    # calculate gradients = backward pass
    l.backward()

    # update weights
    #w.data = w.data - learning_rate * w.grad
    with torch.no_grad():
        w -= learning_rate * w.grad
    
    # zero the gradients after updating
    w.grad.zero_()

    if epoch % 1 == 0:
        print(f'epoch {epoch+1}: w = {w.item():.7f}, loss = {l.item():.7f}')

print(f'Prediction after training: f(5) = {forward(5).item():.7f}')

Lastly, when I use the PyTorch nn.Linear class for my model, the performance gets quite a bit worse.

# 1) Design model (input, output, forward pass with different layers)
# 2) Construct loss and optimizer
# 3) Training loop
#       - Forward = compute prediction and loss
#       - Backward = compute gradients
#       - Update weights

import torch
import torch.nn as nn

# Linear regression
# f = w * x 

# here : f = 2 * x

# 0) Training samples, watch the shape!
X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32)
Y = torch.tensor([[2], [4], [6], [8]], dtype=torch.float32)

n_samples, n_features = X.shape
print(f'#samples: {n_samples}, #features: {n_features}')
# 0) create a test sample
X_test = torch.tensor([5], dtype=torch.float32)

# 1) Design Model, the model has to implement the forward pass!
# Here we can use a built-in model from PyTorch
input_size = n_features
output_size = n_features

# we can call this model with samples X
#model = nn.Linear(input_size, output_size)


class LinearRegression(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LinearRegression, self).__init__()
        # define different layers
        self.lin = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.lin(x)

model = LinearRegression(input_size, output_size)


print(f'Prediction before training: f(5) = {model(X_test).item():.7f}')

# 2) Define loss and optimizer
learning_rate = 0.01
n_iters = 10000

loss = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# 3) Training loop
for epoch in range(n_iters):
    # predict = forward pass with our model
    y_predicted = model(X)

    # loss
    l = loss(Y, y_predicted)

    # calculate gradients = backward pass
    l.backward()

    # update weights
    optimizer.step()

    # zero the gradients after updating
    optimizer.zero_grad()

    if epoch % 100 == 0:
        [w, b] = model.parameters() # unpack parameters
        print(f'epoch {epoch+1}: w = {w[0][0].item():.7f}, loss = {l.item():.7f}')

print(f'Prediction after training: f(5) = {model(X_test).item():.7f}')

Here, again, the code never converges to the correct weight, and even after 2000 iterations the loss is still decreasing.

Is there something basic I am missing about how PyTorch handles autograd and class wrapping? Is the default precision float16 or something? This behavior was unexpected for me.

Thank you very much!

You could add some debug print statements to the code, e.g. via:

print('epoch {}, output {}, loss {}, grad {}'.format(
    epoch, y_pred, l, w.grad))

to compare the numpy vs. PyTorch approach.
Once that's done, you'll see that the gradients don't match:

pytorch
Prediction before training: f(5) = 0.0000000
epoch 0, output tensor([0., 0., 0., 0.], grad_fn=<MulBackward0>), loss 30.0, grad -30.0
epoch 1: w = 0.3000000, loss = 30.0000000

numpy
Prediction before training: f(5) = 0.0000000
epoch 0, output [0. 0. 0. 0.], loss 30.0, grad -120.0
epoch 1: w = 1.2000000, loss = 30.0000000

and that the numpy run converges much faster than the PyTorch one.
Looking at your code, I would guess the gradient calculation in numpy is the issue:

# J = MSE = 1/N * (w*x - y)**2
# dJ/dw = 1/N * 2x(w*x - y)
def gradient(x, y, y_pred):
    return np.dot(2*x, y_pred - y).mean()

Here you are probably trying to call mean() on the dot output to scale the gradients by 1/N = 1/4.
However, given your inputs, np.dot returns a scalar, so no scaling is done.
Also, the initial gradients differ by a factor of 4, which also points to this code.
If you add the division by 4 (the number of samples) in gradient, the numpy model will converge in approximately the same way as the original PyTorch model. Alternatively, if you scale the PyTorch loss by 4, it will also converge about as fast as the original numpy model.
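
For example, the corrected numpy gradient could look something like this (dividing explicitly by the number of samples):

# dJ/dw = 1/N * sum 2x(w*x - y)
def gradient(x, y, y_pred):
    # np.dot already sums over all samples and returns a scalar,
    # so calling .mean() on it was a no-op; divide by N explicitly instead
    return np.dot(2*x, y_pred - y) / len(x)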

Thanks very much! I had only explored the loss and the weights but not the gradient.

This still only explains the difference between the numpy and autograd methods, though, and does not address why the performance changes so much when I use the PyTorch nn.Linear model. Any thoughts?

Thanks again!

The nn.Linear approach would have the same gradients as the manual PyTorch approach, so I would assume that scaling the loss by 4 would also reproduce the numpy result. Did you try it?
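
One way to check that directly is to print the gradients of the linear layer right after the backward pass, e.g. (a small sketch, assuming the LinearRegression model from your last snippet):

l.backward()
print('epoch {}, weight grad {}, bias grad {}'.format(
    epoch, model.lin.weight.grad, model.lin.bias.grad))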

The manual approach was off from autograd by a factor of 4, as you said. But the scaling problem was only in the manual approach, not in autograd, and I expected the nn.Linear approach to match the autograd approach no matter what mistake I made in the manual one.

Either way, I figured it out. nn.Linear includes a bias term by default, which slows the optimizer down a little: convergence takes longer because there is an extra parameter to tune.

When I set bias=False in nn.Linear, the nn.Linear approach matches the autograd approach (and the manual approach too, once I fixed it).
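
For reference, the only change needed is the bias flag in the layer definition (a minimal sketch of the bias-free model):

class LinearRegression(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LinearRegression, self).__init__()
        # no bias term, so the weight is the only parameter, matching f = w * x
        self.lin = nn.Linear(input_dim, output_dim, bias=False)

    def forward(self, x):
        return self.lin(x)

model = LinearRegression(input_size, output_size)

Note that with bias=False the unpacking [w, b] = model.parameters() in the print statement also has to change, since the model now has only one parameter.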

Thanks again for your help!