Why is my weight in the linear layer not changing?

import torch
from torch import nn
import matplotlib.pyplot as plt

# known parameters for the synthetic linear data
weight = 0.3
bias = 0.9

# create the dataset: y = weight * X + bias
X = torch.arange(0, 200, 1.0).unsqueeze(dim=1)
y = X * weight + bias

X.shape, y.shape

train_split = int(0.8 * len(X))
X_train = X[:train_split]
y_train = y[:train_split]
X_test = X[train_split:]
y_test = y[train_split:]

len(X_train), len(y_train),len(X_test),len(y_test)

device = "cuda" if torch.cuda.is_available() else "cpu"
device

# X_train_cpu = X_train
# y_train_cpu = y_train
# X_test_cpu = X_test
# y_test_cpu = y_test

X_train = X_train.to(device)
y_train = y_train.to(device)
X_test = X_test.to(device)
y_test = y_test.to(device)

X_train[:10], y_train[:10], X_test[:10], y_test[:10]

class LinearRegressionModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.linear_layer = nn.Linear(in_features = 1, out_features = 1)

  def forward(self, x:torch.Tensor) -> torch.Tensor :
    return self.linear_layer(x)

def plot_prediction(train_data = X_train.cpu(),
                    train_labels = y_train.cpu(),
                    test_data = X_test.cpu(),
                    test_labels = y_test.cpu(),
                    predictions = None):

  """
  plots training data, test data and compares predictions
  """

  plt.figure(figsize=(10, 7))
  plt.scatter(train_data, train_labels, c="b", s = 4, label = "Training data")
  plt.scatter(test_data, test_labels, c="g", s = 4, label = "Test data")

  if predictions is not None:
    plt.scatter(test_data, predictions, c="r", s = 4, label = "predictions")

  plt.legend(prop={"size" : 14})

plot_prediction()

torch.manual_seed(42)

linear_model = LinearRegressionModel()
linear_model.to(device)
next(linear_model.parameters()).device

linear_model.state_dict()

list(linear_model.parameters())

loss_fn = nn.L1Loss()
optimizer = torch.optim.SGD(params = linear_model.parameters(), lr = 0.01)
optimizer

epochs = 300

for epoch in range(epochs):
  linear_model.train()

  # forward pass on the training data
  y_preds = linear_model(X_train)

  # compute the training loss
  loss = loss_fn(y_preds, y_train)

  # zero accumulated gradients, backpropagate, then update the parameters
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

  # evaluate on the test split without tracking gradients
  linear_model.eval()
  with torch.inference_mode():
    test_preds = linear_model(X_test)
    test_loss = loss_fn(test_preds, y_test)

  if epoch % 20 == 0:
    print(f"Epoch: {epoch}; Loss: {loss}; Test loss: {test_loss}")

linear_model.state_dict()

with torch.inference_mode():
  y_preds = linear_model(X_test)
y_preds

# y_preds_cpu = y_preds.cpu()
# y_preds_cpu.get_device()

plot_prediction(predictions = y_preds.cpu())

I am new to PyTorch and I tried to build my first training loop. Could anyone tell me why, no matter how much I change the learning rate in the optimizer, my bias changes (checking the state_dict) but the weight always remains the same?
Another peculiar thing is that the test_loss also decreases very slowly across the epochs, and the learning rate does affect the value of the test_loss (and not in a diverging way). Is there some kind of bug here I missed?

Hi Kaiwen!

Your weight does not remain the same – it jumps back and forth between two
values that don’t change.

Because you are using L1Loss (together with the fact that your model is
very simple), your optimum – the parameter value for which your loss is
at its minimum – is at the bottom of a V-shaped loss function. Then, because
you are using a rather large learning rate – which is the step size used by
the optimizer – you simply jump back and forth across this V, so after two
steps you’ve come back to where you were before. (This is for the weight;
the “scale” of the V that the bias sees, relative to the learning rate, is
different, so the bias does not start out jumping back and forth across the
V, and the bias value does make progress as you train.)
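
You can see this directly by logging the weight and bias every step. Below
is a minimal diagnostic sketch that reuses your linear_model, loss_fn,
optimizer, X_train and y_train (so it assumes your code above has already
run). With L1Loss and lr = 0.01 you should see the weight alternate between
two fixed values while the bias keeps drifting:

# diagnostic sketch: log the raw parameter values for a few steps
for step in range(6):
  y_preds = linear_model(X_train)
  loss = loss_fn(y_preds, y_train)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  # .item() pulls the scalar out of the 1-element weight / bias tensors
  print(step,
        linear_model.linear_layer.weight.item(),
        linear_model.linear_layer.bias.item())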

First try lowering your learning rate to lr = 0.0001. Then try changing your
loss function to loss_fn = nn.MSELoss() – this changes the shape of the
loss-function well from a V to a parabola. Finally try adding momentum to
your SGD optimizer, say, momentum = 0.95.
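
For example (just a sketch of those three changes; keep the rest of your
code the same):

# parabola-shaped loss instead of the V of L1Loss
loss_fn = nn.MSELoss()

# smaller step size, plus a running average of gradients via momentum
optimizer = torch.optim.SGD(params = linear_model.parameters(),
                            lr = 0.0001,
                            momentum = 0.95)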

Play around with the learning rate and momentum with both loss functions
to get a feel for how these things can behave. You might also experiment
with the Adam optimizer.
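
If you try Adam, the swap would look something like this; lr = 0.01 here is
just a starting point to experiment with:

# Adam adapts the effective step size per parameter
optimizer = torch.optim.Adam(params = linear_model.parameters(), lr = 0.01)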

Good luck!

K. Frank

Ohhh I see, thanks for the clarification! I always thought that a 0.01 learning rate was already quite low. But may I ask, what is the benefit of using a parabola-shaped loss function over a V-shaped loss function? (My intuition is that the gradient on the parabola drops as it gets closer to the minimum, so the optimizer takes smaller steps as it approaches the minimum.)

And also, what does momentum usually do, and how do we usually set its value? This is quite a new term for me.

Hi Kaiwen!

I’d say that this is correct (although intuition can be a tricky business with
neural networks …).

Note, however, that this can help or hurt, depending on the details.

For the parabola (MSELoss), in comparison with the V (L1Loss), and for a
given learning rate:

If you’re quite far away from the minimum, the gradient can be too large
and training will diverge. If you’re far from the minimum, but not too far,
the gradient will be large, so you take big steps toward the minimum and
train faster. As you get close to the minimum, the gradient gets smaller,
you take smaller steps, and training slows down, perhaps unhelpfully.
But as you get even closer to the minimum, to the point where training with
the V starts jumping back and forth across it and stops making progress,
training with the parabola keeps making progress, even if slowly, despite
its small gradients.
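
You can check this intuition with a toy scalar example (hypothetical, not
your model): for a single parameter w with optimum at 0.3, the L1 loss
|w - 0.3| has a gradient of magnitude 1 no matter how close you are, while
the MSE loss (w - 0.3)**2 has gradient 2 * (w - 0.3), which shrinks as w
approaches the optimum:

# toy scalar example comparing gradient magnitudes near the optimum
import torch

target = torch.tensor(0.3)
for w_val in [1.3, 0.5, 0.31]:
  w = torch.tensor(w_val, requires_grad = True)
  l1 = torch.abs(w - target)      # V-shaped (L1) loss
  mse = (w - target) ** 2         # parabola-shaped (MSE) loss
  g_l1, = torch.autograd.grad(l1, w)
  g_mse, = torch.autograd.grad(mse, w)
  print(f"w = {w_val}: L1 grad = {g_l1.item():.2f}, MSE grad = {g_mse.item():.2f}")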

Momentum keeps a running average of previously computed gradients and
takes a step opposite the direction (that is, downhill) of that average gradient,
rather than just using the current gradient. This can help smooth out things
like jumping back and forth across a valley with steep sides and help you
follow the floor of the valley more gently downhill. Also, it will help to average
away noise, for example batch-to-batch noise caused by a small batch size.
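
Roughly, the update that torch.optim.SGD performs when you give it a
momentum value looks like this (a simplified sketch that ignores options
such as dampening, weight decay and Nesterov momentum):

# simplified sketch of one SGD-with-momentum update for a single parameter
def sgd_momentum_step(param, grad, velocity, lr = 0.0001, momentum = 0.95):
  velocity = momentum * velocity + grad   # running average of past gradients
  param = param - lr * velocity           # step opposite the averaged gradient
  return param, velocity

# gradients that keep flipping sign (jumping across a steep valley) partly
# cancel out in the velocity, so the parameter moves much more gently
param, velocity = 1.0, 0.0
for grad in [1.0, -1.0, 1.0, -1.0]:
  param, velocity = sgd_momentum_step(param, grad, velocity)
  print(param, velocity)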

As with most everything with neural networks, you would set your momentum
value empirically. (That’s just a fancy way of saying try out a bunch of values
and use the one that works best for your problem.)

Best.

K. Frank

Ah, thanks for the explanation. Appreciate it!