Why is .grad None?

I am trying to update the weights manually without using an optimizer, but somehow the .grad attributes of the model weights are None…

class TwoLayerNet(torch.nn.Module):
    def __init__(self):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(784, 128)
        self.linear2 = torch.nn.Linear(128, 128)
        self.linear3 = torch.nn.Linear(128, 10)
        self.a1_relu = None
        self.a2_relu = None

    def forward(self, k):
        self.a1_relu = self.linear1(k).clamp(min=0)
        self.a2_relu = self.linear2(self.a1_relu).clamp(min=0)
        emb = self.linear3(self.a2_relu)
        return emb

model = TwoLayerNet()
criterion = torch.nn.MSELoss(reduction='sum')
x_train, y_train, x_val, y_val = get_data.get_mnist()
minibatch_size = 8
epochs = 100
av_loss = 0
ll = list(range(0, x_train.shape[0], minibatch_size))
learning_rate = 0.5

for e in range(epochs):
    for i in ll:
        k = x_train[i:i + minibatch_size]
        y = y_train[i:i + minibatch_size]
        k = torch.tensor(k, device="cuda", dtype=torch.float32, requires_grad=True)
        y = torch.tensor(y, device="cuda", dtype=torch.float32, requires_grad=True)

        y_pred = model.forward(k)
        y_pred = torch.argmax(y_pred, dim=1)
        loss = criterion(y_pred, y)
        av_loss += loss.item()

        with torch.no_grad():
            model.linear1.weight -= learning_rate * model.linear1.weight.grad
            model.linear2.weight -= learning_rate * model.linear2.weight.grad
            model.linear3.weight -= learning_rate * model.linear3.weight.grad

Here the .grad values of the weights are None. Why is that? Is there a problem with the way I compute the loss?
Thank you!
P.S. I am updating them manually on purpose; I know about optimizers.

torch.argmax will break the computation graph here, since it returns integer indices and is not differentiable:

y_pred = torch.argmax(y_pred, dim=1)
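
To see the effect in isolation, here is a minimal sketch on the CPU. The tiny `linear` layer and the random input `x` are placeholders made up for illustration, not the model from the question. It also shows that `.grad` stays None until `backward()` has been called on something that is still attached to the graph; the loop as posted does not show a `loss.backward()` call either.

```python
import torch

torch.manual_seed(0)
linear = torch.nn.Linear(4, 3)   # placeholder layer, not the model from the question
x = torch.randn(2, 4)            # placeholder input batch

logits = linear(x)                   # float output, attached to the graph
pred = torch.argmax(logits, dim=1)   # integer class indices, detached from the graph

print(logits.requires_grad, logits.grad_fn is not None)
print(pred.dtype, pred.requires_grad, pred.grad_fn)

grad_before = linear.weight.grad     # None: backward() has not run yet
logits.sum().backward()              # backpropagate through a differentiable result
grad_after = linear.weight.grad      # now a tensor with the same shape as the weight
print(grad_before, grad_after.shape)
```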

Could you explain your use case a bit?

Oh, y_pred is the output of the last layer. So the predicted digit would correspond to an index, right?
The y vector holds the actual digits [0, 2, 3, 5, 2, 4, 9, 8], while y_pred has shape (8, 10). So should I convert y_pred to the same digit representation by taking the index of the max entry? What is the right way of handling it? Should I just convert y to a one-hot encoding instead?

If you are dealing with a multi-class classification problem, you could use nn.CrossEntropyLoss as the criterion.
I assume your model outputs logits in the shape [batch_size, nb_classes]. If that's the case, you can pass them directly to the criterion together with a target tensor containing the class indices in the shape [batch_size].
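
A rough sketch of what that could look like with a manual weight update. The small `Sequential` model, the random batch `x`, and the learning rate are stand-ins chosen for illustration; only the shapes and the example digits follow the thread.

```python
import torch

torch.manual_seed(0)
# Stand-in model; the real one would be the TwoLayerNet from the question.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)
criterion = torch.nn.CrossEntropyLoss()
learning_rate = 0.1  # assumed value for this sketch

x = torch.randn(8, 784)                      # placeholder batch, [batch_size, 784]
y = torch.tensor([0, 2, 3, 5, 2, 4, 9, 8])   # class indices, [batch_size], dtype int64

logits = model(x)            # [batch_size, nb_classes], no argmax applied
loss = criterion(logits, y)  # targets are class indices and need no gradient
loss.backward()              # populates .grad on every parameter

grads_ok = all(p.grad is not None for p in model.parameters())

with torch.no_grad():
    for p in model.parameters():
        p -= learning_rate * p.grad   # manual gradient descent step
        p.grad.zero_()                # reset, otherwise gradients accumulate
```

torch.argmax is still fine for computing accuracy afterwards; it just should not sit between the model output and the loss.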


Oh, so MSE loss is not correct for multi-class classification, even if I made y one-hot encoded instead?

You could use this approach if you apply softmax to the output of your model to match the one-hot encoded targets.
It should run without errors, but I'm skeptical that it's the best approach for multi-class classification, since you would usually use cross entropy instead of MSE loss.
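
For completeness, a minimal sketch of the softmax plus MSE variant. The random `logits` tensor stands in for the model output and is an assumption of this example; the targets reuse the digits mentioned earlier in the thread.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10, requires_grad=True)   # stand-in for the model output
y = torch.tensor([0, 2, 3, 5, 2, 4, 9, 8])        # class indices

y_one_hot = F.one_hot(y, num_classes=10).float()  # [8, 10], float to match the predictions
probs = torch.softmax(logits, dim=1)              # each row sums to 1

criterion = torch.nn.MSELoss(reduction='sum')
loss = criterion(probs, y_one_hot)
loss.backward()                                   # gradients flow back through the softmax

print(logits.grad.shape)
```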
