Grad is None confusion in the "What is torch.nn really?" tutorial

I’m following along with the "What is torch.nn really?" tutorial. There’s a setup part near the beginning where they define:

import math
import torch

weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()

I thought I might as well just rewrite that as:

weights = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)
assert weights.requires_grad == True

Everything that followed was fine until this loop:

lr = 0.5  # learning rate
epochs = 2  # how many epochs to train for

for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        with torch.no_grad():
            weights -= weights.grad * lr
            bias -= bias.grad * lr
            weights.grad.zero_()
            bias.grad.zero_()

where I got:

TypeError                                 Traceback (most recent call last)
<ipython-input-55-24454d388081> in <cell line: 0>()
     21         with torch.no_grad():
     22             print((weights.requires_grad, bias.requires_grad))
---> 23             weights -= weights.grad * lr
     24             bias -= bias.grad * lr
     25             weights.grad.zero_()

TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

Turns out that when I use the original definition:

weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()

the error is avoided, but I can’t figure out why. I know it has something to do with the division by math.sqrt(784), because this also works:

weights = torch.randn(784, 10, requires_grad=True) # No division

Thanks in advance for any feedback.

Division is a differentiable operation, so this weights is not a leaf tensor anymore (the leaf is the tensor created by torch.randn; your weights is the output of the division). If you try to access its .grad attribute you will see a warning explaining that you are trying to access the gradient of a non-leaf tensor.
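
A quick way to see the difference is to check is_leaf and grad_fn on both versions. Something like this sketch (shapes taken from the tutorial, variable names just for illustration) should reproduce what you are seeing:

import math
import torch

# Tutorial version: requires_grad is enabled on the already-scaled tensor,
# so it stays a leaf of the autograd graph.
leaf = torch.randn(784, 10) / math.sqrt(784)
leaf.requires_grad_()
print(leaf.is_leaf, leaf.grad_fn)        # True None

# One-liner version: randn creates the leaf, and the division produces a
# new tensor whose grad_fn records the operation.
nonleaf = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)
print(nonleaf.is_leaf, nonleaf.grad_fn)  # False <DivBackward0 ...>

# backward() only populates .grad on leaves that require grad.
leaf.sum().backward()
nonleaf.sum().backward()
print(leaf.grad is None)     # False
print(nonleaf.grad is None)  # True, and accessing it raises a UserWarning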


Thanks Piotr, that’s helpful. I see that the weights variable now refers to a non-leaf node, but when we eventually compute a loss, shouldn’t that automatically result in a gradient being computed for every node in the underlying computational graph?

Two things I tried that didn’t work:

# Attempt 1
weights = torch.div(torch.randn(784, 10), math.sqrt(784), requires_grad=True)
#=> TypeError: div() received an invalid combination of arguments ...

and

# Attempt 2
weights = torch.randn(784, 10, requires_grad=True)
weights.div_(math.sqrt(784))
#=> RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

I feel like, since the node is differentiable, there should be a gradient available on it regardless of whether it’s a leaf or not?

I’m not at my workstation, but a warning should have been raised explaining how to access gradients of non-leaf tensors via .retain_grad(). Search for that warning and don’t ignore it. Gradients do flow through every node of the graph during backward(), but by default they are only stored in the .grad attribute of leaf tensors; for intermediate (non-leaf) tensors you have to call .retain_grad() before backward() if you want .grad populated.
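
From memory (I can’t test this right now), a minimal sketch of both options; the loss below is just a stand-in for your real loss_func(model(xb), yb), and the names are only for the sketch:

import math
import torch

# Option 1 (the tutorial's version): the tensor stays a leaf, so its .grad
# is populated automatically after backward().
weights_leaf = torch.randn(784, 10) / math.sqrt(784)
weights_leaf.requires_grad_()

# Option 2 (your one-liner): the result of the division is a non-leaf, so
# you have to opt in to having its gradient stored.
weights = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)
weights.retain_grad()

loss = weights.sum()          # stand-in for the real loss computation
loss.backward()
print(weights.grad is None)   # False: populated because of retain_grad()

That said, the two-line initialization from the tutorial keeps weights a leaf, which is the simpler choice for a manual update loop like yours.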
