Null gradients in torch.nn tutorial

Greetings all! First time poster, apologies for any faux pas etc.

I’ve been systematically working my way through the excellent tutorial here, and have encountered a problem with the gradients. Specifically, the gradients of the weights are all identically zero from the very first step. Since they vanish immediately, this can’t be the usual vanishing-gradients problem (even if this weren’t a single-layer network); it must be some issue with the implementation. Note that the biases do not suffer this problem. The network can still be trained, but only the biases change: the gradients of the weights remain zero at every step.

One can reproduce the issue by downloading and running the corresponding notebook. I’m referring specifically to the first half, where the author implements the NN without using torch.nn. In place of running multiple epochs, one can simply do

xb = x_train[:64]    # single mini-batch from the tutorial's training set
yb = y_train[:64]

pred = model(xb)              # forward pass with the tutorial's model()
loss = loss_func(pred, yb)
loss.backward()               # compute gradients

print(weights.grad)

The output, at least on my local machine (CPU), is all zeros. In particular, the problem is not the subsequent call to weights.grad.zero_() (again, the biases do not have this problem).

In my attempts to debug the issue, I’ve tracked the problem to the dot product operation in the model() function. Consider the following cell, which re-creates the issue: xb@weights returns null gradients, while 2*weights behaves as expected:

weights = torch.randn(784, 10) / math.sqrt(784)    # fresh weight initialization
weights.requires_grad_()

xb = x_train[:64]    # mini-batch of 64, with x_train loaded as in the tutorial

prod = xb @ weights     # this yields null gradients...
# prod = 2*weights      # but this yields correct gradients! (in this case, 2)

iden = torch.ones(prod.shape[0], prod.shape[1])
prod.backward(iden)
print(weights.grad)

I do not understand this behaviour; in particular, prod has requires_grad=True regardless of which of the two lines we use. So my questions are: (1) why are the gradients vanishing with @, and (2) is this intentional, or an oversight in the online example?

Many thanks, and apologies for my naivety!

Hi,

This would be weird indeed.
I cannot reproduce your observation; running the following code returns gradients in both cases:

import torch
import math

weights = torch.randn(784, 10) / math.sqrt(784)    # fresh weight initialization
weights.requires_grad_()

xb = torch.randn(64, 784)

prod = xb @ weights     # this yields null gradients...
# prod = 2*weights      # but this yields correct gradients! (in this case, 2)

iden = torch.ones(prod.shape[0], prod.shape[1])
prod.backward(iden)
print(weights.grad)

Hi albanD,

Indeed, very strange! Your example works as expected. But what happens if you replace xb with the training data? That is, construct x_train as in the tutorial (self-contained cell below). In my case this returns null gradients; can you reproduce this?

import torch
import pickle
import gzip
import math

PATH_TO_MNIST = '/full/path/to/MNIST/data/'    # replace with path on local machine!
FILENAME = 'mnist.pkl.gz'

# open gzip file in mode for reading binary (`rb`) data:
with gzip.open(PATH_TO_MNIST + FILENAME, 'rb') as file:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(file, encoding="latin-1")
    
# convert from numpy array to PyTorch tensor:
x_train, y_train, x_valid, y_valid = map(torch.from_numpy, (x_train, y_train, x_valid, y_valid))

# initialize weights:
weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()

xb = x_train[:64]    # mini-batch of size 64

prod = xb @ weights

iden = torch.ones(prod.shape[0], prod.shape[1])
prod.backward(iden)
print(weights.grad)

Could you print the full type of xb, please?
I tried different types but couldn’t reproduce this… It either raises an error because of a type mismatch, or it returns proper gradients.
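
For instance (just a convenience one-liner, assuming xb is the tensor from your snippet), something like this would show the relevant information:

print(type(xb), xb.dtype, xb.shape, xb.layout)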

Ah! You are correct: contrary to my initial impression, the gradients are not all null. The portion of the gradient tensor shown by print() (which truncates large tensors) is filled with zeros – unsurprisingly, since the mini-batch images are sparse (mostly zeros) – but there are indeed some non-zero elements buried in the middle of it. The first non-zero gradient turns out to be weights.grad[71]. Thanks for the sanity check!

(I confess this is a bit mysterious to me: the first non-zero element of the mini-batch is xb[0,152]. I could not reproduce, with pen and paper, why the corresponding gradient appears to be weights.grad[97]. But this is technically a separate question).
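
(For reference, one way to locate the non-zero entries without scrolling through the printed tensor – just a convenience snippet, assuming weights.grad has been populated by the backward call above:)

# indices of rows of weights.grad containing at least one non-zero entry
nonzero_rows = (weights.grad != 0).any(dim=1).nonzero()
print(nonzero_rows[:10])    # first few such rows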

Hi,

This most likely happens because of the sparsity of the inputs: a row of the weight gradient can only be non-zero if the corresponding pixel is non-zero somewhere in the mini-batch.
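
A quick way to check this – just a sketch, reusing xb, weights and iden from your cell, and assuming backward() was called exactly once so nothing has accumulated in weights.grad: for prod = xb @ weights with an all-ones upstream gradient, the weight gradient is xb.t() @ iden, i.e. row i is the sum of pixel i over the mini-batch, so it is zero exactly when that pixel is zero in every image of the batch (not only in xb[0]).

# closed-form gradient of prod = xb @ weights under an all-ones upstream gradient
expected = xb.t() @ iden
print(torch.allclose(weights.grad, expected))       # True for a single backward call

# a gradient row is non-zero exactly when that pixel is non-zero
# somewhere in the mini-batch (pixel values are non-negative here)
pixel_seen = (xb != 0).any(dim=0)                   # shape [784]
grad_row_nonzero = (weights.grad != 0).any(dim=1)   # shape [784]
print(torch.equal(pixel_seen, grad_row_nonzero))    # True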

Note that if you are curious about a given Tensor’s gradient in a more complex example, you can use a hook to print it. For example, here you can do prod.register_hook(print) and it will print the gradient of prod when it is computed. You can do the same with any other Tensor involved in the backward pass.
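
A minimal, self-contained sketch of that idea on a toy tensor (standing in for prod):

import torch

x = torch.randn(3, 4, requires_grad=True)
y = 2 * x
y.register_hook(print)    # prints the gradient flowing into y when backward runs
y.sum().backward()        # the hook prints a 3x4 tensor of ones here
print(x.grad)             # a 3x4 tensor of twos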