Greetings all! First time poster, apologies for any faux pas etc.
I’ve been systematically working my way through the excellent tutorial here, and have run into a problem with the gradients. Specifically, the gradients of the weights are all identically zero from the very first step. Since they vanish immediately (and even if this weren’t a single-layer network), this can’t be the usual vanishing-gradient problem; it must be some issue with the implementation. Note that the biases do not suffer from this: the network can still be trained, but only the biases change, while the weight gradients remain zero throughout training.
One can reproduce the issue by downloading and running the corresponding notebook. I’m referring specifically to the first half, where the author implements the NN without using torch.nn. In place of running multiple epochs, one can simply do
xb = x_train[:64]
yb = y_train[:64]
pred = model(xb)
loss = loss_func(pred, yb)
loss.backward()
print(weights.grad)
The output, at least on my local machine (CPU), is a tensor of all zeros. In particular, the problem is not the subsequent in-place call to weights.grad.zero_() (again, the biases do not have this problem).
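One caveat worth checking: PyTorch abbreviates large tensors when printing, showing only the corner entries, so print(weights.grad) alone can be misleading. Here is a minimal sketch of a numerical check (with a hypothetical stand-in weights tensor so it runs on its own; in the notebook you would run only the last two lines on the real weights):

```python
import torch

# Hypothetical stand-in for the notebook's weights tensor, just so this
# cell is self-contained; summing and calling backward() gives a gradient
# of all ones.
weights = torch.randn(784, 10, requires_grad=True)
weights.sum().backward()

# The actual check: distinguish "gradient is truly all zeros" from
# "only the printed corner entries are zero".
print(weights.grad.abs().max())         # largest absolute gradient entry
print(int((weights.grad != 0).sum()))   # number of nonzero entries
```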
In my attempts to debug the issue, I’ve tracked the problem down to the matrix product in the model() function. Consider the following cell, which reproduces the issue: xb @ weights yields zero gradients, while 2*weights behaves as expected:
import math
import torch

weights = torch.randn(784, 10) / math.sqrt(784)  # fresh weight initialization
weights.requires_grad_()
xb = x_train[:64]
prod = xb @ weights # this yields null gradients...
#prod = 2*weights # but this yields correct gradients! (in this case, 2)
grad_out = torch.ones_like(prod)  # all-ones upstream gradient, same shape as prod
prod.backward(grad_out)
print(weights.grad)
I do not understand this behaviour; in particular, prod.requires_grad is True regardless of which of the two lines we use. So, my questions are: (1) why do the gradients vanish with @, and (2) is this intentional, or an oversight in the online example?
Many thanks, and apologies for my naivety!