For the weights, we set requires_grad after the initialization, since we don't want that step included in the gradient. (Note that a trailing _ in PyTorch signifies that the operation is performed in-place.)

I found this in the following link: Pytorch nn tutorial.
But why is requires_grad not set when the weights are created, the way it is for the biases?
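For context, the pattern in question looks roughly like this (paraphrased from memory of the tutorial; the layer sizes are just illustrative):

```python
import math
import torch

# Weights: initialize first (scaled by 1/sqrt(fan_in)), then turn on
# gradient tracking in-place with the trailing-underscore method.
weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()

# Bias: created directly with requires_grad=True, since no further
# initialization step follows the creation.
bias = torch.zeros(10, requires_grad=True)
```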
I think the reason is in the sentence you linked: "we don't want that step (I guess the division by math.sqrt in that example) included in the gradient".
Initialization just fills in the starting values of the parameters; it should not be part of the graph that autograd uses when computing derivatives of the net.
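A small sketch of what goes wrong otherwise: if requires_grad is set before the scaling step, autograd records the division, and the resulting tensor is no longer a leaf, so gradients would not accumulate in its .grad attribute after backward() (the shapes here are illustrative, not from the original post):

```python
import math
import torch

# Tutorial's order: initialize, then enable grad tracking in-place.
# The division happens before autograd is watching, so w_good stays
# a leaf tensor and w_good.grad will be populated by backward().
w_good = torch.randn(784, 10) / math.sqrt(784)
w_good.requires_grad_()

# Reversed order: the division is now an autograd-tracked operation,
# so w_bad is the *output* of that op (it has a grad_fn) rather than
# a leaf parameter.
w_bad = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)

print(w_good.is_leaf)  # True
print(w_bad.is_leaf)   # False
```

In other words, setting requires_grad after initialization keeps the weight tensor itself as the leaf that the optimizer updates, instead of an intermediate result of the scaling.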