I have a custom Module with several parameters. One of them is a ModuleDict() holding two Linear modules under the keys ‘a’ and ‘b’. Only one of them should be used in the forward pass for any given minibatch; the minibatch determines which key is used. I might be seriously misunderstanding PyTorch and NNs in general here, but isn’t it the case that if a parameter is not used at all in the forward pass, its gradient (param.grad.data.sum()) should be zero AND its value should not change for that minibatch? The gradient is indeed zero, yet the weights and biases of both Linear modules change every minibatch, when only one of them should change.
A side question: is there a good way to visualize the history of a variable while debugging in Python? Preferably not inside a Jupyter notebook.
You’re not using an optimizer with momentum by chance?
For those, the momentum will cause updates even when the gradients are zero.
(It’s also doing funny things to the statistics, probably, but with dropout we rarely think about it too much.)
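To see the effect in isolation, here is a minimal sketch, assuming Adam with default betas and a single toy Linear layer (the layer sizes and learning rate are made up for illustration). After one real step, a second step with all-zero gradients still moves the weights, because Adam's running average of past gradients is nonzero.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(3, 2)  # toy layer, sizes chosen arbitrarily
opt = torch.optim.Adam(layer.parameters(), lr=0.1)

# Step 1: a real gradient, which fills Adam's running averages.
layer(torch.randn(4, 3)).sum().backward()
opt.step()

# Step 2: keep the .grad tensors but fill them with zeros (not None).
opt.zero_grad(set_to_none=False)
w_before = layer.weight.detach().clone()
b_before = layer.bias.detach().clone()
opt.step()

weight_moved = not torch.equal(w_before, layer.weight.detach())
bias_moved = not torch.equal(b_before, layer.bias.detach())
print("weight changed with zero grad:", weight_moved)
print("bias changed with zero grad:", bias_moved)
```

The same check with plain SGD without momentum should leave the parameters untouched, since its update is just the learning rate times the current gradient.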
As a trick you can (at least you could last time I checked) set the gradients to None instead of just zeroing them, and then the optimizer skips those parameters, so they won’t be updated.
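A rough sketch of that trick for the ModuleDict case, assuming a PyTorch version where `zero_grad` accepts `set_to_none` (the `heads` name, the `train_step` helper and the layer sizes are hypothetical). With the gradients reset to None before each step, the unused branch never receives a gradient tensor, and Adam leaves it alone even though it already has momentum state.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical stand-in for the ModuleDict described in the question.
heads = nn.ModuleDict({"a": nn.Linear(3, 2), "b": nn.Linear(3, 2)})
opt = torch.optim.Adam(heads.parameters(), lr=0.1)

def train_step(key):
    # Reset every .grad to None rather than to a tensor of zeros.
    opt.zero_grad(set_to_none=True)
    heads[key](torch.randn(4, 3)).sum().backward()
    opt.step()  # parameters whose .grad is None are skipped

train_step("b")  # gives "b" some Adam state first
b_before = heads["b"].weight.detach().clone()

train_step("a")  # "b" is unused here, so its .grad stays None
b_grad_is_none = heads["b"].weight.grad is None
b_unchanged = torch.equal(b_before, heads["b"].weight.detach())
print("b grad is None:", b_grad_is_none, "| b unchanged:", b_unchanged)
```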
Hi @tom, thanks for replying. I am in fact zeroing out part of the gradient and that is not a problem. The problem is that even though that part of the gradient is zero, the corresponding parameters still get updated because of Adam’s momentum. For that reason, I was trying to set part of the gradient to None instead of 0 but it throws the error I’ve mentioned.
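A parameter's `.grad` is either a whole tensor or None, so None cannot be assigned to just a slice of it. One possible workaround, sketched here under assumptions and not something proposed in the thread, is to snapshot the entries that should stay fixed before `optimizer.step()` and copy them back afterwards. The `frozen` mask and the toy `weight` are hypothetical.

```python
import torch

torch.manual_seed(0)
weight = torch.nn.Parameter(torch.randn(4, 3))  # toy parameter
opt = torch.optim.Adam([weight], lr=0.1)

# Warm up Adam's state with a full gradient.
weight.sum().backward()
opt.step()
opt.zero_grad()

# Hypothetical mask: rows 0 and 1 must not move on this step.
frozen = torch.zeros(4, dtype=torch.bool)
frozen[:2] = True

(weight ** 2).sum().backward()
weight.grad[frozen] = 0.0  # zero part of the gradient, as in the post

before = weight.detach().clone()
opt.step()  # momentum would still nudge the frozen rows
with torch.no_grad():
    weight[frozen] = before[frozen]  # undo the update on the frozen rows

frozen_kept = torch.equal(weight.detach()[frozen], before[frozen])
others_moved = not torch.equal(weight.detach()[~frozen], before[~frozen])
print("frozen rows kept:", frozen_kept, "| other rows moved:", others_moved)
```

Note that Adam's internal averages for the frozen entries still decay during such a step; only the parameter values are restored. Splitting the tensor into separate parameters, so that whole `.grad` tensors can be None, would be another option.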