Parameters with zero .grad change in value - how to exclude some parameters from backpropagation update

I have a custom Module with certain parameters. One of them is a ModuleDict() that holds two Linear modules under the keys ‘a’ and ‘b’. Depending on the training example (one minibatch), only one of these modules should be used per minibatch in the forward pass; the key for the ModuleDict is decided by the minibatch. I might be seriously misunderstanding PyTorch and NNs in general here, but isn’t it the case that if a certain parameter is not used at all in the forward pass, its gradient should be zero AND its value should not change at all for that minibatch? The gradient is indeed zero; however, the values of both sets of parameters (in this case, the weights and biases of the Linear modules) change every minibatch, when only one of them should change.

A side question - Is there a good way to visualize the history of a variable while debugging in Python? Preferably not inside a Jupyter notebook.

You’re not using an optimizer with momentum by chance?
For those, momentum will cause updates even when the gradients are zero.
(It’s probably also doing funny things to the statistics, but with dropout we rarely think about that too much.)
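The effect is easy to demonstrate with a minimal sketch (parameter names and sizes here are just for illustration): after one real step, Adam’s momentum buffers are non-zero, so a later step with an all-zero gradient still moves the parameter.

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.Adam([p], lr=0.1)

p.sum().backward()                 # grad is all ones
opt.step()                         # builds Adam's exp_avg / exp_avg_sq buffers
opt.zero_grad(set_to_none=False)   # grad is now an all-zero tensor, not None

before = p.detach().clone()
opt.step()                         # gradient is zero, but momentum still moves p
assert not torch.equal(p.detach(), before)
```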

As a trick, you can (at least you could last time I checked) set the gradients to None instead of just zeroing them, and then the parameters won’t be updated.
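A small sketch of that trick (module names are made up for the example): optimizers skip any parameter whose `.grad` is `None`, so the unused module keeps its exact values even though Adam has momentum for it.

```python
import torch

lin_a = torch.nn.Linear(4, 4)
lin_b = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(list(lin_a.parameters()) + list(lin_b.parameters()), lr=0.1)

x = torch.randn(2, 4)

# First minibatch: both modules are used, so Adam builds momentum for both.
(lin_a(x) + lin_b(x)).sum().backward()
opt.step()
opt.zero_grad()

# Second minibatch: only lin_a is used.  With zeroed grads Adam's momentum
# would still move lin_b; setting its grads to None makes step() skip it.
lin_a(x).sum().backward()
for p in lin_b.parameters():
    p.grad = None

before = lin_b.weight.detach().clone()
opt.step()
assert torch.equal(lin_b.weight.detach(), before)
```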

Some people like TensorboardX.

Best regards



You’re exactly right, I’m using Adam. I checked the update rule, and it’s the momentum vectors that keep the parameters updating even with zero gradients. I should give this more thought. Thank you!


Hi @tom, this solution doesn’t seem to work for me. I only want to make certain parts of the gradient None. Is that possible?

I don’t want to update parameters where the grad is zero, so I try to make them None like this


but I’m getting the error `can't assign a NoneType to a torch.cuda.FloatTensor`

You can zero part of the gradients if that’s what you need. Note, however, that some optimizers might react to this (anything with weight decay, anything taking tensor norms).
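Zeroing part of a gradient is just an in-place masked assignment on the `.grad` tensor; `.grad` itself can only be a whole tensor or `None`, which is why assigning `None` to a slice of it fails. A minimal sketch (mask and shapes are made up for the example):

```python
import torch

lin = torch.nn.Linear(4, 4)
x = torch.ones(2, 4)
lin(x).sum().backward()

# Zero the gradient only for the first two output rows of the weight.
mask = torch.zeros_like(lin.weight, dtype=torch.bool)
mask[:2] = True
lin.weight.grad[mask] = 0.0

assert (lin.weight.grad[:2] == 0).all()
assert lin.weight.grad[2:].abs().sum() > 0  # the rest is untouched
```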

Best regards


Hi @tom, thanks for replying. I am in fact zeroing out part of the gradient and that is not a problem. The problem is that even though that part of the gradient is zero, the corresponding parameters still get updated because of Adam’s momentum. For that reason, I was trying to set part of the gradient to None instead of 0 but it throws the error I’ve mentioned.

Do you know how to solve this?
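One possible workaround, not from the thread but a sketch under the assumption that restoring values after the step is acceptable: since `.grad` cannot be partially `None`, snapshot the frozen slice of the parameter before `optimizer.step()` and copy it back afterwards, undoing whatever Adam’s momentum did to it.

```python
import torch

lin = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(lin.parameters(), lr=0.1)

# A first full step builds Adam's momentum buffers.
x = torch.ones(2, 4)
lin(x).sum().backward()
opt.step()
opt.zero_grad()

# Suppose rows 0-1 of the weight should stay frozen this minibatch.
mask = torch.zeros_like(lin.weight, dtype=torch.bool)
mask[:2] = True

lin(x).sum().backward()
lin.weight.grad[mask] = 0.0              # zero the unwanted part of the grad

with torch.no_grad():
    frozen = lin.weight[mask].clone()    # snapshot before the step
    opt.step()                           # momentum still moves the masked rows...
    lin.weight[mask] = frozen            # ...so copy the old values back

assert torch.equal(lin.weight.detach()[:2], frozen.view(2, 4))
```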