I have a set of networks, only some of which are used in a given forward pass, so only their weights should be updated when the optimiser steps after backward().
The caveat is that which networks are selected for the forward pass depends on the input, so when initialising the optimiser I pass the parameters of all the networks.
This creates a problem with an optimiser like Adam, which keeps running averages and will update parameters even when their gradients are 0.
For example - if N1 & N2 are used for the first input, their grads are populated. If for the next input N2 & N3 are used, then naively taking an optimiser step after zero_grad() won't prevent updates to N1, since its gradients are 0, not None.
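Here is a minimal sketch of the problem. The sub-networks net1/net2/net3 and the data are placeholders for my actual setup, and I'm passing set_to_none=False explicitly (available in recent PyTorch versions) to make the zero-grad behaviour I'm describing explicit:

```python
import torch
import torch.nn as nn

net1, net2, net3 = (nn.Linear(4, 4) for _ in range(3))
opt = torch.optim.Adam(
    list(net1.parameters()) + list(net2.parameters()) + list(net3.parameters()),
    lr=1e-3,
)

# Step 1: only net1 and net2 participate, so only their grads become non-None.
x = torch.randn(8, 4)
loss = net2(net1(x)).sum()
loss.backward()
opt.step()  # net3 is skipped because its grads are still None

# Step 2: only net2 and net3 participate.
opt.zero_grad(set_to_none=False)  # existing grads become 0, not None
loss = net3(net2(x)).sum()
loss.backward()

before = net1.weight.detach().clone()
opt.step()
# net1's grads are 0, not None, so Adam's running averages still move its weights.
print(torch.allclose(before, net1.weight))  # typically False
```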
In Adam's code, a parameter is skipped only if its grad is None, which is as expected. But to solve the issue above, I believe it would be useful to have something like a None_grad function for the optimiser and networks.
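Something along these lines is what I have in mind; the name none_grad is hypothetical, and newer PyTorch versions seem to offer the same effect via optimizer.zero_grad(set_to_none=True):

```python
def none_grad(optimizer):
    """Set every parameter's grad to None so the next optimizer.step() skips it."""
    for group in optimizer.param_groups:
        for p in group["params"]:
            p.grad = None

# usage: call none_grad(opt) instead of opt.zero_grad() between inputs,
# then backward() repopulates grads only for the networks actually used.
```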
Suggestions for any alternative methods to do this task are welcome.