I recently came across the unused-parameters error (RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.) when training with DDP. In my setup I need to train one network A for some number of iterations N, after which I train network B for some number of iterations M. Network A is used to generate some of the data used to train network B. The process of training networks A and B is then repeated.
So, when training network B I am not using network A of my model (hence the RuntimeError). In this case, what is the advised setup for training networks A and B, without needing to specify find_unused_parameters=True, to avoid its additional computational overhead?
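For context, here is a minimal sketch of the kind of setup I mean (class and layer names are just placeholders, and a process group is assumed to be initialized, e.g. via torchrun):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.net_a = nn.Linear(10, 10)  # stands in for network A
        self.net_b = nn.Linear(10, 1)   # stands in for network B

# One DDP wrapper around both networks.
model = DDP(Model().cuda())

# While training B, only net_b's parameters receive gradients, so DDP
# never finishes the reduction for net_a's parameters and raises
# "Expected to have finished reduction in the prior iteration ...".
```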
Thank you very much in advance!
Are network A and network B in the same model? It seems like they are, and you then do DDP(model). Why not wrap them separately as DDP(modelA) and DDP(modelB)? Also, does the data generated by A need to be backpropagated through after calculating the loss from B?
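Roughly something like this sketch (the networks, data, losses, and iteration counts are placeholders, and a process group is assumed to be initialized already):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model_a = DDP(nn.Linear(10, 10).cuda())  # stands in for network A
model_b = DDP(nn.Linear(10, 1).cuda())   # stands in for network B
opt_a = torch.optim.SGD(model_a.parameters(), lr=1e-3)
opt_b = torch.optim.SGD(model_b.parameters(), lr=1e-3)
criterion = nn.MSELoss()

N, M = 100, 100                          # placeholder iteration counts
batch = torch.randn(32, 10).cuda()       # placeholder data
target_a = torch.randn(32, 10).cuda()
target_b = torch.randn(32, 1).cuda()

# Phase 1: train A. Only model_a is involved in the backward pass, so
# its DDP instance sees gradients for all of its parameters.
for _ in range(N):
    opt_a.zero_grad()
    loss_a = criterion(model_a(batch), target_a)
    loss_a.backward()
    opt_a.step()

# Phase 2: train B. A only generates data under no_grad, so it never
# enters the autograd graph and its DDP instance is never asked to
# reduce anything.
for _ in range(M):
    with torch.no_grad():
        generated = model_a(batch)
    opt_b.zero_grad()
    loss_b = criterion(model_b(generated), target_b)
    loss_b.backward()
    opt_b.step()
```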
Yes, network A and network B are in the same model. So what you're saying is that I should have two separate models, one containing only network A and one containing only network B. Is it right that DDP should then not complain about the parameters of network B (in model_B) if I only call model_A(data)?
Regarding your second point, there is currently no need to backpropagate the gradients (though if this were ever needed, I assume I wouldn't run into this problem as long as I use network A to generate data for network B in every iteration).
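To illustrate that assumption, I imagine the second loop from the sketch above would just drop the no_grad, so that A joins the autograd graph and both DDP instances reduce gradients every iteration (same placeholder names as before):

```python
for _ in range(M):
    generated = model_a(batch)           # no torch.no_grad()/detach()
    loss_b = criterion(model_b(generated), target_b)
    opt_a.zero_grad()
    opt_b.zero_grad()
    loss_b.backward()                    # gradients reach both A and B
    opt_a.step()
    opt_b.step()
```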