I tried to train my model with DDP and ran into a strange problem.
The training process worked really well if DataParallel was used instead of DistributedDataParallel.
When I wrapped the model with DDP, the exception below was raised:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
I have tried setting "find_unused_parameters=True", but then the training process gets stuck. Specifically, the backward pass on the loss never completes.
I have also checked all the parameters in my model with the method proposed in the thread "How to find the unused parameters in network".
I am sure all the parameters have changed after several iterations, but I still get the RuntimeError above.
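A per-iteration check can be more telling than comparing weights after several iterations, because a parameter that is skipped in only some steps will still change over time. The following is a minimal sketch of such a check, run without DDP. `TwoBranchNet` and `params_without_grad` are hypothetical names made up for illustration, not part of PyTorch or of the model discussed here.

```python
import torch
import torch.nn as nn


class TwoBranchNet(nn.Module):
    """Hypothetical toy model whose forward returns a list of tensors."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 8)
        self.branch_a = nn.Linear(8, 2)
        self.branch_b = nn.Linear(8, 2)

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        return [self.branch_a(feat), self.branch_b(feat)]


def params_without_grad(model):
    """Names of parameters that received no gradient in the last backward pass."""
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]


model = TwoBranchNet()
out_a, out_b = model(torch.randn(3, 4))
loss = out_a.sum()  # branch_b is deliberately left out of the loss
loss.backward()

unused = params_without_grad(model)
print(unused)  # parameter names of branch_b
```

Inside a real training loop the gradients would have to be reset to None before each backward pass (for example with `optimizer.zero_grad(set_to_none=True)`), otherwise gradients left over from an earlier step hide the parameters that were skipped in the current one.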
PS: The output of the model is a list of tensors.
I am really confused and wondering whether there are any tips that could help solve this problem.
Could you share source code that reproduces the problem?
One more suggestion. I recently looked at a problem where the loss function didn't depend on all the parameters, and that caused some issues for the gradient computation. What helped was looking at the autograd graph via https://github.com/szagoruyko/pytorchviz
Thanks for your help. When I tried to make a toy example for you to reproduce the problem, I found that the toy example really worked well, which was strange.
I finally found that the problem lies in the loss function. Specifically, the output of the model is not fully used when calculating the loss: for a detection or instance segmentation task, only positive samples contribute to the loss. So if a certain branch has no corresponding ground truth in a certain step, its parameters receive no gradient and the unused-parameters error is thrown.
To handle this problem, simply setting "find_unused_parameters=True" does not seem to work, at least in my case. A simple solution I could find is adding a safety loss, that is, fetching all the outputs of my model and multiplying them by 0. This gives all the parameters a 0 gradient, but it might waste some CUDA memory.
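The safety loss described above could look roughly like the sketch below. It is an illustration under assumptions, not the code actually used in this thread: `TwoBranchNet` and `safety_loss` are hypothetical names, the model is a toy, and the example runs without DDP just to show that every parameter ends up with a gradient.

```python
import torch
import torch.nn as nn


class TwoBranchNet(nn.Module):
    """Hypothetical toy model whose forward returns a list of tensors."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 8)
        self.branch_a = nn.Linear(8, 2)
        self.branch_b = nn.Linear(8, 2)

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        return [self.branch_a(feat), self.branch_b(feat)]


def safety_loss(outputs):
    """Zero-valued term that connects every output tensor to the loss."""
    return sum(out.sum() for out in outputs) * 0.0


model = TwoBranchNet()
outputs = model(torch.randn(3, 4))

# Assume only branch_a has ground truth in this step.
task_loss = outputs[0].sum()
loss = task_loss + safety_loss(outputs)
loss.backward()

# branch_b now has an all-zero gradient instead of no gradient at all.
print(model.branch_b.weight.grad.abs().sum())
```

The loss value itself is unchanged, since the extra term is exactly zero as long as the outputs are finite. If an output contains inf or NaN, multiplying by 0 yields NaN, so the term would need guarding in that case.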
Yep, this is true. That mode in DDP requires full access to all outputs, because it traverses the graph from those outputs to find unused parameters. That's also the reason the error message says: "(2) making sure all `forward` function outputs participate in calculating loss."
A simple solution I could find is adding a safety loss, that is, fetching all the outputs of my model and multiplying them by 0. This gives all the parameters a 0 gradient, but it might waste some CUDA memory.
Another option might be returning only the outputs that participate in computing the loss. Other outputs can be stored in model attributes and retrieved separately after the forward pass.
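A rough sketch of that pattern follows. Everything in it is an assumption made for illustration: the class name `DetectorWithAux`, the attribute name `aux_outputs`, and the choice to detach the stored tensor. Since `branch_b` still gets no gradient here, under DDP this would presumably be combined with `find_unused_parameters=True`, which can then see every returned output in the loss.

```python
import torch
import torch.nn as nn


class DetectorWithAux(nn.Module):
    """Hypothetical model that returns only the loss-relevant output."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 8)
        self.branch_a = nn.Linear(8, 2)
        self.branch_b = nn.Linear(8, 2)
        self.aux_outputs = {}  # filled on every forward pass

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        # Keep the extra output on the module instead of returning it.
        # Detaching is an assumption: it avoids holding on to the graph.
        self.aux_outputs = {"branch_b": self.branch_b(feat).detach()}
        return self.branch_a(feat)


model = DetectorWithAux()
out = model(torch.randn(3, 4))
out.sum().backward()  # every returned output participates in the loss

extra = model.aux_outputs["branch_b"]  # retrieved after the forward pass
print(extra.shape)
```

When the model is wrapped in DistributedDataParallel, the underlying module is reachable as `ddp_model.module`, so the stored outputs would be read from `ddp_model.module.aux_outputs`.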