Process got stuck when set find_unused_parameters=True in DDP

oliver_ss · December 16, 2020, 7:58am

Hi,

Thanks for your help. When I tried to make a toy example for you to reproduce the problem, I found that the toy example worked fine, which was strange.

I finally found that the problem lies in the loss function. Specifically, the output of the model is not fully used when calculating the loss: for detection or instance segmentation tasks, only positive samples contribute to the loss. So if a certain branch has no corresponding ground truth in a given step, its parameters receive no gradient and the unused-parameters error is thrown.
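To make this concrete, here is a minimal single-process sketch (no DDP involved) of how a branch can end up without gradients. The model `TwoHeadModel`, the loss `detection_loss`, and the helper `params_without_grad` are hypothetical names made up for illustration, not my actual code; the point is only that when a batch has no positive samples, the box branch never enters the graph of the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadModel(nn.Module):
    """Hypothetical toy model: shared backbone with two output branches."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 16)
        self.cls_head = nn.Linear(16, 2)
        self.box_head = nn.Linear(16, 4)

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        return self.cls_head(feat), self.box_head(feat)


def detection_loss(cls_out, box_out, labels, box_targets):
    """Classification loss on all samples, box loss on positives only."""
    loss = F.cross_entropy(cls_out, labels)
    pos = labels > 0
    if pos.any():  # skipped entirely when the batch has no positives
        loss = loss + F.smooth_l1_loss(box_out[pos], box_targets[pos])
    return loss


def params_without_grad(model):
    """Names of parameters that received no gradient in the last backward."""
    return [name for name, p in model.named_parameters() if p.grad is None]


torch.manual_seed(0)
model = TwoHeadModel()
x = torch.randn(4, 8)
labels = torch.zeros(4, dtype=torch.long)  # no positive samples in this step
box_targets = torch.zeros(4, 4)

cls_out, box_out = model(x)
detection_loss(cls_out, box_out, labels, box_targets).backward()
missing = params_without_grad(model)
print(missing)
```

In this sketch the parameters of `box_head` are the ones left without a gradient, which is the situation DDP complains about.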

To handle this problem, simply setting “find_unused_parameters=True” does not seem to work, at least in my case. A simple solution I could find is adding a safety loss, that is, fetching all the outputs of my model, multiplying them by 0, and adding the result to the loss. This gives all the parameters a 0 gradient, but it might waste some CUDA memory.
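A rough sketch of the safety loss idea, again on a hypothetical toy model (`TwoHeadModel` and `safety_loss` are illustrative names, and this runs in a single process rather than under DDP): every output is summed, multiplied by 0, and added to the real loss, so every branch takes part in the backward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadModel(nn.Module):
    """Hypothetical toy model: shared backbone with two output branches."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 16)
        self.cls_head = nn.Linear(16, 2)
        self.box_head = nn.Linear(16, 4)

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        return self.cls_head(feat), self.box_head(feat)


def safety_loss(outputs):
    """Zero-valued term that still connects every output to the loss graph."""
    return sum(out.sum() for out in outputs) * 0.0


torch.manual_seed(0)
model = TwoHeadModel()
cls_out, box_out = model(torch.randn(4, 8))
labels = torch.zeros(4, dtype=torch.long)  # no positives, so no box loss

# Real loss only uses the classification branch in this step.
loss = F.cross_entropy(cls_out, labels)
# Adding the safety term leaves the loss value unchanged.
loss = loss + safety_loss((cls_out, box_out))
loss.backward()

all_have_grad = all(p.grad is not None for p in model.parameters())
box_grad_is_zero = bool((model.box_head.weight.grad == 0).all())
print(all_have_grad, box_grad_is_zero)
```

One caveat with this approach: if any output contains `inf` or `nan`, multiplying by 0 yields `nan`, so the safety term would then corrupt the loss.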