Is it possible to encounter the "unused parameters" error in DDP despite all parameters participating in the loss?

I believe this is an error that has been brought up many times already:

```
Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
```

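For completeness, the environment variable mentioned at the end of the message can also be set from Python; I believe it has to be set before the process group is initialized for it to take effect:

```python
import os

# I believe this must be set before torch.distributed.init_process_group()
# is called, otherwise it is ignored.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # or "INFO"
```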
Usually this error means that some parameter does not participate in producing the forward output, or that some part of the forward output does not feed into the loss. However, I have checked every parameter listed in the error (302-319), and every other parameter in the model, after the backward pass: all of them have their `.grad` populated. Somehow a specific subset of parameters is not being reduced by DDP even though autograd does give them gradients.
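For reference, this is the kind of check I ran after the backward pass (a minimal sketch; a toy model stands in for my actual one):

```python
import torch
import torch.nn as nn

# Toy stand-in for my actual model; the loop at the bottom is the real check.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
loss = model(torch.randn(4, 8)).sum()
loss.backward()

for name, p in model.named_parameters():
    if p.requires_grad and p.grad is None:
        print("no grad:", name)
# In my runs this prints nothing: every parameter has .grad populated.
```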

Parameters 302-319 also happen to be exactly the parameters of a projection and prediction MLP head; I am sure this is not a coincidence. I will try to come up with a minimal reproducible example if I can, but for now I would like to know whether the situation above is documented anywhere.
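In case it helps anyone reproduce the mapping from indices to names: as far as I can tell, the indices in the error follow the order of `named_parameters()` on the wrapped module, so a sketch like this identifies them (`model` being the module passed to DDP, and the range being the one from my error):

```python
# Sketch: map the parameter indices from the DDP error to parameter names.
# Assumption: the indices follow the order of model.named_parameters().
for idx, (name, _) in enumerate(model.named_parameters()):
    if 302 <= idx <= 319:
        print(idx, name)
```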

Have you tried passing `find_unused_parameters=True` when wrapping the model?
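Something like this, as a sketch (assuming `model` and `local_rank` are already set up and the process group is initialized):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(
    model.to(local_rank),
    device_ids=[local_rank],
    find_unused_parameters=True,  # lets DDP tolerate params without grads
)
```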

If this is not helpful, please file a GitHub issue with a minimal repro.

I will try this, thank you.