"Expected to have finished reduction" error when dropping layers with DDP

I want to use LayerDrop (randomly dropping layers during training). Unfortunately, when I do so I get an error:
Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. [...]

I thought this could be caused by the fact that I was training with DDP and different workers were dropping different layers, leading to problems when trying to sync gradients. However, I set the seed so that the different workers should drop the same layers at the same time, and I still get the error.
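To illustrate, here is a minimal sketch of the kind of LayerDrop forward pass I'm describing (the module, its dimensions, and the drop probability are placeholders, not my actual model):

```python
import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    def __init__(self, num_layers=6, dim=128, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.p_drop = p_drop

    def forward(self, x):
        for layer in self.layers:
            # Randomly skip a layer during training; with a shared seed,
            # every rank should skip the same layers in the same iteration.
            if self.training and torch.rand(1).item() < self.p_drop:
                continue
            x = torch.relu(layer(x))
        return x
```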

Why is this happening?

Thanks for posting @divinho. From your description, it seems that your model drops a particular layer so that the loss calculation does not use that layer's parameters at all. If you want to train this way with DDP, can you try passing find_unused_parameters=True when initializing DDP?

Currently, find_unused_parameters=True must be passed into torch.nn.parallel.DistributedDataParallel() initialization if there are parameters that may be unused in the forward pass.
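For reference, a minimal sketch of what that initialization could look like (assuming the process group has already been initialized, e.g. by torchrun, and using a stand-in for the layer-dropping model):

```python
import os
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group() has already been called
# (e.g. by torchrun) and that LOCAL_RANK is set by the launcher.
local_rank = int(os.environ["LOCAL_RANK"])

model = nn.Linear(128, 128).to(local_rank)  # stand-in for the layer-dropping model

# find_unused_parameters=True makes DDP traverse the autograd graph after the
# forward pass and mark parameters that did not produce a gradient as ready,
# so gradient reduction can still finish for that iteration.
ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)
```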


Thank you for the suggestion, I will try that!

That solved it, thank you!