PyTorch / PyTorch Lightning DDP

Hi everyone, I’m using the PyTorch Lightning DDPPlugin for training my model on multiple GPUs.
I used this code to initialise the plugin:
plugs = DDPPlugin(find_unused_parameters=True)
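For context, here is roughly how such a plugin gets wired into a Trainer. This is a minimal, self-contained sketch, not your pipeline: the ToyModel, random dataset, and gpus=2 are placeholders I made up.

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.plugins import DDPPlugin

# Placeholder LightningModule -- not the model from the original post.
class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

data = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)

plugs = DDPPlugin(find_unused_parameters=True)
# In Lightning 1.5 the plugin instance can be passed via `strategy`;
# older releases passed it through `plugins=` instead.
trainer = pl.Trainer(gpus=2, strategy=plugs, max_epochs=1)
trainer.fit(ToyModel(), data)
```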

I first set find_unused_parameters to False, but that gave an error asking me to set it to True. With it set to True, I now get an error stating:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.

My PyTorch version is 1.10.0 and my PyTorch Lightning version is 1.5. I cannot change the library versions because I’m working on someone else’s code and it might break the pipeline. Has this issue been solved for this version of PyTorch? I saw some posts here saying it’s still in progress, and I cannot find a solution online.

Did you try calling the _set_static_graph() method as mentioned in the error message?
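If you want to try that from Lightning, one possible route is to subclass the plugin and call the method right after DDP wraps the model. This is only a sketch, not an official API: it assumes the plugin keeps the wrapped DistributedDataParallel module in self._model after configure_ddp(), which is how Lightning 1.5 does it as far as I can tell, and _set_static_graph() itself is a private PyTorch 1.10 method.

```python
from pytorch_lightning.plugins import DDPPlugin

class StaticGraphDDPPlugin(DDPPlugin):
    def configure_ddp(self):
        super().configure_ddp()  # wraps the LightningModule in DistributedDataParallel
        # Private PyTorch 1.10 API on the DDP wrapper: tells DDP that the set of
        # used parameters will not change across iterations.
        self._model._set_static_graph()

plugs = StaticGraphDDPPlugin(find_unused_parameters=True)
```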

Setting find_unused_parameters=True is model-dependent: it should be used when some parts of the model have parameters that do not participate in the forward pass. I assume that this is the case for your model?
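For illustration only (a made-up toy module, not your model), a layout like the one below needs find_unused_parameters=True, because on any given iteration one of the two branches receives no gradient:

```python
import torch
import torch.nn as nn

class ConditionalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch_a = nn.Linear(8, 1)
        self.branch_b = nn.Linear(8, 1)

    def forward(self, x):
        # Only one branch participates in each forward pass, so the other
        # branch's parameters get no gradients on that iteration.
        if x.mean() > 0:
            return self.branch_a(x)
        return self.branch_b(x)
```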

Hi @H-Huang , I solved the problem. Thanks