I am trying to set static_graph=True in DDP (DistributedDataParallel), since I believe my training graph really is static. However, during backward I get the error
RuntimeError: Your training graph has changed in this iteration, e.g., one parameter is unused in first iteration, but then got used in the second iteration. this is not compatible with static_graph set to True.
I would like to see which parameter(s) are causing this.
So far, I've tried enabling distributed debug logging via
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
but the only relevant output I get, each time the model is called, is

[I logger.cpp:377] [Rank 0 / 2] [before iteration 1] Training ActorNetwork unused_parameter_size=0

and unused_parameter_size always equals zero.
I've also tried logging the forward trace on every model call with a global forward hook:

torch.nn.modules.module.register_module_forward_hook(
    lambda module, input, output: print(module)
)

but the traces from the first two model calls are identical (verified with diff).
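In addition, I tried inspecting which parameters never received a gradient after the first backward, since parameters that did not take part in the autograd graph keep .grad == None (this is only reliable for the very first backward, because .grad accumulates afterwards). A minimal sketch of that check, where report_unused_params is just a helper name I made up:

```python
import torch

def report_unused_params(model, tag=""):
    # After a fresh forward + backward (no earlier backward having
    # populated .grad), trainable parameters that did not take part
    # in the autograd graph still have .grad == None.
    unused = [name for name, p in model.named_parameters()
              if p.requires_grad and p.grad is None]
    print(f"{tag} unused parameters: {unused}")
    return unused
```

This also reported no unused parameters for my model, consistent with the unused_parameter_size=0 log line above.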
How can I track down which parameter(s) cause this error during backward?
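Update: one diagnostic I'm experimenting with is registering a tensor hook on each parameter, so I can record, per backward pass, which parameters actually received a gradient. Unlike checking .grad, the hook fires while backward runs, so it distinguishes iterations even though gradients accumulate. This is only a sketch, and attach_usage_loggers is a name I made up:

```python
import torch

def attach_usage_loggers(model):
    # Record which parameters receive a gradient during each backward.
    # The caller should copy and clear the returned set between iterations.
    used = set()
    for name, p in model.named_parameters():
        if p.requires_grad:
            def make_hook(param_name):
                def hook(grad):
                    used.add(param_name)  # fires while backward runs
                return hook
            p.register_hook(make_hook(name))
    return used
```

My hope is that diffing the recorded sets between the first and second iteration will point at the parameter DDP complains about.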