I am trying to set static_graph=True in DDP (DistributedDataParallel), since I believe my training graph really is static. However, during backward I get the error
RuntimeError: Your training graph has changed in this iteration, e.g., one parameter is unused in first iteration, but then got used in the second iteration. this is not compatible with static_graph set to True.
I would like to see which parameter(s) are causing this.
So far, I've tried enabling distributed debug logging via
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
but the only relevant output I get, each time the model is called, is

[I logger.cpp:377] [Rank 0 / 2] [before iteration 1] Training ActorNetwork unused_parameter_size=0

and unused_parameter_size always equals zero.
I've also tried logging the forward trace on every model call with a global forward hook:

torch.nn.modules.module.register_module_forward_hook(
    lambda module, input, output: print(module)
)

but the traces from the first two model calls are identical (verified with diff).
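In addition, I tried inspecting which parameters never received a gradient after the first backward, since parameters that did not take part in the autograd graph keep .grad == None (this is only reliable for the very first backward, because .grad accumulates afterwards). A minimal sketch of that check, where report_unused_params is just a helper name I made up:

```python
import torch

def report_unused_params(model, tag=""):
    # After a fresh forward + backward (no earlier backward having
    # populated .grad), trainable parameters that did not take part
    # in the autograd graph still have .grad == None.
    unused = [name for name, p in model.named_parameters()
              if p.requires_grad and p.grad is None]
    print(f"{tag} unused parameters: {unused}")
    return unused
```

This also reported no unused parameters for my model, consistent with the unused_parameter_size=0 log line above.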
How can I track down which parameter(s) cause this error during backward?
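Update: one diagnostic I'm experimenting with is registering a tensor hook on each parameter, so I can record, per backward pass, which parameters actually received a gradient. Unlike checking .grad, the hook fires while backward runs, so it distinguishes iterations even though gradients accumulate. This is only a sketch, and attach_usage_loggers is a name I made up:

```python
import torch

def attach_usage_loggers(model):
    # Record which parameters receive a gradient during each backward.
    # The caller should copy and clear the returned set between iterations.
    used = set()
    for name, p in model.named_parameters():
        if p.requires_grad:
            def make_hook(param_name):
                def hook(grad):
                    used.add(param_name)  # fires while backward runs
                return hook
            p.register_hook(make_hook(name))
    return used
```

My hope is that diffing the recorded sets between the first and second iteration will point at the parameter DDP complains about.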