When using torch.nn.parallel.DistributedDataParallel to train a network, I got the error "please add find_unused_parameters=True into DistributedDataParallel". After passing find_unused_parameters=True to DistributedDataParallel, I can train the network normally.
I know the error occurs because the network has some parameters that do not participate in the loss computation. Is there a tool to find out which ones they are?
I have the same issue and also need to find those unused parameters. Please let me know if you have found any solution.
I had the same issue too. I trained the model for a few steps, saved checkpoints with torch.save, and then compared the parameters between checkpoints to find the ones that were never updated (the unused ones):

sd1 = torch.load("./work_dir/step_1.pth")["state_dict"]
sd4 = torch.load("./work_dir/step_5.pth")["state_dict"]
for k in sd1:
    v1 = sd1[k]
    v4 = sd4[k]
    # a parameter that never receives a gradient stays identical across steps
    if torch.equal(v1, v4):
        print(k)
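The idea above can be demonstrated end to end without saved checkpoint files. This is a minimal sketch with a hypothetical toy model (`Net`, with a deliberately unused layer) that snapshots the state dict before and after one optimizer step and reports the parameters that never changed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model: `unused` is defined but never called in forward()
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never participates in the loss

    def forward(self, x):
        return self.used(x)

model = Net()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Snapshot parameters before training (stands in for the saved .pth files)
sd_before = {k: v.clone() for k, v in model.state_dict().items()}

# One training step
loss = model(torch.randn(8, 4)).sum()
opt.zero_grad()
loss.backward()
opt.step()

sd_after = model.state_dict()

# Parameters that never received a gradient stay identical across steps
stale = [k for k in sd_before if torch.equal(sd_before[k], sd_after[k])]
print(stale)
```

Here `stale` ends up holding the weight and bias of the `unused` layer, since only they survive the optimizer step unchanged.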
I moved all the trainable parameters into the forward pass and then the problem was solved. Hope this helps you.
An easy way to find unused params is to train your model on a single node without the DDP wrapper. After loss.backward() and before optimizer.step(), add the lines below:

for name, param in model.named_parameters():
    if param.grad is None:
        print(name)

This will print every param that did not take part in the loss calculation; their grad will be None.
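As a self-contained illustration of that check, here is a minimal sketch using a hypothetical two-branch model where one branch is skipped in forward(), so its parameters keep grad is None after backward():

```python
import torch
import torch.nn as nn

# Hypothetical toy model: `branch_b` is defined but skipped in forward()
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch_a = nn.Linear(4, 2)
        self.branch_b = nn.Linear(4, 2)  # never used, so its grads stay None

    def forward(self, x):
        return self.branch_a(x)

model = Net()
loss = model(torch.randn(8, 4)).sum()
loss.backward()

# After backward(), parameters outside the graph still have grad is None
unused = [name for name, param in model.named_parameters()
          if param.grad is None]
print(unused)
```

Running this lists the weight and bias of `branch_b`, exactly the parameters DDP would complain about without find_unused_parameters=True.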
Very good method!!! thank you very much!!!