When using torch.nn.parallel.DistributedDataParallel to train my network, I got a "please add find_unused_parameters=True into DistributedDataParallel" error. After adding this flag to DistributedDataParallel, I can train the network normally.
I know the error occurs because the network has some unused parameters. Is there a tool to find out which ones they are?
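For reference, this is roughly how I wrap the model now (a sketch; MyModel and the torchrun-style launch stand in for my actual setup):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # assumes a torchrun-style launch
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun for each process
model = MyModel().cuda(local_rank)           # MyModel stands in for my network
model = DDP(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,             # the flag the error message asks for
)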
Hi.
I have the same issue. I trained the model for a few steps, saved checkpoints with torch.save, and compared the parameters to find the ones that were never updated (i.e. unused).
For example:
import torch

# Compare two saved checkpoints; parameters whose values did not change
# between step 1 and step 5 are likely unused.
sd1 = torch.load("./work_dir/step_1.pth")["state_dict"]
sd4 = torch.load("./work_dir/step_5.pth")["state_dict"]
for k in sd1:
    v1 = sd1[k]
    v4 = sd4[k]
    if (v1 == v4).all():
        print(k)
An easy way to find unused parameters is to train your model on a single node without the DDP wrapper. After the loss.backward() call and before optimizer.step(), add the lines below:
for name, param in model.named_parameters():
    if param.grad is None:
        print(name)
This will print any parameter that did not take part in the loss calculation; its grad will be None.
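For context, a minimal sketch of where that check sits in a plain single-process training step (model, data_loader, criterion, and optimizer stand in for your own objects):

model.train()
for inputs, targets in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # parameters that took no part in the forward pass / loss keep grad == None
    for name, param in model.named_parameters():
        if param.grad is None:
            print(name)
    optimizer.step()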