Hi. When training a model with DDP, I found that the RuntimeError about unused parameters was not thrown as expected, even though find_unused_parameters was set to False.
Here’s a toy code snippet for reproduction.
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.optim as optim

class Toy(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.a = nn.Linear(1, 1)
        self.b = nn.Linear(1, 1)  # defined but never used in forward

    def forward(self, x):
        return self.a(x)  # self.b does not participate in the forward pass

dist.init_process_group(backend='nccl')
model = Toy().to(device=dist.get_rank())
ddp_model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=False)

optimizer = optim.SGD(model.parameters(), lr=1)
optimizer.zero_grad()

x = torch.randn(2, 1).to(device=dist.get_rank())
out = ddp_model(x)
out.mean().backward()

print(dist.get_rank(), ddp_model.module.a.weight)  # weight of a before the step
optimizer.step()
print(dist.get_rank(), ddp_model.module.a.weight)  # weight of a after the step

dist.destroy_process_group()
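(For reference, the snippet can be launched with torchrun on two GPUs so that the process-group environment variables are set automatically; toy.py below is just a placeholder filename for wherever the snippet is saved.)

torchrun --nproc_per_node=2 toy.py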
Running the code above on two GPUs, I got:
0 Parameter containing:
tensor([[0.6218]], device='cuda:0', requires_grad=True)
0 Parameter containing:
tensor([[0.7547]], device='cuda:0', requires_grad=True)
1 Parameter containing:
tensor([[0.6218]], device='cuda:1', requires_grad=True)
1 Parameter containing:
tensor([[0.9173]], device='cuda:1', requires_grad=True)
After one optimization step, the weight of model.a is out of sync across the two ranks. I expected a RuntimeError to be thrown here to remind me to set find_unused_parameters=True, but it wasn't.
As expected, commenting out the definition of self.b or setting find_unused_parameters=True gives the correct, synchronized result.
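For reference, here is a minimal sketch of the second workaround; only the DDP construction changes relative to the snippet above:

# Same setup as above; only the DDP wrapper changes.
ddp_model = torch.nn.parallel.DistributedDataParallel(
    model,
    find_unused_parameters=True,  # mark self.b's parameters as unused so gradient reduction can finish
)

With this change, a.weight stays identical on both ranks after optimizer.step().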
So why does this happen? Please let me know if I'm wrong somewhere.
Best regards.