Parameters out of sync across ranks due to unused parameters

Hi. When training a model with DDP, I found that the RuntimeError about unused parameters was not thrown as expected, even though find_unused_parameters was set to False.

Here’s a toy code snippet for reproduction.

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

class Toy(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.a = nn.Linear(1, 1)
        self.b = nn.Linear(1, 1)  # defined but never used in forward()

    def forward(self, x):
        # only self.a participates in the forward pass, so self.b's
        # parameters never receive gradients
        return self.a(x)

# launched with one process per GPU (e.g. via torchrun, which sets the
# environment variables required by init_process_group)
dist.init_process_group(backend='nccl')
model = Toy().to(device=dist.get_rank())
ddp_model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=False)
optimizer = optim.SGD(model.parameters(), lr=1)
optimizer.zero_grad()

x = torch.randn(2, 1).to(device=dist.get_rank())  # different random input on each rank
out = ddp_model(x)
out.mean().backward()

# print a's weight on each rank before and after the optimizer step
print(dist.get_rank(), ddp_model.module.a.weight)
optimizer.step()
print(dist.get_rank(), ddp_model.module.a.weight)
dist.destroy_process_group()

Running the code above on two ranks, I got:

0 Parameter containing:
tensor([[0.6218]], device='cuda:0', requires_grad=True)
0 Parameter containing:
tensor([[0.7547]], device='cuda:0', requires_grad=True)
1 Parameter containing:
tensor([[0.6218]], device='cuda:1', requires_grad=True)
1 Parameter containing:
tensor([[0.9173]], device='cuda:1', requires_grad=True)

After one optimization step, the weight of model.a is out of sync across the two ranks. I expected a RuntimeError to be thrown here to remind me to set find_unused_parameters=True, but it wasn't.
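To see where the divergence comes from, here is a small diagnostic I would add just before optimizer.step(). My assumption is that, since self.b never produces a gradient, DDP's reducer never completes the all-reduce for the bucket holding a's gradient, so each rank steps with its own local gradient:

# hypothetical diagnostic, inserted right before optimizer.step():
# a's gradient is expected to differ across ranks because it was never
# all-reduced, and b's gradient stays None because b is unused in forward()
print(dist.get_rank(), 'grad of a:', ddp_model.module.a.weight.grad)
print(dist.get_rank(), 'grad of b:', ddp_model.module.b.weight.grad)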

As expected, commenting out the definition of self.b or setting find_unused_parameters=True gives the correct, synchronized result.
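For reference, the one-line change for the second workaround, with everything else unchanged:

# tell DDP to scan for unused parameters after each forward pass
ddp_model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)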

So why does this happen? Please let me know if I'm wrong somewhere.

Best regards.

Found the solution: the RuntimeError is thrown on the second forward, as expected. DDP only checks whether the previous iteration's gradient reduction finished when the next iteration starts, so the first iteration completes silently and the error surfaces one step later.
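For anyone hitting the same thing, here is a minimal sketch of a second iteration appended to the snippet above; on my setup the second call into the DDP model is where the RuntimeError about unfinished reduction / unused parameters surfaces (the exact message may vary across PyTorch versions):

# second iteration: DDP now notices that the gradient reduction from the
# previous iteration never finished (because self.b was unused) and raises
# a RuntimeError suggesting find_unused_parameters=True
x = torch.randn(2, 1).to(device=dist.get_rank())
out = ddp_model(x)          # <-- RuntimeError is raised here
out.mean().backward()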