My model does not contain a forward method. Can this confuse DDP?
I am confused. Do you mean that you have only used built-in layers such as:
model = nn.Sequential(
    nn.Conv2d(i, j, k),  # ... only built-in modules, no custom forward
)
If so, the `forward` of each module is still called implicitly, and the allreduce during the backward pass should still be triggered.
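To illustrate what I mean, here is a minimal sketch (the layer sizes, the nccl backend, and a torchrun launch that sets the rendezvous env vars are all my assumptions, not your setup): even though `nn.Sequential` has no user-defined `forward`, calling the DDP-wrapped model runs each submodule's `forward`, and `backward()` is where DDP allreduces the gradients.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # assumes launch via torchrun
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = nn.Sequential(                           # no user-defined forward anywhere
    nn.Conv2d(3, 16, 3, padding=1),              # hypothetical layer sizes
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
).cuda()

ddp_model = DDP(model, device_ids=[local_rank])

x = torch.randn(8, 3, 32, 32, device="cuda")
loss = ddp_model(x).sum()   # calls each submodule's forward implicitly
loss.backward()             # DDP allreduces gradients across ranks here
```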
Without investigating your source code, I cannot find any other reason for unsynced gradients, assuming you did not use the no_sync context manager. You should verify whether allreduce is ever invoked. A few ideas:
- You can try torch.profiler and check whether any allreduce operator shows up in your GPU traces (first sketch after this list).
- Alternatively, you can try registering a PowerSGD DDP comm hook and check whether any PowerSGD stats get logged (second sketch below).
- I am not sure about this one, but using the slower DataParallel instead of DistributedDataParallel might bypass the issue, since it runs in a single process and does not rely on allreduce at all (third sketch below).
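For the first idea, a rough sketch of profiling one training step and searching the trace for allreduce events; it reuses the hypothetical `ddp_model` and input `x` from the sketch above:

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = ddp_model(x)      # one forward/backward step under the profiler
    out.sum().backward()

# If DDP is syncing gradients, events with "allreduce" in their names
# (NCCL kernels / c10d ops) should appear in the table and in the trace file.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("ddp_step_trace.json")  # open in chrome://tracing or Perfetto
```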
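For the second idea, something along these lines should work (again assuming the `ddp_model` from above; the approximation rank and start iteration are arbitrary choices for the sanity check):

```python
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(
    process_group=None,            # use the default process group
    matrix_approximation_rank=1,
    start_powerSGD_iter=2,
)
ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)

# If the hook is actually exercised (i.e., gradient buckets are being reduced),
# PowerSGD periodically logs its compression stats at INFO level.
```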
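And for the last idea, a single-process DataParallel sanity check might look like this (same hypothetical model and input shapes as before); since DataParallel replicates the model inside each forward call and accumulates gradients on the source device, it sidesteps gradient allreduce entirely:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
).cuda()

dp_model = nn.DataParallel(model)              # single process, all visible GPUs
x = torch.randn(8, 3, 32, 32, device="cuda")
dp_model(x).sum().backward()                   # grads end up on model's own params
```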