How Can DDP Processes Get Out of Sync?

I run a single-machine, 2-GPU, ResNet-based training job with one process per GPU. Each process prints a message at the start of each epoch.

After a few minutes, one process draws ahead of the other. The gap eventually grows to multiple epochs, even though both processes do finish.

How can this speed difference occur, given that DDP synchronizes the two processes during each call to backward()?

Based on a note on a website about different DDP processes receiving different numbers of inputs, I tried

```python
with model.join():
    ...  # <training loop>
```

But I observed no change in behavior. How does DDP sneak past that backprop synchronization point?

Unfortunately, I could not reproduce the problem in a simple test case. But maybe I am missing some logic?

- Ubuntu 20.04
- PyTorch torch-1.7.1-py3.8
- torch.cuda.nccl.version(): 2708
- 2x NVIDIA GTX Titan
- Single machine, 2 processes, one per GPU
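For reference, a stripped-down sketch of the kind of setup described above (the model, sizes, and port are placeholders, not my actual code):

```python
# Minimal single-machine DDP skeleton: one process per GPU, each printing
# a message at the start of every epoch. Model and data are placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int, backend: str = "nccl") -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}" if backend == "nccl" else "cpu")
    model = DDP(nn.Linear(10, 2).to(device),
                device_ids=[rank] if backend == "nccl" else None)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(3):
        print(f"rank {rank}: starting epoch {epoch}", flush=True)
        x = torch.randn(8, 10, device=device)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()  # gradient allreduce is enqueued here
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    torch.multiprocessing.spawn(run, args=(2,), nprocs=2)
```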

Hi, unless an explicit synchronization (such as a device-to-host copy or torch.cuda.synchronize) is triggered, this sort of skew is possible because the two GPUs run at different speeds, but such a significant skew (many epochs) seems unlikely. Are there any other jobs or processes running on your GPUs that may slow one down, and is the problem consistently reproducible?

As for model.join(): that API exists to support different processes having a different number of inputs. Is that your case here? If not, you should not need model.join().
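To make the "different number of inputs" case concrete, here is a small sketch (not your code; batch counts and sizes are made up) where one rank runs more training steps than the other, which is the situation join() exists for:

```python
# Sketch: ranks deliberately see different numbers of batches. Without
# join(), the rank with fewer batches would stop participating in the
# gradient allreduce and the other rank's backward() would hang.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(rank: int, world_size: int, backend: str = "gloo") -> int:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    model = DDP(nn.Linear(4, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    n_batches = 6 if rank == 0 else 4  # deliberately uneven input counts
    steps = 0
    with model.join():  # shadows collectives for ranks that finish early
        for _ in range(n_batches):
            loss = model(torch.randn(2, 4)).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
            steps += 1
    dist.destroy_process_group()
    return steps
```

With equal-length, drop_last loaders on every rank, this situation never arises, which is why join() is unnecessary there.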

Thank you for both answers, Rohan! Before responding I was trying to verify that moving the model save to after the optimizer step solved the issue, and I didn't have a chance for that yet.

The oddity with the out-of-sync processes is that they run in lockstep for quite a while, meaning several minutes, and then one of them slows way down. Often, once the faster process finishes, the slow one sits at 100% of both GPU and CPU (according to top). So something is seriously wrong.

What confuses me is that I thought DDP forces synchronization as part of the backward() call, to ensure the same gradients are used on all the processes going forward. So how could two processes get out of sync at all, even beyond a single training loop?

Nothing else is running on the machine besides these two processes.

Inputs should be of the same lengths; the model.join() attempt was just out of desperation. I run with drop_last in the dataloader, so even the final batch is fully populated. If anything, the two processes are too decoupled, rather than one waiting for the other due to unequal inputs.

The one difference from all the tutorial setups is that I am using k-fold cross-validation with a distributed sampler. So epochs are put together by train/validate passes over multiple splits (i.e. rotating the validation folds). But to the DDP mechanism that shouldn't make a difference, I would think.
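For what it's worth, the per-fold loader construction looks roughly like this (function and variable names here are placeholders, not my actual code):

```python
# Hypothetical sketch of a per-fold loader: a DistributedSampler over the
# rotating train split, with drop_last=True so the final batch is full,
# and set_epoch() so shuffling stays consistent across ranks per epoch.
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def fold_loader(dataset, fold_indices, rank, world_size, epoch, batch_size=4):
    train_split = Subset(dataset, fold_indices)
    sampler = DistributedSampler(train_split, num_replicas=world_size,
                                 rank=rank, shuffle=True)
    sampler.set_epoch(epoch)  # same shuffle order on every rank this epoch
    return DataLoader(train_split, batch_size=batch_size,
                      sampler=sampler, drop_last=True)
```

Since DistributedSampler pads the split so every rank gets the same number of samples, each rank should see the same number of batches per fold.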

For now both processes run on the same machine, to exclude an NCCL version mismatch, though my goal is cross-machine work. I did try updating NCCL and CUDA to (I think) version 11, but that did not fix the issue.

Does any of the above point to a misunderstanding on my part? Of course,
it could simply be a bug in the code somewhere.


The way NCCL allreduce calls (basically the backward sync point you mention) work is by enqueuing the ops on the GPU, after which the CPU host can continue. So, while unlikely, this desynchronization can occur.

Are you noticing any significant performance issues when this happens? If you do need to synchronize the GPUs, you can use torch.cuda.synchronize() or dist.barrier(), though this might affect performance, especially if called very frequently.
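For example, one way to bound the skew without paying a barrier on every iteration is to align the ranks once per epoch (a sketch, assuming a standard process-group setup; the helper name is made up):

```python
# Sketch: CPU-side alignment once per epoch rather than per step.
# First drain this rank's queued GPU work, then wait for the other ranks.
import torch
import torch.distributed as dist

def end_of_epoch_sync() -> None:
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for this rank's enqueued kernels
    dist.barrier()                # then wait for every other rank
```

Calling this at each epoch boundary keeps the processes within one epoch of each other while leaving the per-step allreduce fully asynchronous.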

Sorry, Rohan; I needed to move forward, and will run on a single GPU for now. I did try synchronize() and barrier(), but somehow one process ends up taking 100% of a CPU and all the memory on its GPU. So something is wrong; I'll have to go through my own code again when I get the chance. Thank you nonetheless!