Loss.backward runs slowly for distributed training

When I use DistributedDataParallel to train a model across two machines (each with one GPU), loss.backward() takes much longer to run than when using two GPUs in one machine. To be exact, loss.backward() takes about 0.3 seconds in distributed training mode, but only 0.01 seconds in non-distributed training mode. Does anyone know why?
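For context, a minimal sketch of the kind of two-machine DDP setup being described; the toy model, NCCL backend, and torchrun launch (one process per machine) are assumptions, not the original code:

```python
# Minimal two-machine DDP sketch; model and sizes are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")   # MASTER_ADDR/PORT, RANK, WORLD_SIZE come from the launcher
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])

x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
loss = ddp_model(x).sum()
loss.backward()   # DDP all-reduces the gradients across the two machines during this call
```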

Well, I'm not an expert, but if you have to share data between two machines, the transmission takes time…

Note that unless you synchronize, the times you measure may not be accurate.
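For example, a timing pattern along these lines synchronizes on both sides of the call so the measurement is not just the asynchronous kernel launch; this is a sketch, and `loss` stands for whatever the forward pass produced:

```python
# Sketch of timing backward with synchronization before and after the call.
import time
import torch

torch.cuda.synchronize()   # wait for the forward pass to actually finish
start = time.time()
loss.backward()            # under DDP this call also launches the gradient all-reduce
torch.cuda.synchronize()   # wait for backward (and the all-reduce) to finish before stopping the clock
print(f"loss.backward() took {time.time() - start:.4f} s")
```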

In my understanding, the execution order is loss.backward() --> all_reduce --> …, so the transmission time should be included in the all_reduce step.
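One way to check where the transmission time actually lands is to time a raw all_reduce on its own, separate from backward. A rough sketch, assuming the process group is already initialized with the NCCL backend and a buffer roughly the size of the model's gradients:

```python
# Time a standalone all_reduce to isolate the communication cost.
import time
import torch
import torch.distributed as dist

numel = 25_000_000                        # ~100 MB of fp32; adjust to the real gradient size
buf = torch.randn(numel, device="cuda")

torch.cuda.synchronize()
start = time.time()
dist.all_reduce(buf)                      # sums the buffer across both machines
torch.cuda.synchronize()
print(f"all_reduce of {numel} floats took {time.time() - start:.4f} s")
```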

I already call torch.cuda.synchronize() before loss.backward(), but it is still slow.
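A follow-up experiment that could show whether the inter-machine gradient all-reduce is the cost: time backward both inside and outside DDP's no_sync() context, which skips gradient synchronization. This is only a sketch; `ddp_model` and `batch` are placeholders for the real training objects, and `.sum()` stands in for the actual loss:

```python
# Compare backward time with and without DDP gradient synchronization.
import time
import torch

def timed_backward(loss):
    torch.cuda.synchronize()
    start = time.time()
    loss.backward()
    torch.cuda.synchronize()
    return time.time() - start

with ddp_model.no_sync():                            # skips the inter-machine all-reduce
    t_no_sync = timed_backward(ddp_model(batch).sum())
t_sync = timed_backward(ddp_model(batch).sum())      # normal backward with all-reduce
print(f"no_sync: {t_no_sync:.4f} s, with all_reduce: {t_sync:.4f} s")
```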