Loss.backward runs slowly for distributed training

xdcesc · November 28, 2018, 2:58pm

When I use DistributedDataParallel to train a model across two machines (each with one GPU), loss.backward() takes much longer to run than when I use two GPUs in one machine. To be exact, loss.backward() takes 0.3 seconds in distributed training mode and only 0.01 seconds in non-distributed training mode. Does anyone know why?

JuanFMontesinos · November 28, 2018, 4:04pm

Well, I’m not an expert, but if you have to share data between two machines, the transmission takes time…

SimonW · November 28, 2018, 4:07pm

Note that CUDA calls are asynchronous, so unless you synchronize (torch.cuda.synchronize()) before reading the clock, the times you measure may not be accurate.
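
Here is a rough sketch of the kind of measurement I mean, not a definitive benchmark. The model, tensor sizes, and the helper names sync and timed_backward are made up for illustration, and it falls back to CPU when no GPU is available. The point is to synchronize both before starting the clock and after backward returns:

```python
import time

import torch
import torch.nn as nn


def sync():
    """Wait for queued CUDA kernels to finish; a no-op on CPU-only machines."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()


def timed_backward(loss):
    """Return wall-clock seconds spent in loss.backward(), including GPU work."""
    sync()  # drain work still queued from the forward pass
    start = time.perf_counter()
    loss.backward()
    sync()  # wait until the backward kernels have actually finished
    return time.perf_counter() - start


device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)  # hypothetical stand-in model
inputs = torch.randn(64, 1024, device=device)
loss = model(inputs).sum()
print(f"backward took {timed_backward(loss):.4f} s")
```

Without the first sync, leftover forward-pass work gets charged to backward; without the second, the timer can stop before the GPU is done.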

Xuepeng_Wang · November 29, 2018, 1:06am

In my understanding, the execution order is loss.backward --> all_reduce --> …, so the transmission time should be included in the all_reduce step.

Xuepeng_Wang · November 29, 2018, 1:45am

I already call torch.cuda.synchronize() before loss.backward(), but it is still slow.