How to overlap backward and communication?

I am trying to train a CNN in a distributed setting. As the number of data-parallel nodes increases, the communication time gets longer, so I think overlapping backward() and all_reduce() is a good way to hide the communication cost. But I am not sure how to do this. Is there an easy way to do so?
My PyTorch version is 0.4.0.

Hi,

The version of DistributedDataParallel in master already starts the all_reduce of each parameter as soon as its gradient is computed, effectively overlapping the rest of the backward pass with the communication.
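
For illustration only, here is a rough hook-based sketch of the same idea. This is not what DistributedDataParallel actually does internally (it buckets gradients, among other things), and model / loss are placeholders for your own training script; it also assumes the process group is already initialized and that gradients are not accumulated across iterations:

import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called and that
# `model` is the local replica of the network.
world_size = dist.get_world_size()
buffers = {}   # parameter -> communication buffer
handles = {}   # parameter -> async work handle

def make_hook(param):
    def hook(grad):
        # Copy the freshly computed gradient into its own buffer and start
        # a non-blocking all_reduce on it; the rest of backward() runs
        # concurrently with this communication.
        buf = grad.detach().clone()
        buffers[param] = buf
        handles[param] = dist.all_reduce(buf, async_op=True)
        return grad
    return hook

for p in model.parameters():
    if p.requires_grad:
        p.register_hook(make_hook(p))

# One training step:
loss.backward()                   # hooks fire as each gradient becomes ready
for p, handle in handles.items():
    handle.wait()                 # make sure the communication has finished
    p.grad.data.copy_(buffers[p]).div_(world_size)   # write back the average
handles.clear()
buffers.clear()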

Thank you!
BTW, does version 1.0rc1 already have this overlapping feature?
Could I just use the code below instead of the DistributedDataParallel module to get the overlap?

loss.backward()                  # full backward pass runs to completion first
dist.all_reduce(flatten_grads)   # only then are the flattened gradients reduced

Yes, the nightly build will have that.
The code in the new DistributedDataParallel is better than that snippet, because the all_reduce starts before the end of the backward() call rather than after it.
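
For reference, a minimal usage sketch of DistributedDataParallel, assuming one process per GPU launched e.g. with torch.distributed.launch (which passes --local_rank); MyCNN, loader, criterion and optimizer are placeholders for your own script:

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)   # set by the launcher
args = parser.parse_args()

dist.init_process_group(backend='nccl')   # reads rank/world size from the environment
torch.cuda.set_device(args.local_rank)

model = MyCNN().cuda()                    # MyCNN is a placeholder for your model
model = DistributedDataParallel(model, device_ids=[args.local_rank])

for data, target in loader:               # loader/criterion/optimizer from your script
    output = model(data.cuda(non_blocking=True))
    loss = criterion(output, target.cuda(non_blocking=True))
    optimizer.zero_grad()
    loss.backward()       # gradient all_reduce is overlapped with this backward pass
    optimizer.step()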

Thanks. I tried several methods but I could not get any improvement in PyTorch v1.0. Could you give more details or a code example so that I can use this feature?