How to overlap backward and communication?

I am trying to train a CNN in a distributed setting. As the number of data-parallel nodes increases, the communication time gets longer, so I think overlapping backward() and all_reduce() is a good way to hide the communication cost. But I am not sure how to do this. Is there an easy way to do so?
My PyTorch version is 0.4.0.

Hi,

The version of DistributedDataParallel in master already starts the all_reduce of each parameter as soon as its gradient is computed, effectively overlapping the rest of the backward pass with the communication.
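
For illustration only, here is a rough hook-based sketch of the same idea. This is not what DistributedDataParallel actually does internally (it buckets gradients, among other things), and model / loss are placeholders for your own training script; it also assumes the process group is already initialized and that gradients are not accumulated across iterations:

import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called and that
# `model` is the local replica of the network.
world_size = dist.get_world_size()
buffers = {}   # parameter -> communication buffer
handles = {}   # parameter -> async work handle

def make_hook(param):
    def hook(grad):
        # Copy the freshly computed gradient into its own buffer and start
        # a non-blocking all_reduce on it; the rest of backward() runs
        # concurrently with this communication.
        buf = grad.detach().clone()
        buffers[param] = buf
        handles[param] = dist.all_reduce(buf, async_op=True)
        return grad
    return hook

for p in model.parameters():
    if p.requires_grad:
        p.register_hook(make_hook(p))

# One training step:
loss.backward()                   # hooks fire as each gradient becomes ready
for p, handle in handles.items():
    handle.wait()                 # make sure the communication has finished
    p.grad.data.copy_(buffers[p]).div_(world_size)   # write back the average
handles.clear()
buffers.clear()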

Thank you!
BTW, does version 1.0rc1 already have this overlapping feature?
Could I just use the code below instead of the DistributedDataParallel module to get the overlap?

loss.backward()                  # full backward pass runs to completion first
dist.all_reduce(flatten_grads)   # only then are the flattened gradients reduced

Yes, the nightly build will have that.
The code in the new DistributedDataParallel is better than that snippet, because the all_reduce starts before the end of the backward() call rather than after it.
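
For reference, a minimal usage sketch of DistributedDataParallel, assuming one process per GPU launched e.g. with torch.distributed.launch (which passes --local_rank); MyCNN, loader, criterion and optimizer are placeholders for your own script:

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)   # set by the launcher
args = parser.parse_args()

dist.init_process_group(backend='nccl')   # reads rank/world size from the environment
torch.cuda.set_device(args.local_rank)

model = MyCNN().cuda()                    # MyCNN is a placeholder for your model
model = DistributedDataParallel(model, device_ids=[args.local_rank])

for data, target in loader:               # loader/criterion/optimizer from your script
    output = model(data.cuda(non_blocking=True))
    loss = criterion(output, target.cuda(non_blocking=True))
    optimizer.zero_grad()
    loss.backward()       # gradient all_reduce is overlapped with this backward pass
    optimizer.step()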

Thanks. I tried several methods but I could not get any improvement in PyTorch v1.0. Could you give more details or a code example so that I can use this feature?