DistributedDataParallel gradient averaging

I am experimenting with gradient compression techniques to reduce communication during distributed training. However, I found that DDP by default averages the replica gradients with all-reduce. Is there some way to turn this off, since I will be aggregating the gradients in an encoded format?


On the master branch, there is a prototype DDP communication hook feature, which is built for this purpose: https://github.com/pytorch/pytorch/issues/39272

In prior releases (<= v1.6), there is no way to turn gradient averaging off without modifying the C++ code and recompiling.
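For reference, here is a minimal sketch of what a hook that sums instead of averages could look like with the prototype API on master. The GradBucket accessors and the return convention are still in flux, so treat the exact calls below as illustrative rather than final:

```python
import torch.distributed as dist

# Sketch of a DDP comm hook that all-reduces the bucket as a SUM instead of
# letting DDP average the gradients. Assumes the prototype register_comm_hook
# API on master; bucket.get_tensors() may change in later versions.
def allreduce_sum_hook(state, bucket):
    tensor = bucket.get_tensors()[0]  # flattened gradients of this bucket
    fut = dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=True).get_future()
    # The hook must return a Future; its value is copied back into param.grad.
    return fut.then(lambda f: [f.value()[0]])

# Usage (ddp_model is an already-wrapped DistributedDataParallel instance):
# ddp_model.register_comm_hook(state=None, hook=allreduce_sum_hook)
```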

Update

I synced with Sinan (the author of this comm hook feature): the feature will be reverted due to a perf regression. We are investigating.


Examples can be found here: https://github.com/pytorch/pytorch/blob/c76fada4a859742ac679013b7428017a782e1432/torch/nn/parallel/distributed.py#L607-L684

IIUC, as of today, the communication bucket is still divided by the world size even if the hook is enabled. We are working on removing that division.
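If that pre-division gets in the way before the fix lands, one possible workaround (just a sketch, not an official recommendation) is to scale the bucket back up by the world size inside the hook. Because all-reduce is linear, this compensates whether the division is applied before the hook sees the bucket or to the result afterwards, as long as it happens exactly once:

```python
import torch.distributed as dist

# Sketch: undo DDP's internal division of the bucket by the world size by
# scaling the bucket back up before communication. Assumes the division is
# applied exactly once by DDP; verify this against the version you run.
def undo_predivide_hook(state, bucket):
    tensor = bucket.get_tensors()[0]
    tensor.mul_(dist.get_world_size())  # recover the un-divided gradient sum
    fut = dist.all_reduce(tensor, async_op=True).get_future()
    return fut.then(lambda f: [f.value()[0]])
```

Note that if your compression is non-linear, it matters where the division actually happens, so this is only a stopgap until the division is removed.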


Thanks, Shen Li.

I had in fact already seen the DDP communication hook PR and had interacted with Sinan as well. I was actually looking for something more flexible which would allow me to measure the time and bits during communication.

I will definitely check the comm hook once ready.

I was actually looking for something more flexible which would allow me to measure the time and bits during communication.

It should be possible to do this in the current proposal of the communication hook. Could you elaborate a bit more on the limitations in the current proposal that might prevent us from doing these measurements?

As of now, I plan to measure the time taken for gradient accumulation, and the number of bits communicated (for each iteration). I might even need to find the bits communicated for each layer in the future to explore layer-wise compression.

As of now, I plan to measure the time taken for gradient accumulation

Are you referring to the AccumulateGrad function? If so, the autograd profiler will display the time taken for this function.
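For example (a rough sketch; model and inputs are placeholders), something like this should surface the AccumulateGrad entries:

```python
import torch

# Sketch: profile one forward/backward pass and look for the AccumulateGrad
# rows, which correspond to writing gradients into param.grad.
with torch.autograd.profiler.profile() as prof:
    loss = model(inputs).sum()   # model/inputs are assumed to exist
    loss.backward()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```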

and the number of bits communicated (for each iteration)

This should be possible in the current proposal of the communication hook since you can add up the bits for all the buckets.
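A rough sketch of how that could look, assuming you pass a plain dict as the hook state and use it as a counter (the names below are made up for illustration):

```python
import torch.distributed as dist

# Sketch: count the bytes handed to communication each iteration by summing
# the bucket sizes inside the comm hook. "state" is a user-provided dict.
def counting_allreduce_hook(state, bucket):
    tensor = bucket.get_tensors()[0]
    state["bytes"] = state.get("bytes", 0) + tensor.numel() * tensor.element_size()
    fut = dist.all_reduce(tensor, async_op=True).get_future()
    return fut.then(lambda f: [f.value()[0]])

# counters = {"bytes": 0}
# ddp_model.register_comm_hook(state=counters, hook=counting_allreduce_hook)
# After each iteration: bits_communicated = counters["bytes"] * 8; then reset.
```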

I might even need to find the bits communicated for each layer in the future to explore layer-wise compression.

This is probably something you can’t do with the existing hook, since it provides the entire bucket to the user and there is currently no way to split individual parameters out of the bucket.