Parallel training with all_reduce on a single GPU

Hi,

I want to use data parallelism to train my model on a single GPU. I followed the PyTorch Distributed Data Parallel example and passed the same device_id to 4 processes. With the all_reduce sync method, it runs even slower than using a single process. The interesting thing is that by disabling the all_reduce sync-up for gradients, training itself speeds up a lot. So I think the GPU has spare compute capacity for multiple training processes, but the bottleneck is the all_reduce step. Does anyone know the reason for this bottleneck? Is there any other way to sync parameter gradients without using all_reduce? Thanks.

I want to use data parallelism to train my model on a single GPU. I followed the PyTorch Distributed Data Parallel example and passed the same device_id to 4 processes. With the all_reduce sync method, it runs even slower than using a single process.

The all_reduce op expects each process to work exclusively on a different GPU. If the same GPU is shared across processes, it is not guaranteed to work.
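For comparison, the intended one-process-per-GPU setup maps each rank to its own device, roughly like this sketch (the NCCL backend and env-var rendezvous are just assumptions about your launch setup):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model, rank, world_size):
    # Standard mapping: rank r exclusively drives GPU r.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Each process gets its own device_id, so all_reduce calls from
    # different ranks never contend for the same GPU.
    return DDP(model.to(rank), device_ids=[rank])
```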

The interesting thing is that by disabling the all_reduce sync-up for gradients, training itself speeds up a lot.

How did you disable that in DDP?

So I think the GPU has spare compute capacity for multiple training processes, but the bottleneck is the all_reduce step.

It looks like your model is small enough that each op only occupies a subset of the GPU's resources.

Is there any other way to sync parameter gradients without using all_reduce?

Since you only use one GPU, you can try multiprocessing directly: launch a main process and use torch.multiprocessing to spawn 4 subprocesses. Use a torch.multiprocessing.SimpleQueue to pass grad tensors from the subprocesses back to the main process, let the main process accumulate them, and then pass the result back to all subprocesses.

The SimpleQueue tests in the PyTorch test suite can also serve as an example.
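Here is a minimal sketch of the queue-based gradient averaging described above; the worker function, tensor shape, and per-rank result queues are illustrative placeholders, not real training code:

```python
import torch
import torch.multiprocessing as mp

def worker(rank, grad_q, result_q):
    # Stand-in for a backward pass: each worker produces a local "gradient",
    # ships it to the main process, then waits for the averaged result.
    local_grad = torch.full((4,), float(rank))
    grad_q.put(local_grad)
    averaged = result_q.get()
    print(f"rank {rank}: averaged grad = {averaged.tolist()}")

if __name__ == "__main__":
    world_size = 4
    grad_q = mp.SimpleQueue()                                   # workers -> main
    result_qs = [mp.SimpleQueue() for _ in range(world_size)]   # main -> each worker

    procs = [mp.Process(target=worker, args=(r, grad_q, result_qs[r]))
             for r in range(world_size)]
    for p in procs:
        p.start()

    # Main process accumulates one gradient per worker, averages,
    # and sends the result back to every worker.
    total = torch.zeros(4)
    for _ in range(world_size):
        total += grad_q.get()
    averaged = total / world_size
    for q in result_qs:
        q.put(averaged)

    for p in procs:
        p.join()
```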

Hi @mrshenli, thanks for the reply. I was following this example, which uses an average_gradients function that calls all_reduce.
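Roughly, that helper looks like this:

```python
import torch.distributed as dist

def average_gradients(model):
    # All-reduce each parameter's gradient across processes, then
    # divide by the world size to get the average.
    size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= size
```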
So to disable all_reduce, I simply didn't call this function during training. I also noticed there are two different examples: the other one wraps the model in DDP, where the sync-up step happens during the backward pass. But I want a customized sync-up method, which is why I chose the approach with a separate average_gradients function.

I have tried using SimpleQueue to pass some large data, but it seems slow when calling .get(). I'll try using it to pass grad tensors and see how it performs.