Parallel training with all_reduce on a single GPU

Hi,

I want to use data parallelism to train my model on a single GPU. I followed the PyTorch Distributed Data Parallel example and passed the same device_id to 4 processes. With the all_reduce sync method, it runs even slower than using a single process. The interesting thing is that by disabling the all_reduce sync-up for gradients, training itself speeds up a lot. So I think the GPU has spare compute capacity for multiple training processes, but the bottleneck is the all_reduce step. Does anyone know the reason for this bottleneck? Is there any other way to sync parameter gradients without using all_reduce? Thanks.

I want to use data parallelism to train my model on a single GPU. I followed the PyTorch Distributed Data Parallel example and passed the same device_id to 4 processes. With the all_reduce sync method, it runs even slower than using a single process.

The all_reduce op expects each process to work exclusively on a different GPU. If the same GPU is shared across processes, it is not guaranteed to work.
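For comparison, the intended one-process-per-GPU setup maps each rank to its own device, roughly like this sketch (the NCCL backend and env-var rendezvous are just assumptions about your launch setup):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model, rank, world_size):
    # Standard mapping: rank r exclusively drives GPU r.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Each process gets its own device_id, so all_reduce calls from
    # different ranks never contend for the same GPU.
    return DDP(model.to(rank), device_ids=[rank])
```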

The interesting thing is that by disabling the all_reduce sync-up for gradients, training itself speeds up a lot.

How did you disable that in DDP?

So I think the GPU has spare compute capacity for multiple training processes, but the bottleneck is the all_reduce step.

It looks like your model is small enough that each op only occupies a subset of the GPU's resources.

Is there any other way to sync parameter gradients without using all_reduce?

Since you only use one GPU, you can try multiprocessing directly: launch a main process and use torch.multiprocessing to spawn 4 subprocesses. Use a torch.multiprocessing.SimpleQueue to pass grad tensors from the subprocesses back to the main process, let the main process accumulate them, and then pass the result back to all subprocesses.

The SimpleQueue tests in the PyTorch test suite can also serve as an example.
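Here is a minimal sketch of the queue-based gradient averaging described above; the worker function, tensor shape, and per-rank result queues are illustrative placeholders, not real training code:

```python
import torch
import torch.multiprocessing as mp

def worker(rank, grad_q, result_q):
    # Stand-in for a backward pass: each worker produces a local "gradient",
    # ships it to the main process, then waits for the averaged result.
    local_grad = torch.full((4,), float(rank))
    grad_q.put(local_grad)
    averaged = result_q.get()
    print(f"rank {rank}: averaged grad = {averaged.tolist()}")

if __name__ == "__main__":
    world_size = 4
    grad_q = mp.SimpleQueue()                                   # workers -> main
    result_qs = [mp.SimpleQueue() for _ in range(world_size)]   # main -> each worker

    procs = [mp.Process(target=worker, args=(r, grad_q, result_qs[r]))
             for r in range(world_size)]
    for p in procs:
        p.start()

    # Main process accumulates one gradient per worker, averages,
    # and sends the result back to every worker.
    total = torch.zeros(4)
    for _ in range(world_size):
        total += grad_q.get()
    averaged = total / world_size
    for q in result_qs:
        q.put(averaged)

    for p in procs:
        p.join()
```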

Hi @mrshenli, thanks for the reply. I was following this example, which uses an average_gradients function that calls all_reduce.
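Roughly, that helper looks like this:

```python
import torch.distributed as dist

def average_gradients(model):
    # All-reduce each parameter's gradient across processes, then
    # divide by the world size to get the average.
    size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= size
```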
So to disable all_reduce, I simply didn't call this function during training. I also noticed there are two different examples: the other one wraps the model in DDP, where the sync-up step happens during the backward pass. But I want a customized sync-up method, which is why I chose the approach with a separate average_gradients function.

I have tried using SimpleQueue to pass some large data, but it seems slow when calling .get(). I'll try using it to pass grad tensors and see how it performs.