torch.nn.parallel.data_parallel for distributed training: backward pass model update

When using torch.nn.parallel.data_parallel for distributed training, the model is copied onto multiple GPUs, and the copies can complete a forward pass without affecting each other. However, how do the copies interact during the backward pass? How are the model weights updated on each GPU?

When reading the documentation [1], I see the explanation:
“gradients from each replica are summed into the original module”

but I’m not sure how to interpret this statement. My best guess is that each GPU’s model is updated with the sum of the gradients from all GPUs, which I interpret to mean that there is locking across GPUs, so each GPU starts training on a new mini-batch only after they have all finished processing their current one.

[1] https://pytorch.org/docs/master/nn.html?highlight=dataparallel#dataparallel-layers-multi-gpu-distributed

Hi,

Each GPU computes the gradients for its part of the batch. The gradients are then accumulated on the “main” model, where the weight update is done. This “main” model then shares its weights with all the other GPUs so that all replicas have the same weights.
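If it helps, here is a rough sketch of that flow (not the library internals; the model, data, and optimizer below are just placeholders), showing that the summed gradients and the weight update live on the original module on the first GPU:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2).cuda()                 # the "main" model, on cuda:0
dp_model = nn.DataParallel(model)               # replicas are created on each forward pass
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 10).cuda()                   # the batch is scattered across the GPUs
y = torch.randn(8, 2).cuda()

out = dp_model(x)                               # forward runs on each replica in parallel
loss = nn.functional.mse_loss(out, y)
loss.backward()                                 # gradients from all replicas are summed
                                                # into model's parameters on cuda:0

optimizer.step()                                # single update on the "main" model; the next
                                                # forward call re-replicates these weights
```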


Hi Alban,

Thanks for clarifying. Does this mean that this parallelization utilizes locking, ensuring that each GPU model updates its weights from the “main” model before moving on to the next mini-batch?

Yes, the locking is built in, and the weights will be properly updated before they are used.


Hey, I am facing some issues with data parallel. I am training on 4 V100s with a batch size of 1. The forward pass time seems to scale, but the backward pass takes about 4x as long as on 1 V100, so there is no significant speedup when using 4 GPUs. I guess the backward pass is taking place on a single GPU. I am using nn.parallel.DataParallel; is there any solution to this problem? I can share more details if you want.

Hey @sanchit2843

When using batch_size == 1, DataParallel won’t be able to parallelize the computation, as the input cannot be chunked by the scatter linked below.
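For some intuition (this is not the actual internals, just the same idea): scatter splits the input along the batch dimension, roughly like torch.chunk, so a batch of 1 cannot be split into more than one piece:

```python
import torch

x = torch.randn(1, 10)                  # batch_size == 1
chunks = torch.chunk(x, 4, dim=0)       # ask for 4 pieces along the batch dimension
print(len(chunks))                      # 1 -> only one GPU would get any work

x = torch.randn(4, 10)                  # batch_size == 4
print(len(torch.chunk(x, 4, dim=0)))    # 4 -> one chunk per GPU
```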

but the time of backward pass is taking 4* times in comparison to 1 V100

Can you share a minimal repro of this? In general, a 4x slowdown is possible due to Python GIL contention (which you can avoid by using DistributedDataParallel). But I feel this should not be applicable here, as each batch contains only one sample and the replicated models are not really involved in the forward and backward passes.
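If you do switch, something along these lines is a minimal single-node DistributedDataParallel setup (one process per GPU; the address/port, model, and data below are just placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def worker(rank, world_size):
    # One process per GPU avoids the GIL contention that DataParallel can suffer from.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 2).cuda(rank)
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # Each process works on its own local batch; gradients are all-reduced
    # across processes during backward, so every replica stays in sync.
    x = torch.randn(1, 10).cuda(rank)   # even batch_size == 1 per GPU is fine here
    y = torch.randn(1, 2).cuda(rank)
    loss = nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```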

Hey, thanks for the answer. Actually, I think it is because all 4 backpropagations are running on one GPU only. Is there any solution with which I can run it with a batch size of one? And what is the meaning of “repro”? Thanks in advance.

I tried with a batch size of 2 as well. The epoch time with a single GPU is 2.5 hours, and with four GPUs it is 3.3 hours. Any solution would be helpful.