Splitting batch at desired proportions in DataParallel

Hello everyone.

I just added a second GPU to my machine and I would like to leverage the functionality of DataParallel https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html to train my model on both. The problem is that the two GPUs have different amounts of memory: one has 24 GB, while the other has 12 GB. Hence, to fully leverage them it would be logical not to split the batch into two equal halves, but into 2/3 and 1/3.

To my knowledge there is no function that allows you to specify such a split. Does anybody know if there is one, or if there are other libraries that allow such a split?
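
For reference, a minimal sketch of what I am doing now (the model and data below are just placeholders): with plain DataParallel the batch is split into equal chunks across the two GPUs.

```python
import torch
import torch.nn as nn

# Placeholder model and data, only to illustrate the default behaviour.
model = nn.Linear(128, 10).to("cuda:0")
model = nn.DataParallel(model, device_ids=[0, 1])

batch = torch.randn(90, 128)         # batch of 90 samples
output = model(batch.to("cuda:0"))   # DataParallel splits it 45/45 across the two GPUs
# What I would like instead is a 60/30 (i.e. 2/3 vs 1/3) split.
```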

Hey @luchinoprince, DataParallel does not offer this API. Technically you can do that with DistributedDataParallel, but it could hang, as running two concurrent processes using NCCL on the same GPU can lead to undefined behavior.

Also, splitting the 24 GB GPU into two 12 GB virtual ones (i.e., running two DDP processes on it) might not give you the desired speedup, unless that GPU is also 2X faster than the 12 GB one.
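
To make the load-balancing argument concrete (the numbers below are made up): each iteration waits for the slowest GPU, so an uneven split only pays off if the compute speeds match the proportions.

```python
# Hypothetical throughput numbers, only to illustrate the reasoning above.
batch_size = 300
split = {"24GB GPU": 200, "12GB GPU": 100}           # 2/3 vs 1/3 of the batch
samples_per_ms = {"24GB GPU": 2.0, "12GB GPU": 1.0}  # assumes the big GPU is 2X faster

# Each iteration finishes only when the slowest GPU has processed its chunk.
step_time = max(split[g] / samples_per_ms[g] for g in split)
print(step_time)  # 100.0 ms on both GPUs -> balanced only because speeds are 2:1
```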

A workaround could be to copy and modify, or monkey-patch, the part of the code where DataParallel scatters its inputs, and send 2X the batch size to the 24 GB GPU.
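
Something along these lines might work as a starting point. This is only a sketch, not an official API: it subclasses `nn.DataParallel` and overrides its `scatter` step, and it assumes the wrapped module is called with a single positional tensor and no keyword arguments. The class name `UnevenDataParallel` and the `proportions` argument are made up for the example.

```python
import torch
import torch.nn as nn

class UnevenDataParallel(nn.DataParallel):
    """Sketch of a DataParallel variant that splits the batch at given proportions.

    Assumes the wrapped module takes a single positional tensor and no kwargs
    that need to be scattered. `proportions` is hypothetical, e.g. [2, 1] for a
    2:1 split between device_ids[0] and device_ids[1].
    """

    def __init__(self, module, device_ids, proportions, **kwargs):
        super().__init__(module, device_ids=device_ids, **kwargs)
        self.proportions = proportions

    def scatter(self, inputs, kwargs, device_ids):
        batch = inputs[0]
        batch_size = batch.size(self.dim)
        total = sum(self.proportions)
        # Turn the proportions into chunk sizes that sum exactly to batch_size.
        sizes = [batch_size * p // total for p in self.proportions]
        sizes[0] += batch_size - sum(sizes)   # hand any remainder to the first GPU
        chunks = torch.split(batch, sizes, dim=self.dim)
        scattered_inputs = tuple(
            (chunk.to(torch.device("cuda", dev)),)
            for chunk, dev in zip(chunks, device_ids)
        )
        # Nothing to scatter for kwargs in this simplified sketch.
        return scattered_inputs, tuple({} for _ in scattered_inputs)

# Hypothetical usage: 2/3 of each batch goes to GPU 0 (24 GB), 1/3 to GPU 1 (12 GB).
# model = UnevenDataParallel(MyModel().to("cuda:0"), device_ids=[0, 1], proportions=[2, 1])
```

The gather step back to the output device is inherited unchanged, so the rest of the training loop should not need to change.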

Hey @mrshenli, thanks for the reply and for the information.

Unfortunately, the 24 GB GPU is also faster than the 12 GB one. I will definitely try to change the code and see if I can leverage both GPUs. I guess it will take some time to understand how to change it properly, and especially where.

Thanks again for the information,
Luca