Splitting batch at desired proportions in DataParallel

Hello everyone.

I just added a second GPU to my machine and I would like to leverage the functionality of DataParallel https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html to train my model on both. The problem is that the two GPUs have different amounts of memory: one has 24 GB, while the other has 12 GB. Hence, to fully leverage them it would be logical not to split the batch into two equal halves, but into 2/3 and 1/3.

To my knowledge there is no function that allows you to specify such a split. Does anybody know if there is one, or if there are other libraries that allow such a split?
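
For reference, a minimal sketch of what I am doing now (the model and data below are just placeholders): with plain DataParallel the batch is split into equal chunks across the two GPUs.

```python
import torch
import torch.nn as nn

# Placeholder model and data, only to illustrate the default behaviour.
model = nn.Linear(128, 10).to("cuda:0")
model = nn.DataParallel(model, device_ids=[0, 1])

batch = torch.randn(90, 128)         # batch of 90 samples
output = model(batch.to("cuda:0"))   # DataParallel splits it 45/45 across the two GPUs
# What I would like instead is a 60/30 (i.e. 2/3 vs 1/3) split.
```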

Hey @luchinoprince, DataParallel does not offer this API. Technically you can do that with DistributedDataParallel, but it could hang, as running two concurrent processes using NCCL on the same GPU can lead to undefined behavior.

Also, splitting the 24 GB GPU into two 12 GB virtual ones (i.e., running two DDP processes on it) might not give you the desired speedup, unless that GPU is also 2X faster than the 12 GB one.
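
To make the load-balancing argument concrete (the numbers below are made up): each iteration waits for the slowest GPU, so an uneven split only pays off if the compute speeds match the proportions.

```python
# Hypothetical throughput numbers, only to illustrate the reasoning above.
batch_size = 300
split = {"24GB GPU": 200, "12GB GPU": 100}           # 2/3 vs 1/3 of the batch
samples_per_ms = {"24GB GPU": 2.0, "12GB GPU": 1.0}  # assumes the big GPU is 2X faster

# Each iteration finishes only when the slowest GPU has processed its chunk.
step_time = max(split[g] / samples_per_ms[g] for g in split)
print(step_time)  # 100.0 ms on both GPUs -> balanced only because speeds are 2:1
```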

A workaround could be to copy and modify, or monkey-patch, the part of the code where DataParallel scatters its inputs, and send 2X the batch size to the 24 GB GPU.
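
Something along these lines might work as a starting point. This is only a sketch, not an official API: it subclasses `nn.DataParallel` and overrides its `scatter` step, and it assumes the wrapped module is called with a single positional tensor and no keyword arguments. The class name `UnevenDataParallel` and the `proportions` argument are made up for the example.

```python
import torch
import torch.nn as nn

class UnevenDataParallel(nn.DataParallel):
    """Sketch of a DataParallel variant that splits the batch at given proportions.

    Assumes the wrapped module takes a single positional tensor and no kwargs
    that need to be scattered. `proportions` is hypothetical, e.g. [2, 1] for a
    2:1 split between device_ids[0] and device_ids[1].
    """

    def __init__(self, module, device_ids, proportions, **kwargs):
        super().__init__(module, device_ids=device_ids, **kwargs)
        self.proportions = proportions

    def scatter(self, inputs, kwargs, device_ids):
        batch = inputs[0]
        batch_size = batch.size(self.dim)
        total = sum(self.proportions)
        # Turn the proportions into chunk sizes that sum exactly to batch_size.
        sizes = [batch_size * p // total for p in self.proportions]
        sizes[0] += batch_size - sum(sizes)   # hand any remainder to the first GPU
        chunks = torch.split(batch, sizes, dim=self.dim)
        scattered_inputs = tuple(
            (chunk.to(torch.device("cuda", dev)),)
            for chunk, dev in zip(chunks, device_ids)
        )
        # Nothing to scatter for kwargs in this simplified sketch.
        return scattered_inputs, tuple({} for _ in scattered_inputs)

# Hypothetical usage: 2/3 of each batch goes to GPU 0 (24 GB), 1/3 to GPU 1 (12 GB).
# model = UnevenDataParallel(MyModel().to("cuda:0"), device_ids=[0, 1], proportions=[2, 1])
```

The gather step back to the output device is inherited unchanged, so the rest of the training loop should not need to change.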

Hey @mrshenli, thanks for the reply and for the information.

Unfortunately, the 24 GB GPU is also faster than the 12 GB one. I will definitely try to change the code and see if I can leverage both GPUs. I guess it will take some time to understand how to change it properly, and especially where.

Thanks again for the information,
Luca