How to parallelise a PyTorch model across multiple GPUs?


My model fits on a single GPU and works fine, but it is slow. To scale the computation, I want to parallelise the model so that I can use multiple GPUs.

I have done
model = torch.nn.DataParallel(model).cuda()

in place of model.cuda()

When I run this with 2 GPUs, it works with batch-size=1 but uses only a single GPU, and when I increase the batch size it fails with an out-of-memory error.

Any help would be appreciated.

Thanks :)

Try specifying the device_ids argument in nn.DataParallel() explicitly. With the default device_ids=None it should already use all visible GPUs, but setting it rules out a device-selection problem.


model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()

I changed it to this. It still runs only on the first of the GPUs I list in CUDA_VISIBLE_DEVICES, and then memory blows up.

It gives the same error.

Did you give the device_ids (e.g. [0,1]) in CUDA_VISIBLE_DEVICES?

I launched training with CUDA_VISIBLE_DEVICES=0,1, and the code also contains this line:

model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()

Usually I do

import os
# Set this before CUDA is initialised (i.e. before the first CUDA call)
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

and everything works fine. Maybe you can try this.

Hope it works.
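
To check whether the environment variable actually took effect, something like the following sketch may help. It assumes the variable is set before torch touches CUDA, and the device list '0,1' is only an example. If it prints fewer than 2 devices, DataParallel cannot use the second GPU no matter what device_ids says.

```python
import os

# Assumption: this runs before the first CUDA call, ideally before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

# Number of GPUs PyTorch can see after the mask is applied (0 on a CPU-only machine).
visible = torch.cuda.device_count()
print(f"PyTorch sees {visible} CUDA device(s)")

# Inside the process the visible devices are renumbered starting from 0.
for i in range(visible):
    print(i, torch.cuda.get_device_name(i))
```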

Maybe you are running out of memory on the default device, which scatters the inputs and gathers the outputs and therefore usually uses a bit more memory than the other devices.

This behavior is described in @Thomas_Wolf's blog post.

@ptrblck I am monitoring GPU memory usage with nvidia-smi. Only one of the GPUs fills up to ~11 GB before the out-of-memory error, while the other GPU stays at a baseline usage of about 2 MB.
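
As a complement to nvidia-smi, the per-device numbers can also be read from inside the training script. This is only a sketch, and gpu_memory_report is a hypothetical helper name. The values will be lower than what nvidia-smi shows, since nvidia-smi also counts the CUDA context and memory cached by PyTorch's allocator.

```python
import torch

def gpu_memory_report():
    """Return (device index, allocated MiB, peak allocated MiB) for each visible GPU."""
    report = []
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**2
        peak = torch.cuda.max_memory_allocated(i) / 1024**2
        report.append((i, allocated, peak))
    return report

# Call this e.g. after the forward pass and again after backward().
for idx, allocated, peak in gpu_memory_report():
    print(f"cuda:{idx}: {allocated:.1f} MiB allocated, {peak:.1f} MiB peak")
```

If the second GPU stays near zero here as well, the replicas are most likely never created on it.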

From what I understand of the blog post, that should not be the case.

How much memory is one GPU using for single GPU training? Is it also approx. 11GB or much lower?

It uses 8 GB for batch-size=1.

My expectation is that the network should first be loaded on both GPUs (when using 2), and only then might the error occur.

How large is your batch size using nn.DataParallel?
Would this batch with the model fit on a single GPU without running the forward and backward pass?

I would assume the same, i.e. that the gather on the "master" GPU might run OOM, but not during the initialization.

I am trying batch-size=2 with 2 GPUs for the parallel run.
Is there any more math I should do regarding the extra computation on GPU 0?

Just for the sake of completeness:
Based on a small chat, it seems this code base is used.
Currently the Trainer class provides convenient methods to train the model. However, skimming through the code it looks like some refactoring would be needed to make this code executable for nn.DataParallel, e.g. since the optimizer seems to be embedded in the trainer class.

I'm also not sure how these lines of code would be handled by nn.DataParallel, since no GPU id is passed to the cuda calls. It's currently a guess, but I think this might also cause the OOM issue in this case.
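
To illustrate the concern about cuda calls without a GPU id: a bare .cuda() puts the tensor on the current device (usually cuda:0), regardless of which replica is running. The sketch below uses a made-up AddNoise module, not code from the code base in question, and shows one common way to stay on the replica's own device by reading it from the input.

```python
import torch
import torch.nn as nn

class AddNoise(nn.Module):  # hypothetical module, for illustration only
    def forward(self, x):
        # Problematic under DataParallel: always lands on the current device,
        # which is typically cuda:0 for every replica:
        #   noise = torch.randn(x.shape).cuda()
        # Device-agnostic alternative: create the tensor where the input lives.
        noise = torch.randn(x.shape, device=x.device)
        return x + 0.1 * noise

x = torch.randn(4, 8)  # CPU tensor here; the same code works for cuda:N inputs
out = AddNoise()(x)
print(out.device == x.device)
```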

Does this mean that nn.DataParallel needs a torch.optim.Optimizer to indicate where to scatter and gather, and that a single .backward() call cannot be handled?

No, DataParallel in its basic form is just applied on the model, such that the input batch will be split in dim0 and each specified GPU will get a chunk. The forward and backward passes are executed in parallel and all necessary gradients etc. are finally gathered on the default device.
Your training routine should not change if you are using DataParallel.
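
The split along dim0 can be sketched with torch.chunk, which I am using here as a stand-in for what DataParallel's scatter does (approximately equal chunks along the batch dimension). chunk_sizes is a hypothetical helper and the feature size 3 is arbitrary. Note that a batch of size 1 yields a single chunk, so only one GPU can receive work in that case.

```python
import torch

def chunk_sizes(batch_size, num_gpus):
    """Per-device chunk sizes when a batch is split along dim 0."""
    batch = torch.zeros(batch_size, 3)
    return [c.shape[0] for c in torch.chunk(batch, num_gpus, dim=0)]

print(chunk_sizes(1, 2))  # [1]     -> only one GPU gets data
print(chunk_sizes(2, 2))  # [1, 1]  -> one sample per GPU
print(chunk_sizes(8, 2))  # [4, 4]
```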

However, in the mentioned code base, the Trainer class is handling the models, optimizers etc.
Using DataParallel by just wrapping your model in it and leaving all other code snippets as they were, is thus most likely not possible, and would need some code refactoring.
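
For comparison, this is roughly what a plain training loop looks like when the optimizer lives outside any trainer class. It is a minimal sketch with a toy model and random data (all sizes are made up), not the refactored version of the code base above. It falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Toy model; device_ids=None means all visible GPUs are used.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model = nn.DataParallel(model).to(device)

# The optimizer is built from the wrapped model's parameters, as usual.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(8, 16).to(device)          # batch of 8, split along dim 0
targets = torch.randint(0, 4, (8,)).to(device)

for step in range(3):
    optimizer.zero_grad()
    outputs = model(inputs)   # scatter, parallel forward, gather on the default device
    loss = criterion(outputs, targets)
    loss.backward()           # a single backward() call is enough
    optimizer.step()

print(tuple(outputs.shape), loss.item())
```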

Is there a way to split the model into several sub-models and parallelise them gradually? Handling a complex model as a whole is an almost impossible task. Thanks.
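
One way to read this question is as manual model parallelism, where each sub-model is placed on its own GPU and activations are moved between them. The sketch below assumes that reading; TwoStageModel and its layer sizes are hypothetical. The stages run one after the other, so this mainly helps with memory, not speed. On a machine with fewer than 2 GPUs both stages fall back to the same device.

```python
import torch
import torch.nn as nn

n_gpus = torch.cuda.device_count()
dev0 = torch.device("cuda:0") if n_gpus >= 1 else torch.device("cpu")
dev1 = torch.device("cuda:1") if n_gpus >= 2 else dev0

class TwoStageModel(nn.Module):  # hypothetical two-part model
    def __init__(self):
        super().__init__()
        # Each sub-model lives on its own device.
        self.stage1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(dev0)
        self.stage2 = nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        x = self.stage1(x.to(dev0))
        # Move the intermediate activations to the second device.
        return self.stage2(x.to(dev1))

model = TwoStageModel()
out = model(torch.randn(8, 16))
print(tuple(out.shape), out.device)
```

Do not wrap such a model in nn.DataParallel; the sub-models can be moved over one at a time, which may make a complex model easier to handle.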