How to parallelise a pytorch model on GPU?

(Harshil Jain) #1


My model fits on a single GPU and works nicely, but it is slow. To scale the computation, I want to parallelise the model across multiple GPUs.

I have done
model = torch.nn.DataParallel(model).cuda()

in place of model.cuda()

When I run this with 2 GPUs, it works with batch-size=1 but uses only a single GPU, and when I increase the batch size it runs out of memory.

Looking for response.

Thanks :slight_smile:


You can pass the argument device_ids to nn.DataParallel() explicitly. If it is left at its default of None, all visible GPUs are used.
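For reference, a minimal sketch of wrapping a model with explicit device_ids. The nn.Linear layer and the tensor sizes are hypothetical stand-ins, not the model from this thread, and the sketch falls back to a single GPU or the CPU when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn

net = nn.Linear(16, 4)  # hypothetical stand-in for the real model

n_gpus = torch.cuda.device_count()
device = torch.device("cuda:0" if n_gpus > 0 else "cpu")
net = net.to(device)  # the parameters must live on device_ids[0]

if n_gpus > 1:
    # Replicate the module on every visible GPU; inputs are split along dim 0.
    net = nn.DataParallel(net, device_ids=list(range(n_gpus)))

batch = torch.randn(8, 16, device=device)  # the batch starts on device_ids[0]
out = net(batch)                           # outputs are gathered on device_ids[0]
print(out.shape)
```

Note that the indices in device_ids refer to the devices that remain visible after CUDA_VISIBLE_DEVICES has been applied, not to the physical GPU numbers.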


(Harshil Jain) #3

model = torch.nn.DataParallel(model,device_ids=[0,1]).cuda()

I changed it to this. It still runs only on the first of the GPUs I list in CUDA_VISIBLE_DEVICES and then runs out of memory.

Giving the same error.


Did you give the device_ids (e.g. [0,1]) in CUDA_VISIBLE_DEVICES?

(Harshil Jain) #5

I launched the training command with CUDA_VISIBLE_DEVICES=0,1, and inside the code I also have this line:

model = torch.nn.DataParallel(model,device_ids=[0,1]).cuda()


Usually I do

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

and everything works fine. Maybe you can try this.

Hope it works.
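A small sketch of this approach, with one assumption made explicit: the variable is set before the first CUDA call (here, before torch is imported), because changing it after CUDA has been initialised has no effect. The try/except is only there so the snippet also runs on a machine without PyTorch.

```python
import os

# Set the variable before anything initialises CUDA; the simplest way is
# to do it before importing torch. The ids are written without spaces.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

try:
    import torch
    visible = torch.cuda.device_count()  # GPUs PyTorch can see after masking
except ImportError:
    visible = None  # PyTorch not installed; nothing to query

print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
print("GPUs visible to PyTorch:", visible)
```

If the second line prints 1 (or 0) on a two-GPU machine, nn.DataParallel has only one device to work with, whatever device_ids is set to.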


Maybe you are running out of memory on the default device, which scatters the inputs and gathers the outputs and therefore usually uses a bit more memory than the other devices.

This behavior is described in @Thomas_Wolf's blog post.

(Harshil Jain) #8

@ptrblck I am monitoring the GPU memory usage with nvidia-smi. Only one of the GPUs fills up, to ~11 GB, before the out-of-memory error, while the other GPU stays at a baseline usage of about 2 MB.

From what I understand of the blog post, that should not be the case.
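As a complement to nvidia-smi, PyTorch's own counters can be read per device from inside the training script. This is a sketch; gpu_memory_report is a hypothetical helper name, and the list is simply empty when no GPU is visible.

```python
import torch


def gpu_memory_report():
    """Return (device index, allocated MiB, peak allocated MiB) per visible GPU."""
    report = []
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024 ** 2
        peak = torch.cuda.max_memory_allocated(i) / 1024 ** 2
        report.append((i, allocated, peak))
    return report


# Call this e.g. right after wrapping the model and again after a forward pass.
for idx, allocated, peak in gpu_memory_report():
    print(f"cuda:{idx}: {allocated:.0f} MiB allocated, {peak:.0f} MiB peak")
```

These counters only cover tensors allocated by PyTorch, so they read lower than nvidia-smi, which also includes the CUDA context. A second GPU that stays at zero allocated memory after a forward pass would suggest that no replica is being placed on it.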


How much memory is one GPU using for single GPU training? Is it also approx. 11GB or much lower?

(Harshil Jain) #10

8 GB usage for batch-size=1

(Harshil Jain) #11

My expectation is that the network should get loaded on both GPUs (when using 2), and only then might the error occur.


How large is your batch size using nn.DataParallel?
Would this batch with the model fit on a single GPU without running the forward and backward pass?

I would assume the same, i.e. that the gather on the "master" GPU might run OOM, but not during the initialization.

(Harshil Jain) #13

I am trying with batch-size=2 and 2 GPUs for the parallel run.
Is there any more math I should do regarding the extra computation on GPU 0?
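On the batch-size side of that math, here is a rough sketch of how a batch is divided, assuming nn.DataParallel splits the input along dim 0 the way torch.chunk does (chunks of size ceil(batch_size / num_gpus), possibly fewer chunks than GPUs). per_gpu_batch_sizes is a hypothetical helper written in plain Python, not a PyTorch function.

```python
import math


def per_gpu_batch_sizes(batch_size, num_gpus):
    """Chunk sizes when a batch is split along dim 0, torch.chunk-style."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(per_gpu_batch_sizes(1, 2))  # [1]    -> only one GPU receives data
print(per_gpu_batch_sizes(2, 2))  # [1, 1] -> one sample per GPU
print(per_gpu_batch_sizes(5, 2))  # [3, 2]
```

Under that assumption, batch-size=1 can only ever occupy one GPU, and batch-size=2 on 2 GPUs should cost each GPU roughly what batch-size=1 costs on a single GPU (about 8 GB here), plus the gathered outputs on GPU 0.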


Just for the sake of completeness:
Based on a small chat, it seems this code base is used.
Currently the Trainer class provides convenient methods to train the model. However, skimming through the code it looks like some refactoring would be needed to make this code executable for nn.DataParallel, e.g. since the optimizer seems to be embedded in the trainer class.

I'm also not sure how these lines of code would be handled by nn.DataParallel, since no GPU id is passed to the cuda calls. It's currently a guess, but I think this might also cause the OOM issue in this case.
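Whether those bare cuda calls are the actual cause remains a guess, but one common way to sidestep the question is to take the device from the input tensor whenever a new tensor is created inside the model. The module below is a hypothetical illustration, not code from the code base discussed here.

```python
import torch
import torch.nn as nn


class ScaledLinear(nn.Module):
    """Hypothetical module that needs a temporary tensor in forward()."""

    def __init__(self, features):
        super().__init__()
        self.linear = nn.Linear(features, features)

    def forward(self, x):
        # Allocate the helper tensor on the input's device instead of
        # calling .cuda() without a GPU id.
        scale = torch.ones(x.size(0), 1, device=x.device, dtype=x.dtype)
        return self.linear(x) * scale


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = ScaledLinear(8).to(device)
x = torch.randn(4, 8, device=device)
out = model(x)
print(out.shape, out.device)
```

Written this way, the module does not depend on which GPU happens to be the current device, so each replica keeps its temporaries on the GPU that holds its slice of the batch.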
