How to parallelise a PyTorch model across multiple GPUs?

Hi!

My model fits on a single GPU and works nicely, but it is slow. In order to scale the computation, I want to parallelise the model so that I can use multiple GPUs.

I have done
model = torch.nn.DataParallel(model).cuda()

in place of model.cuda()

When I run this with 2 GPUs, it works with batch-size=1 but only uses a single GPU, and when I increase the batch size it runs out of memory.

Looking forward to a response.

Thanks! 🙂


Try specifying the device_ids argument in nn.DataParallel() explicitly, so that both GPUs are definitely used.

Best

model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()

I changed it to this, but it is still running only on the first GPU that I pass in CUDA_VISIBLE_DEVICES and then running out of memory.

It gives the same error.


Did you also expose the corresponding device ids (e.g. 0,1) via CUDA_VISIBLE_DEVICES?

I launch training with CUDA_VISIBLE_DEVICES=0,1 on the command line, and inside the code I have this line:

model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()

Usually I do

import os
# must be set before the first CUDA call (safest: before importing torch)
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

and everything works fine. Maybe you can try this.

Hope it works.

Maybe you are running out of memory on the default device, which gathers and scatters some parameters and thus usually uses a bit more memory than the other devices.

This behavior is described in @Thomas_Wolf's blog post.
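
If you want to double-check this imbalance from inside the script (in addition to nvidia-smi), a minimal sketch like the following should work; note that memory_reserved is called memory_cached in older PyTorch releases:

import torch

# report the memory PyTorch has allocated and reserved on each visible GPU
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**2
    reserved = torch.cuda.memory_reserved(i) / 1024**2
    print(f"cuda:{i} -> allocated {allocated:.0f} MB, reserved {reserved:.0f} MB")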


@ptrblck I am monitoring the GPU memory usage with nvidia-smi. Only one of the GPUs fills up to ~11 GB before the out-of-memory error is raised, while the other GPU stays at a baseline usage of about 2 MB.

As I understand it from the blog post, that should not be the case.

How much memory does one GPU use for single-GPU training? Is it also approx. 11 GB or much lower?

About 8 GB for batch-size=1.

My expectation is that the network should first get loaded on both GPUs (if using 2), and only then might the error appear.

How large is your batch size when using nn.DataParallel?
Would this batch, together with the model, fit on a single GPU without running the forward and backward pass?

I would assume the same, i.e. that the gather on the "master" GPU might run out of memory, but not during the initialization.

I am trying with batch-size=2 and 2 GPUs for DataParallel.
Is there any more math I should do regarding the extra memory usage on GPU 0?

Just for the sake of completeness:
Based on a small chat, it seems this code base is used.
Currently the Trainer class provides convenient methods to train the model. However, skimming through the code, it looks like some refactoring would be needed to make it work with nn.DataParallel, e.g. since the optimizer seems to be embedded in the Trainer class.

I'm also not sure how these lines of code would be handled by nn.DataParallel, since no GPU id is passed to the cuda calls. It's currently a guess, but I think this might also cause the OOM issue in this case.
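
As an illustration only (this toy module is made up and not taken from the linked code base): a bare .cuda() call inside forward always allocates on the default GPU, which clashes with nn.DataParallel running replicas on other devices; deriving the device from the input chunk avoids that.

import torch
import torch.nn as nn

class ToyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        # problematic with nn.DataParallel: always lands on the default GPU,
        # even when this replica is running on cuda:1
        # mask = torch.ones(x.size(0), 10).cuda()

        # device-agnostic alternative: follow the device of the input chunk
        mask = torch.ones(x.size(0), 10, device=x.device)
        return self.fc(x) * mask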


Does this mean that nn.DataParallel needs the torch.optim.Optimizer to indicate where to scatter and gather, and that a single .backward() call cannot be handled?

No, DataParallel in its basic form is just applied to the model, such that the input batch is split along dim0 and each specified GPU gets a chunk. The forward and backward passes are executed in parallel, and all necessary gradients etc. are finally gathered on the default device.
Your training routine should not change if you are using DataParallel.
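
For reference, a minimal sketch of that usual pattern (the model, data, and hyperparameters below are toy placeholders, not taken from your code base):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()  # replicas on cuda:0 and cuda:1

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# a batch of 8 samples is split along dim0: 4 samples per GPU
inputs = torch.randn(8, 100).cuda()
targets = torch.randint(0, 10, (8,)).cuda()

optimizer.zero_grad()
outputs = model(inputs)          # forward runs in parallel on both replicas
loss = criterion(outputs, targets)
loss.backward()                  # gradients end up on the default device
optimizer.step()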

However, in the mentioned code base, the Trainer class handles the models, optimizers, etc.
Using DataParallel by just wrapping your model in it and leaving all other code snippets as they are is thus most likely not possible and would need some code refactoring.

Is there a way to split the model into several sub-models and make them parallel gradually? Handling a complex model as a whole is an almost impossible task. Thanks.
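
I mean something like this rough sketch (a made-up toy two-stage model, not working code for my case), where each sub-module lives on its own GPU and the intermediate activation is moved between devices in forward:

import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # first part of the network lives on cuda:0, second part on cuda:1
        self.part1 = nn.Sequential(nn.Linear(100, 50), nn.ReLU()).to('cuda:0')
        self.part2 = nn.Linear(50, 10).to('cuda:1')

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        # move the intermediate activation to the second GPU
        x = self.part2(x.to('cuda:1'))
        return x

model = SplitModel()
out = model(torch.randn(8, 100))  # the output ends up on cuda:1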