If I want to use multiple GPUs for a network, should I specifically write my network so that it is designed to be trained on multiple GPUs, or can I just add some command to switch between one GPU and multiple GPUs?
You can just wrap your model in DataParallel and specify the device_ids you would like to use.
The data will be split along the batch dimension. See the data parallel tutorial for more information.
If your system has >=5 GPUs, your code should work.
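A minimal sketch of that switch (the nn.Linear is just a stand-in for your network):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for your network

if torch.cuda.device_count() > 1:
    # replicate the model onto the listed GPUs; inputs are split along dim 0
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model = model.to('cuda')

x = torch.randn(8, 10).to('cuda')  # batch of 8 -> 2 samples per GPU
out = model(x)                     # output is gathered on device_ids[0]
```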
Could you print gpu_ids before passing it to nn.DataParallel?
Also, what does nvidia-smi show?
Do you see any utilization on the four GPUs?
Are you passing a batch with more than 5 samples to the model?
Note that each chunk of the batch will be sent to a different GPU, so you should pass at least one sample per GPU.
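A quick way to verify the chunking is a throwaway module that prints what each replica receives; a sketch:

```python
import torch
import torch.nn as nn

class Probe(nn.Module):
    def forward(self, x):
        # each replica prints the chunk of the batch it received
        print(f'{x.device}: input of shape {tuple(x.shape)}')
        return x

n = torch.cuda.device_count()
model = nn.DataParallel(Probe(), device_ids=list(range(n))).to('cuda')
_ = model(torch.randn(2 * n, 16).to('cuda'))  # expect 2 samples per GPU
```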
In the tutorial, it mentions nothing about training (i.e., no loss function, loss.backward(), or optimizer.step()).
When the model is converted to a DataParallel model, does backprop get seamlessly handled behind the scenes? I guess the same gradients would be passed to all instances of the model across GPUs, is that right? Or is there another mechanism to synchronise the model instances across GPUs?
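For reference, I am assuming the training loop itself stays the same as in the single-GPU case, i.e. roughly:

```python
import torch
import torch.nn as nn

net = nn.Linear(20, 5)                        # stand-in for the real model
model = nn.DataParallel(net).to('cuda')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# dummy batch standing in for a real DataLoader
loader = [(torch.randn(8, 20), torch.randint(0, 5, (8,)))]

for inputs, targets in loader:
    inputs, targets = inputs.to('cuda'), targets.to('cuda')
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)  # forward runs on all GPUs
    loss.backward()                           # per-replica gradients are reduced
                                              # onto the base module on GPU 0
    optimizer.step()                          # one update; replicas are rebuilt
                                              # on the next forward pass
```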
When using DataParallel, is there a way to set a maximum allocated memory per GPU?
When running my custom loss, additional memory gets allocated during the forward pass, and it is NOT only a few hundred megabytes. DataParallel is leaving some additional space in my GPU memory, but it's not enough to run the loss's forward pass.
Other than that, how does DataParallel decide how to split data?
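My (possibly wrong) mental model is that it simply chunks the input along dim 0, roughly like:

```python
import torch

batch = torch.randn(10, 3, 224, 224)
# with 4 GPUs, the scatter behaves roughly like torch.chunk on dim 0
chunks = torch.chunk(batch, 4, dim=0)
print([c.shape[0] for c in chunks])  # [3, 3, 3, 1]
```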
Not that I'm aware of. Did the workaround mentioned in the blog post not work for you?
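As I understand the blog post, the idea is to move the loss computation into the model's forward so that it is parallelized as well; a minimal sketch of that pattern (the class name is mine):

```python
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Compute the loss inside forward so each replica keeps its own
    share of the loss activations instead of piling them onto GPU 0."""
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, x, target):
        return self.criterion(self.model(x), target)

# wrapped = nn.DataParallel(ModelWithLoss(model, criterion)).to('cuda')
# loss = wrapped(inputs, targets).mean()  # one partial loss per replica
```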
I implemented everything as mentioned in the blog post, but GPU #1 still runs out of memory when I increase the batch_size. I notice a significant increase in memory inside the loss's forward method.
So I think initialization is fine, but at runtime it simply uses more memory with every computation added to the graph inside the loss's forward method.
DataParallel would have to have a mechanism to predict how much memory the model as well as the loss (criterion) will use, right? So I should either expose that information to DataParallelCriterion when initializing it or change my loss function.
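In the meantime I am measuring what each device actually uses with something like:

```python
import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024**2
    peak = torch.cuda.max_memory_allocated(i) / 1024**2
    print(f'GPU {i}: {alloc:.0f} MB allocated, {peak:.0f} MB peak')
```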
I am trying to parallelize an existing model (a Transformer) on multiple GPUs. I have an NVIDIA DGX-1 with 8 Volta GPUs.
These are the issues/questions I ran into:
- It is not clear which value I should set for model.to(device); I am currently trying simply .to('cuda') (see the sketch after this list).
- The model uses a lambda LR scheduler, but with more than one GPU it exits with the execution error "UnboundLocalError: local variable 'values' referenced before assignment".
- How can I effectively check tensor boundary limits? I am getting the impression that moving a model from one to multiple GPUs requires a profound refactoring of the code; it is extremely difficult to understand where indices can go out of bounds.
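For the first issue, what I currently have (not sure it is correct) is along these lines:

```python
import torch
import torch.nn as nn

model = nn.Transformer()                 # stand-in for my actual model
device = torch.device('cuda:0')          # DataParallel gathers outputs
                                         # on device_ids[0]
model = nn.DataParallel(model, device_ids=list(range(8))).to(device)
```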