How to use multiple GPUs (DataParallel) for training a model that used to use one GPU

If I want to use multiple GPUs for a network, should I specifically write my network so that it is designed to be trained on multiple GPUs, or can I just add some code to switch between one GPU and multiple GPUs?

For example, let's say I'm following this example (training a classifier), and I used to train it on one GPU. Can I easily do something to train it on multiple GPUs?

If the answer is no, please consider this as a possible future feature.


You can just wrap your model in DataParallel and specify the device_ids you would like to use.
The data will be split along the batch dimension. See the data parallel tutorial for more information.
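For illustration, a minimal sketch of the wrapping (the toy `nn.Linear` model, the shapes, and `device_ids=[0, 1]` are placeholders, not code from the question):

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# A toy model standing in for any single-GPU model (placeholder).
model = nn.Linear(10, 2).to(device)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on each listed device and
    # splits the input batch along dim 0.
    model = nn.DataParallel(model, device_ids=[0, 1])

out = model(torch.randn(8, 10, device=device))
print(out.shape)  # torch.Size([8, 2])
```

The rest of the training code stays unchanged; the wrapped model is called exactly like the original one.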


@ptrblck So I should just easily add:

import torch
import torch.nn as nn

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0: [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

to my code and that is it?

BTW, the output of that command on my system is:

("Let's use",3L,'GPUs!')

I am not sure what the L after 3 means, but I guess that is no biggie for now; I understood the message…

One more question.
Let's say I have 3 GPUs but I just want to use 2 of them. How should I specify that? :slight_smile:

The 3L just means the value is printed as a Python 2 long type.
If you would like to use your first two GPUs, just pass the device_ids to DataParallel:

model = nn.DataParallel(model, device_ids=[0, 1])
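As an alternative (not mentioned in the answer above), you can mask devices at the process level with the `CUDA_VISIBLE_DEVICES` environment variable; note it must be set before CUDA is initialized:

```python
import os

# Must be set before torch initializes CUDA (e.g. at the very top of the script).
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model
if torch.cuda.device_count() > 1:
    # Inside this process the two visible GPUs are re-indexed as cuda:0 and cuda:1.
    model = nn.DataParallel(model)

print(torch.cuda.device_count() <= 2)  # at most the two masked devices are visible
```

With this approach `DataParallel` needs no `device_ids` argument, since only the masked GPUs exist from the process's point of view.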

I cannot distribute the model to multiple specified GPUs, e.g. when I pass 1,2,3,4 from args.

Could you please help me with this?


use_cuda = torch.cuda.is_available()
if use_cuda:
    gpu_ids = list(map(int, args.gpu_ids.split(',')))
    cuda = 'cuda:' + str(gpu_ids[0])
    model = nn.DataParallel(model, device_ids=gpu_ids)

device = torch.device(cuda if use_cuda else 'cpu')

The GPU mapping starts at index 0, so try to pass 0,1,2,3 as the argument.

Suppose GPU 0 is running other users' scripts. What should I do?

If your system has >=5 GPUs, your code should work.
Could you print gpu_ids before passing it to nn.DataParallel?
Also, what does nvidia-smi show?
Do you see any utilization in the four GPUs?

  1. 8 GPUs in the machine.
  2. I have printed gpu_ids. It shows what I passed from the argument, say, 0,1,2,3,4.
  3. nvidia-smi shows that only GPU 0 is working.
  4. No other scripts are running.

Are you passing a batch with more than 5 samples to the model?
Note that each chunk of the batch will be sent to a different GPU, so you should pass at least one sample for each GPU.
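To illustrate why: DataParallel's scatter splits the batch along dim 0, similar to `torch.chunk` (a sketch; the sizes here are made up):

```python
import torch

batch = torch.randn(3, 10)  # only 3 samples
num_gpus = 4

# Splitting 3 samples across 4 devices yields only 3 non-empty chunks,
# so the fourth replica would receive no data.
chunks = torch.chunk(batch, num_gpus, dim=0)
print(len(chunks))  # 3
```

So with a batch smaller than the number of devices, some GPUs simply never receive work, which matches the observed idle GPUs in nvidia-smi.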

Oops… maybe you are right. For some reason, I have to run a prediction before batch training.

I will check my code.

Thanks a lot

In the tutorials, nothing is mentioned about training (i.e., no loss function, loss.backward(), or optimizer.step()).

When the model is converted to a DataParallel model, does the backprop get seamlessly handled behind the scenes? I guess the same gradients would be passed to all instances of the model across gpus, is that right? Is there another mechanism to synchronise the model instances across gpus?


The underlying workflow of DataParallel is described in this blog post by @Thomas_Wolf in a detailed way.
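In short, no extra code is needed for backprop: gradients from each replica are accumulated onto the model on the source device during `backward()`, and a plain training step works unchanged. A minimal sketch (the toy model, data, and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(10, 2).to(device)   # placeholder model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x = torch.randn(8, 10, device=device)
y = torch.randn(8, 2, device=device)

optimizer.zero_grad()
loss = criterion(model(x), y)  # forward is scattered/gathered internally
loss.backward()                # replica gradients are reduced onto the source device
optimizer.step()
print(loss.item() >= 0)
```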

You could use the functional API and apply the scatter and gather methods manually, if you need more control over what’s being executed.
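A rough sketch of that functional path, mirroring what `DataParallel.forward` does internally (the `device_ids` and the toy model are assumptions; the code falls back to a plain forward without multiple GPUs):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

model = nn.Linear(10, 2)   # placeholder model
batch = torch.randn(8, 10)

if torch.cuda.device_count() > 1:
    device_ids = [0, 1]
    model = model.to('cuda:0')
    inputs = scatter(batch, device_ids)        # split the batch across GPUs
    replicas = replicate(model, device_ids)    # copy the model to each GPU
    outputs = parallel_apply(replicas, inputs) # run each replica on its chunk
    result = gather(outputs, target_device=0)  # collect the outputs on GPU 0
else:
    result = model(batch)

print(result.shape)  # torch.Size([8, 2])
```

This gives you a hook at each stage (e.g. a custom scatter for non-tensor inputs) instead of the all-in-one `nn.DataParallel` wrapper.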


When using DataParallel, is there a way to set the maximum allocated memory for a single GPU?

When running my custom loss, additional memory gets allocated during the forward pass, and it is NOT only a few hundred megabytes. DataParallel leaves some additional space in my GPU memory, but it's not enough for running the forward pass of the loss.

Other than that, how does DataParallel decide how to split data?


Not that I'm aware of. Did the workaround mentioned in the blog post not work for you?

The batch will be chunked along dim0 based on the number of available GPUs.


Not that I'm aware of. Did the workaround mentioned in the blog post not work for you?

I implemented everything as mentioned in the blog post but GPU #1 still runs out of memory when increasing the batch_size. I notice a significant increase in memory inside the loss forward method.

So I think initialization is fine but at runtime it just uses more memory with every computation added to the graph inside the loss forward method.

DataParallel would need a mechanism to predict how much memory will be used by the model as well as the loss (criterion), right? So I should either expose that information to DataParallelCriterion when initializing it or change my loss function.

What’s the correct way of doing this?

Is your training running fine on a single GPU?
If the loss is consuming so much memory, I would assume a single GPU should consume the same amount.

Not really, as the computation graph is defined dynamically based on your forward pass.
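There is no upfront prediction, but you can measure the peak usage per device after a forward pass, e.g. with `torch.cuda.max_memory_allocated` (a sketch, guarded so it also runs without a GPU; the allocation is a made-up workload):

```python
import torch

if torch.cuda.is_available():
    # Reset the peak counter, do some work, then read the high-water mark.
    torch.cuda.reset_peak_memory_stats(0)
    x = torch.randn(1024, 1024, device='cuda:0')
    y = x @ x  # placeholder workload that allocates memory
    peak = torch.cuda.max_memory_allocated(0)
else:
    peak = 0  # no GPU to measure

print('peak bytes on cuda:0:', peak)
```

Running this around the loss forward pass would show how much extra memory the loss computation itself adds on each device.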

Yes, on a single GPU with a smaller batch size it works fine.

I am trying to parallelize an existing model (Transformer) on multiple GPUs. I have an NVIDIA DGX-1 with 8 Volta GPUs.

These are the issues/questions I ran into:

  • it is not clear which value I should set; currently I am simply trying .to('cuda')
  • the model uses a lambda LR scheduler, but with GPUs > 1 it exits with the execution error "UnboundLocalError: local variable 'values' referenced before assignment"
  • how can I effectively check tensor boundary limits?

I am getting the impression that moving a model from 1 to multiple GPUs requires profound refactoring of the code. It is extremely difficult to understand where indices can go out of bounds.


I came across the same problem as you did. Could you please tell me how you fixed it? Thanks!