How to use multiple GPUs in PyTorch?

Hi,

I have a PyTorch model for machine translation, and I rented a machine with 4 GPUs.

The dataset is very large.

How can I train my model using all 4 GPUs?

Currently, the model uses only 1 GPU.

So, my questions are:

  1. What are the mechanics of training on 4 GPUs?

  2. Is there a way I can make my model run on all 4 GPUs?

Thanks.

Hi, any help will save me a lot of money :slight_smile:

Have a look at torch.nn.parallel.DistributedDataParallel. :wink:
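A minimal one-file sketch of the setup (the model, port, and tensor shapes are just placeholders):

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def worker(rank, world_size):
        # One process per GPU; all processes join the same process group.
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group('nccl', rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = nn.Linear(10, 2).cuda(rank)        # placeholder model
        ddp_model = DDP(model, device_ids=[rank])

        out = ddp_model(torch.randn(8, 10, device=f'cuda:{rank}'))
        out.sum().backward()                       # gradients are all-reduced across processes
        dist.destroy_process_group()

    if __name__ == '__main__':
        mp.spawn(worker, args=(4,), nprocs=4)      # 4 GPUs -> 4 processes

In a real training script you would also give each process its own shard of the data, e.g. via torch.utils.data.distributed.DistributedSampler.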

So PyTorch (or a machine with multiple GPUs) does not use the multiple GPUs by itself?

That’s right. You can use a one-liner and wrap your model in nn.DataParallel, or use the recommended DDP approach.
Alternatively, you could use model sharding and split the model across all GPUs if you are working with a huge model.
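A minimal sketch of both approaches (the models and tensor shapes are made up):

    import torch
    import torch.nn as nn

    # Data parallelism: replicate the model on every GPU, split the batch
    # along dim0, and gather the outputs on the default device.
    model = nn.DataParallel(nn.Linear(10, 2)).cuda()
    out = model(torch.randn(8, 10).cuda())

    # Model sharding: place parts of a huge model on different GPUs and
    # move the activations between them in forward.
    class ShardedModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.stage1 = nn.Linear(10, 10).to('cuda:0')
            self.stage2 = nn.Linear(10, 2).to('cuda:1')

        def forward(self, x):
            x = self.stage1(x.to('cuda:0'))
            return self.stage2(x.to('cuda:1'))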


Thanks, I am having some issues with multi-GPU training. I have multiple encoders, and the data parallel module is splitting along different dimensions for different encoders, so I get an error in the decoder.

Does the data parallel module assume the first dimension is the batch? The LSTM module assumes the batch is the middle dimension.

:slight_smile:

Yes, the batch will be chunked in dim0.
You could try to permute the data or use batch_first=True in your LSTM.
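A small sketch of the batch_first layout (sizes are made up):

    import torch
    import torch.nn as nn

    # With batch_first=True the input layout is (batch, seq_len, features),
    # so dim0 is the batch and matches nn.DataParallel's chunking dimension.
    lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

    x = torch.randn(8, 20, 16)    # (batch, seq_len, features)
    out, (h, c) = lstm(x)         # out: (8, 20, 32)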


Hello,

I have been trying to train additional models / do work on a second GPU of a machine, but I am running into issues. I have confirmed that torch.cuda recognizes 2 GPUs, but I cannot switch to the second GPU to train different models in parallel.

I think I solved this by adding:

torch.backends.cudnn.enabled = True
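For reference, a minimal sketch of how I now place a model explicitly on the second GPU (the model and shapes are placeholders):

    import torch
    import torch.nn as nn

    device = torch.device('cuda:1')        # second GPU
    model = nn.Linear(10, 2).to(device)    # placeholder model
    x = torch.randn(8, 10, device=device)
    out = model(x)                         # runs entirely on cuda:1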

Many thanks for your help.
I got the following error message; could you tell me how to fix it? Thanks a lot.

RuntimeError: expected device cuda:3 and dtype Float but got device cuda:0 and dtype Float

If you are using nn.DataParallel, this error is often raised if new tensors are created in the forward method and pushed to the default device (cuda:0).
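For illustration, a hypothetical forward that would raise it, and the fix:

    import torch
    import torch.nn as nn

    class Broken(nn.Module):
        def forward(self, x):
            # Bug: .cuda() places the new tensor on the default device (cuda:0),
            # while x lives on the current replica's device under nn.DataParallel.
            offset = torch.ones(x.size(0), 1).cuda()
            return x + offset

    class Fixed(nn.Module):
        def forward(self, x):
            # Create the tensor directly on the same device as the input instead.
            offset = torch.ones(x.size(0), 1, device=x.device)
            return x + offset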
Could you post the model definition so that we could have a look?


Many thanks for your prompt reply :slight_smile:
Here is my forward definition. The network is quite big, but it is a UNet-based model for semantic segmentation using a shared encoder.

    def forward(self, x):
        # Run each half of the input channels through the shared encoder
        features_x1 = self.encoder(x[:, : self.in_channels // 2, :, :])
        features_x2 = self.encoder(x[:, self.in_channels // 2 :, :, :])
        # Concatenate the two feature pyramids along the channel dimension
        features = [torch.cat([x1, x2], 1) for x1, x2 in zip(features_x1, features_x2)]
        # Reduce the doubled channel count back down for the decoder
        features = [self.res[i].to(x.device)(x) for i, x in enumerate(features)]

        decoder_output = self.decoder(*features)
        masks = self.segmentation_head(decoder_output)

        return masks

Is it necessary to call self.res[i].to(x.device) in the list comprehension?
Based on the code, it should be a no-op, since self.res should already be on the corresponding device, if it’s registered as a module.
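For context, nn.DataParallel only replicates registered submodules, so this assumes self.res is e.g. an nn.ModuleList rather than a plain Python list; a hypothetical sketch of the difference:

    import torch.nn as nn

    class Unregistered(nn.Module):
        def __init__(self):
            super().__init__()
            # A plain Python list is invisible to nn.Module: these convs won't
            # appear in .parameters(), won't move with .to(), and won't be
            # replicated by nn.DataParallel.
            self.res = [nn.Conv2d(8, 4, 1) for _ in range(3)]

    class Registered(nn.Module):
        def __init__(self):
            super().__init__()
            # nn.ModuleList registers each conv properly.
            self.res = nn.ModuleList(nn.Conv2d(8, 4, 1) for _ in range(3))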
Could you post the complete error message, including the line of code which raises this error?

Hello (sorry for the late reply)

Regarding your first question about self.res[i].to(x.device):
I do this because I need to reduce the channel size (c * 2) of the concatenated features from the previous step.
Here, self.res[i] is nn.Conv2d(c * 2, c, kernel_size=1, stride=1, padding=0), where c is the channel size of x1 and x2.

As for the full error message:

Epoch: 1
train:   0%|                                                                       | 0/146 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "train_network.py", line 112, in <module>
    train_logs, *_ = train_epoch.run(train_dataloader)
  File "/uge_mnt/home/bruno/parallel/codes/tools.py", line 125, in run
    loss, y_pred = self.batch_update(x, y)
  File "/uge_mnt/home/bruno/parallel/codes/tools.py", line 180, in batch_update
    loss.backward()
  File "/home/xxx/apps/intelpython3/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/xxx/apps/intelpython3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: expected device cuda:3 and dtype Float but got device cuda:2 and dtype Float

Many thanks in advance for your help :slight_smile:

Thanks for the follow-up.
Could you post a minimal, executable code snippet to reproduce this error, please? I cannot see any issues in the current code.