Run PyTorch on Multiple GPUs

  1. torch.device('cuda') will use the default CUDA device. It should be the same as cuda:0 in the default setup. However, if you are using a context manager as described in this example (e.g. with torch.cuda.device(1):), 'cuda' will refer to the specified device.
  2. In the default context, they will be the same. However, I think input.cuda() will also use the default device as in point 1. I would recommend sticking to the .to() operator, as it makes the code easy to write in a device-agnostic way (see the short sketch after this list).
  3. I’m unfortunately not familiar with torchtext, but based on the doc, your suggestion makes sense. Let’s wait for other answers on this point. :wink:
  4. Yes, that’s right. You’ll see an unbalanced GPU usage, as beautifully explained by @Thomas_Wolf in his blog post.
  5. Regarding nn.DistributedDataParallel, I try to stick to the NVIDIA apex examples. I’m currently not sure if there is still a difference between the apex and PyTorch implementations of DistributedDataParallel or if they are on par now. Maybe @mcarilli or @ngimel might have an answer for this point.
  6. I’m not sure and would guess not. However, I’ve seen some papers explaining that the momentum might need to be adapted for large batch sizes. Take this info with a grain of salt and let’s hear other opinions.
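
Here is a minimal sketch of the device-agnostic pattern from points 1 and 2; the model and tensor names are made up for illustration:

import torch
import torch.nn as nn

# Device-agnostic setup: fall back to CPU if no GPU is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 2).to(device)   # toy model, just for illustration
x = torch.randn(4, 10).to(device)     # move the input to the same device
out = model(x)

# Inside a device context manager, 'cuda' refers to the selected device.
if torch.cuda.device_count() > 1:
    with torch.cuda.device(1):
        y = torch.randn(4, 10, device='cuda')   # allocated on cuda:1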

@ptrblck is this an absolute requirement to have num_workers=0 for multiple GPUs training?

No, it’s not a requirement. Do you see any issues using multiple workers?

It is probably not the source of my problem. Thanks for the quick reply. I’ll post a code snippet here if I don’t solve this in the next hour.

@ptrblck from what I understand as of now, after trial and error and reading this quote:

Data Parallelism is when we split the mini-batch of samples into multiple smaller mini-batches and run the computation for each of the smaller mini-batches in parallel.

When you wrap your model in nn.DataParallel, the big idea is that you can increase your batch size without increasing your training time per batch. Say you have one GPU training with a batch size of 16; it will take approximately the same time for 8 similar GPUs to train with a batch size of 128 (16*8).

Is that line of reasoning correct?

edit/extra comment:
It also seems that the number of workers for the DataLoader can affect the data loading bottleneck, and thus the training time. When I was using 20 workers on a machine with 20 CPUs + 8*V100 on GCP/Paperspace, training was slower (but I can’t tell the exact reason). Once I reduced the workers to 15, the training time per epoch was reduced by 4x.


That would be the ideal linear scaling you could achieve, thus reducing the epoch time by the number of GPUs.

Too many CPU workers might slow down the data loading. I’m not an expert on this topic, but I always refer to @rwightman’s post.
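
To make both knobs concrete, here is a minimal sketch with a placeholder model and dataset (none of the names come from the thread):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and dataset, just to show the wiring.
model = nn.Linear(128, 10)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

n_gpus = max(torch.cuda.device_count(), 1)

# Wrap in nn.DataParallel and scale the batch size by the number of GPUs,
# so each replica still sees a per-GPU batch size of 16.
if n_gpus > 1:
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

loader = DataLoader(
    dataset,
    batch_size=16 * n_gpus,
    shuffle=True,
    num_workers=4,        # tune this; too many workers can slow loading down
    pin_memory=True,
)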


Hello,
I am working on video recognition and each of my batches has a shape of roughly (150, 3, 224, 224). I have 4 GPUs; if I use nn.DataParallel it will split the batch across them. How can I solve the problem when the single batch is too big?
Regards

If you need this batch size, you could try to trade compute for memory using torch.utils.checkpoint.
I haven’t tried it with nn.DataParallel yet, but it should work.
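
A hedged sketch of activation checkpointing with torch.utils.checkpoint; the model here is a made-up stack of conv layers, not the poster’s actual network:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A made-up deep sequential model standing in for the real one.
model = nn.Sequential(*[
    nn.Sequential(nn.Conv2d(3 if i == 0 else 64, 64, 3, padding=1), nn.ReLU())
    for i in range(8)
])

x = torch.randn(2, 3, 224, 224, requires_grad=True)

# Split the model into 2 segments; only the segment boundaries keep their
# activations, the rest are recomputed during the backward pass.
out = checkpoint_sequential(model, 2, x)
loss = out.mean()
loss.backward()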

Hi, @ptrblck
Thank you for your nice answers, but I still have a problem when using PyTorch with multiple GPUs.
I get very imbalanced GPU memory usage. When I want to use a larger batch_size, I get an "out of memory" error.


And I am very sure my code is right (I followed the instructions of the PyTorch tutorial for multiple GPUs).
What can I do to fully utilize the memory of all GPUs?


The usage seems to be way too imbalanced for a typical nn.DataParallel use case.
In my previous post I mentioned the blog post in point 4, which explains the imbalance in memory usage; however, in your current setup it looks like devices 1-3 are also creating the CUDA context.
Are you seeing any usage in the GPU-Util section of nvidia-smi?
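
As a quick check (a generic snippet, not from the thread), you could also print what PyTorch itself has allocated on each visible device:

import torch

# Print PyTorch's allocated/reserved memory per visible GPU.
for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024**2
    reserved = torch.cuda.memory_reserved(i) / 1024**2
    print(f"cuda:{i} allocated: {alloc:.0f} MiB, reserved: {reserved:.0f} MiB")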


Hi @kevin_sandy,

I’m having exactly this same issue. I’m trying to parallelize across 2 GPUs but only one is showing high memory usage (say 23000MiB) and the other one 11MiB (basically nothing).

I’m also implementing nn.DataParallel(model) correctly, as in the tutorial.

Were you able to find a workaround for this?

Best

Hi,

When I use DataParallel to make my model run on two GPUs, my model gets changed. I mean, the children structure of my model changes: with a single GPU I could see two children, but after using DataParallel I can see only one child of the model.
Can someone please clarify this?
Thank you.

nn.DataParallel wraps the model into model.module. Could this explain the observed change?
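
A small sketch (with a toy model, not the poster’s) showing why the children change after wrapping:

import torch.nn as nn

# Toy model with two top-level children, standing in for the real one.
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
print(len(list(model.children())))          # 2

model = nn.DataParallel(model)
print(len(list(model.children())))          # 1 -> the wrapped 'module'
print(len(list(model.module.children())))   # 2 -> the original children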


I tried to change the number of frozen layers of a vgg16 model. When I used one GPU, I could see that the model has two children and I could fine-tune only certain layers. But when I use nn.DataParallel, I don’t see the same structure and cannot fine-tune some layers. Please let me know the solution, if any.

Could you post some code showing how you are freezing the layers and what doesn’t work with nn.DataParallel?


Hello :slight_smile:

I have the same issue with nn.DataParallel; as you can see below, the usage is imbalanced. So I am not able to increase the batch size.

An imbalance in nn.DataParallel might be seen due to the scattering and gathering of the inputs etc. on the default device, as described in the blog post linked in point 4.

My model is

import torch as t
import torch.nn as nn
from torchvision import models

class salnet(nn.Module):
    def __init__(self):
        super(salnet, self).__init__()

        vgg16 = models.vgg16(pretrained=True)

        encoder = list(vgg16.features.children())[:-1]
        self.encoder = nn.Sequential(*encoder)
        #for param in encoder.parameters():
            #param.requires_grad = False
        self.decoder = nn.Conv2d(512, 1, 1, padding=0, bias=False)

    def forward(self, x):
        e_x = self.encoder(x)
        d_x = self.decoder(e_x)
        #e_x = nn.functional.interpolate(e_x, size=(480,640), mode='bilinear', align_corners=False)
        d_x = nn.functional.interpolate(d_x, size=(360,640), mode='bilinear', align_corners=False)
        d_x = d_x.squeeze(1)
        mi = t.min(d_x.view(-1, 360*640), 1)[0].view(-1, 1, 1)
        ma = t.max(d_x.view(-1, 360*640), 1)[0].view(-1, 1, 1)
        n_x = (d_x - mi) / (ma - mi)
        return e_x, n_x

Now, I am freezing some layers like:

child_counter = 0
for child in model.children():
    print("child", child_counter, "is:")
    print(child)
    child_counter += 1
    print("=======")
#for child in model.children():
    #for param in child.parameters():
        #print(param)
        #break
    #break

child_counter = 0
for child in model.children():
    if child_counter == 0:
        children_of_child_counter = 0
        for children_of_child in child.children():
            if (children_of_child_counter > 16) and (children_of_child_counter < 30):
                for param in children_of_child.parameters():
                    param.requires_grad = True
                print('child', children_of_child_counter, 'of child', child_counter, 'is not frozen')
                children_of_child_counter += 1
                #children_of_child_counter += 1
            elif (children_of_child_counter < 17):
                print('child', children_of_child_counter, 'of child', child_counter, 'is frozen')
                children_of_child_counter += 1
        child_counter += 1
    elif child_counter == 1:
        for param in child.parameters():
            param.requires_grad = True
        print("child", child_counter, "is not frozen")

When I do this freezing without nn.DataParallel, it gives the following output:

child 0 is:
Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace=True)
(7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU(inplace=True)
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace=True)
(12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): ReLU(inplace=True)
(14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): ReLU(inplace=True)
(16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(18): ReLU(inplace=True)
(19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(20): ReLU(inplace=True)
(21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(22): ReLU(inplace=True)
(23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(25): ReLU(inplace=True)
(26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(27): ReLU(inplace=True)
(28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(29): ReLU(inplace=True)
)

child 1 is:
Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)

child 0 of child 0 is frozen
child 1 of child 0 is frozen
child 2 of child 0 is frozen
child 3 of child 0 is frozen
child 4 of child 0 is frozen
child 5 of child 0 is frozen
child 6 of child 0 is frozen
child 7 of child 0 is frozen
child 8 of child 0 is frozen
child 9 of child 0 is frozen
child 10 of child 0 is frozen
child 11 of child 0 is frozen
child 12 of child 0 is frozen
child 13 of child 0 is frozen
child 14 of child 0 is frozen
child 15 of child 0 is frozen
child 16 of child 0 is frozen
child 17 of child 0 is not frozen
child 18 of child 0 is not frozen
child 19 of child 0 is not frozen
child 20 of child 0 is not frozen
child 21 of child 0 is not frozen
child 22 of child 0 is not frozen
child 23 of child 0 is not frozen
child 24 of child 0 is not frozen
child 25 of child 0 is not frozen
child 26 of child 0 is not frozen
child 27 of child 0 is not frozen
child 28 of child 0 is not frozen
child 1 is not frozen

But when I use nn.DataParallel, it shows both children under the Sequential() model.
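
Based on the earlier point that nn.DataParallel wraps everything into model.module, here is a hedged sketch of how the same child traversal could be adapted after wrapping (reusing the salnet class from above; not a confirmed fix from the thread, and the unfreezing rule below is just an example):

import torch.nn as nn

model = nn.DataParallel(salnet())   # wrapped model from the post above

# The original two children (encoder, decoder) now live under model.module.
inner = model.module
for i, child in enumerate(inner.children()):
    print("child", i, "is:", type(child).__name__)

# Freeze/unfreeze exactly as before, but iterate inner.children()
# instead of model.children().
for i, child in enumerate(inner.children()):
    for param in child.parameters():
        param.requires_grad = (i == 1)   # e.g. keep only the decoder trainable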

Hi, how do I parallelize 2 network architectures across multiple GPUs? I have 2 networks, let’s say A and B. I tried to use this code.

modelA = torch.nn.DataParallel(modelA).cuda(GPU_ID)
modelB = torch.nn.DataParallel(modelB).cuda(GPU_ID)

Then I got this error.
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

How do I do this? Should I merge the model architectures into one? Let’s say My_Network contains ModelA and ModelB, so they would be accessed as My_Network.ModelA and My_Network.ModelB.

Could you check if the models are running fine in isolation, i.e. modelA alone and afterwards modelB alone?
I don’t think the error is raised because you are using two models, but it might be raised if one of the models gets an input tensor that is stored on the wrong device.
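
A hedged sketch of one way to set this up, based on the error message above (nn.DataParallel expects parameters and buffers on device_ids[0]); the models here are placeholders:

import torch
import torch.nn as nn

device_ids = [0, 1]            # adjust to the visible GPUs

modelA = nn.Linear(10, 10)     # placeholders for the real models
modelB = nn.Linear(10, 10)

# Put each model's parameters on device_ids[0] before wrapping, then make
# sure the input tensors are also moved to that device.
modelA = nn.DataParallel(modelA.cuda(device_ids[0]), device_ids=device_ids)
modelB = nn.DataParallel(modelB.cuda(device_ids[0]), device_ids=device_ids)

x = torch.randn(8, 10).cuda(device_ids[0])
out = modelB(modelA(x))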