Run PyTorch on Multiple GPUs

Hi,

When I use DataParallel to make my model run on two GPUs, the model seems to change: its children structure is different. With a single GPU I can see two children, but after wrapping the model in DataParallel I can see only one child.
Can someone please clarify this?
Thank you.

nn.DataParallel wraps the model into model.module. Could this explain the observed change?
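A small sketch of this wrapping (the toy model below is just a placeholder, not your actual model): after wrapping, the original children are still there, one level deeper under .module:

import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 1))
print(len(list(model.children())))          # 2 children

model = nn.DataParallel(model)
print(len(list(model.children())))          # 1 child: the wrapped module
print(len(list(model.module.children())))   # 2 again, via model.module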


I tried to change the number of frozen layers of a vgg16 model. When I used one GPU, I could see that the model has two children and I could fine-tune only certain layers. But when I use nn.DataParallel, I no longer see the same structure and cannot fine-tune those layers. Please let me know the solution, if any.

Could you post some code showing how you are freezing the layers and what doesn’t work with nn.DataParallel?


Hello,

I have the same issue with nn.DataParallel: as you can see below, the memory usage across the GPUs is imbalanced, so I am not able to increase the batch size.

An imbalance in nn.DataParallel can be expected, since the scattering and gathering of the inputs, outputs, etc. happens on the default device, as described in the blog post linked in point 4.

My model is:

import torch
import torch.nn as nn
from torchvision import models

class salnet(nn.Module):
    def __init__(self):
        super(salnet, self).__init__()

        vgg16 = models.vgg16(pretrained=True)

        # use all VGG16 feature layers except the last max pooling as the encoder
        encoder = list(vgg16.features.children())[:-1]
        self.encoder = nn.Sequential(*encoder)
        #for param in encoder.parameters():
        #    param.requires_grad = False
        self.decoder = nn.Conv2d(512, 1, 1, padding=0, bias=False)

    def forward(self, x):
        e_x = self.encoder(x)
        d_x = self.decoder(e_x)
        #e_x = nn.functional.interpolate(e_x, size=(480, 640), mode='bilinear', align_corners=False)
        d_x = nn.functional.interpolate(d_x, size=(360, 640), mode='bilinear', align_corners=False)
        d_x = d_x.squeeze(1)
        # min-max normalize each predicted map to [0, 1]
        mi = torch.min(d_x.view(-1, 360*640), 1)[0].view(-1, 1, 1)
        ma = torch.max(d_x.view(-1, 360*640), 1)[0].view(-1, 1, 1)
        n_x = (d_x - mi) / (ma - mi)
        return e_x, n_x

Now, I am freezing some layers like:

# print the children to inspect the model structure
child_counter = 0
for child in model.children():
    print("child", child_counter, "is:")
    print(child)
    child_counter += 1
    print("=======")

# unfreeze only the deeper encoder layers (17-29) and the decoder
child_counter = 0
for child in model.children():
    if child_counter == 0:
        children_of_child_counter = 0
        for children_of_child in child.children():
            if (children_of_child_counter > 16) and (children_of_child_counter < 30):
                for param in children_of_child.parameters():
                    param.requires_grad = True
                print('child', children_of_child_counter, 'of child', child_counter, 'is not frozen')
                children_of_child_counter += 1
            elif children_of_child_counter < 17:
                print('child', children_of_child_counter, 'of child', child_counter, 'is frozen')
                children_of_child_counter += 1
        child_counter += 1
    elif child_counter == 1:
        for param in child.parameters():
            param.requires_grad = True
        print("child", child_counter, "is not frozen")

When I do this freezing without nn.DataParallel, I get the following output:

child 0 is:
Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace=True)
(7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU(inplace=True)
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace=True)
(12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): ReLU(inplace=True)
(14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): ReLU(inplace=True)
(16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(18): ReLU(inplace=True)
(19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(20): ReLU(inplace=True)
(21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(22): ReLU(inplace=True)
(23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(25): ReLU(inplace=True)
(26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(27): ReLU(inplace=True)
(28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(29): ReLU(inplace=True)
)

child 1 is:
Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)

child 0 of child 0 is frozen
child 1 of child 0 is frozen
child 2 of child 0 is frozen
child 3 of child 0 is frozen
child 4 of child 0 is frozen
child 5 of child 0 is frozen
child 6 of child 0 is frozen
child 7 of child 0 is frozen
child 8 of child 0 is frozen
child 9 of child 0 is frozen
child 10 of child 0 is frozen
child 11 of child 0 is frozen
child 12 of child 0 is frozen
child 13 of child 0 is frozen
child 14 of child 0 is frozen
child 15 of child 0 is frozen
child 16 of child 0 is frozen
child 17 of child 0 is not frozen
child 18 of child 0 is not frozen
child 19 of child 0 is not frozen
child 20 of child 0 is not frozen
child 21 of child 0 is not frozen
child 22 of child 0 is not frozen
child 23 of child 0 is not frozen
child 24 of child 0 is not frozen
child 25 of child 0 is not frozen
child 26 of child 0 is not frozen
child 27 of child 0 is not frozen
child 28 of child 0 is not frozen
child 1 is not frozen

But when I use nn.DataParallel, both children are shown under a single wrapped module, so the loops above no longer match the structure.
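A possible way to keep the freezing loops working, assuming the model was wrapped with model = nn.DataParallel(model): iterate over model.module.children(), since the original two children now sit one level deeper:

# unwrap the DataParallel container before walking the children;
# model.module is the original salnet instance
base_model = model.module if isinstance(model, nn.DataParallel) else model

for child_counter, child in enumerate(base_model.children()):
    print("child", child_counter, "is:")
    print(child)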

Hi, how do I parallelize 2 network architectures onto multiple GPUs? I have 2 networks, let's say A and B. I tried to use this code.

modelA = torch.nn.DataParallel(modelA).cuda(GPU_ID)
modelB = torch.nn.DataParallel(modelB).cuda(GPU_ID)

Then I got this error.
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

How can I do this? Should I merge the two architectures into one model? Let's say My_Network contains ModelA and ModelB, so I would access them as My_Network.ModelA and My_Network.ModelB.

Could you check if the models run fine in your code in isolation, i.e. modelA alone and afterwards modelB alone?
I don’t think the error is raised because you are using two models; it might be raised because one of the models gets an input tensor that is stored on the wrong device.
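For reference, a small setup sketch that keeps everything on device_ids[0] (the GPU ids and the input shape are assumptions): the quoted error is raised when parameters or inputs end up on a device other than device_ids[0], so moving the wrapped models and the input batch consistently avoids it:

device_ids = [0, 1]   # GPUs to use; device_ids[0] is where parameters must live

modelA = torch.nn.DataParallel(modelA, device_ids=device_ids).cuda(device_ids[0])
modelB = torch.nn.DataParallel(modelB, device_ids=device_ids).cuda(device_ids[0])

x = torch.randn(8, 3, 224, 224).cuda(device_ids[0])  # input on device_ids[0]
outA = modelA(x)
outB = modelB(x)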

Hello everybody,
Being new to the forum, I hope I'm not posting in the wrong place.

I am trying to tackle a scientific computing problem which cannot be formulated as an nn.Module, at least not in a straightforward way. However, I would like to run it on multiple GPUs.

Currently, I can populate tensors on a GPU, but only a single one. Is there a wrapper function which, e.g., distributes my data across multiple GPUs automatically, or is this only implemented for nn layers?

Best wishes and thank you in advance

You could have a look at the functional parallel operations and try to use them manually.
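A rough sketch of such a manual approach (my_computation and the tensor shapes are placeholders): the functional helpers in torch.nn.parallel can split a tensor across devices and collect the results again without any nn.Module:

import torch
from torch.nn.parallel import scatter, gather

def my_computation(x):                 # placeholder for the actual computation
    return x.pow(2).sum(dim=1)

device_ids = list(range(torch.cuda.device_count()))
data = torch.randn(1024, 512)

chunks = scatter(data, device_ids)     # split along dim 0, one chunk per GPU
# CUDA kernel launches are asynchronous, so the per-chunk work can overlap
results = [my_computation(chunk) for chunk in chunks]
out = gather(results, device_ids[0])   # collect the results on the first GPU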

Wrapping your model in nn.DataParallel is an easy way to use your GPUs.
Have a look at the parallelism tutorial.

@ptrblck sorry for making this conversation longer. But I just want to be 100% sure:

  1. From all the tutorials you sent, I assume that if there are multiple GPUs available, PyTorch only ever uses one at a time unless one uses nn.DataParallel. Is that correct?

  2. If I run multiple jobs on the same machine with multiple GPUs, will they automatically be allocated to different GPUs? Is that right?

  1. You have to explicitly use multiple GPUs either by using nn.DataParallel, nn.DistributedDataParallel or a manual approach. Otherwise you are correct, PyTorch will not use multiple GPUs (or even a single GPU) by default.

  2. If you specify different device ids (via model.to('cuda:X'), where X is the GPU id) or mask the device via CUDA_VISIBLE_DEVICES=X, each script will only use the specified device. This will not be done automatically, as the default device would be the same (cuda:0).
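As a small sketch for the second option, assuming a script that should be pinned to GPU 1 (the tiny model is a placeholder):

import torch

device = torch.device('cuda:1')
model = torch.nn.Linear(10, 10).to(device)   # placeholder model
data = torch.randn(4, 10, device=device)     # inputs on the same device
out = model(data)

# alternatively, mask the devices when launching the script, so the process
# only sees physical GPU 1 (exposed internally as cuda:0):
#   CUDA_VISIBLE_DEVICES=1 python train.py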


Hey ptrblck. Any advice for how to deal with an error such as: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 31.75 GiB total capacity; 28.35 GiB already allocated; 2.22 GiB free; 28.37 GiB reserved in total by PyTorch).
I know this means the model is very big. Previously, I solved this by spreading the layers across 2 devices, as shown:

def forward(self, inp):
    x1 = self.inc(inp.to('cuda:0'))
    sq1 = self.sq1(x1.to('cuda:0'))
    x2 = self.down1(x1.to('cuda:0'))
    sq2 = self.sq2(x2.to('cuda:0'))
    x3 = self.down2(x2.to('cuda:0'))
    sq3 = self.sq3(x3.to('cuda:0'))
    x4 = self.down3(x3.to('cuda:0'))
    sq4 = self.sq4(x4.to('cuda:0'))
    x5 = self.down4(x4.to('cuda:0'))
    sq5 = self.sq5(x5).to('cuda:0')
    x6 = self.up1(sq5, sq4).to('cuda:0')
    sq6 = self.sq6(x6).to('cuda:1')
    x7 = self.up2(sq6.to('cuda:0'), sq3).to('cuda:0')
    sq7 = self.sq7(x7).to('cuda:1')
    x8 = self.up3(sq7.to('cuda:0'), sq2).to('cuda:0')
    sq8 = self.sq8(x8).to('cuda:1')
    x = self.up4(sq8.to('cuda:0'), sq1).to('cuda:1')
    x = torch.cat([inp, x.to('cuda:0')], dim=1)
    x = self.outc(x)
    return torch.sigmoid(x)

For a new model, I don't want to go through this tedious process again. Also, I have access to more than 2 GPUs. Thank you!

Potential workarounds would be:

  • use DistributedDataParallel, feeding smaller batches to each model replica on the corresponding device
  • create blocks of submodules so that your model sharding approach is easier to implement (fewer boilerplate to('cuda:x') calls)
  • apply torch.utils.checkpoint to trade compute for memory (see the sketch after this list)
  • reduce the batch size, if possible.
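A minimal torch.utils.checkpoint sketch (the small two-block model is a made-up placeholder, and use_reentrant=False assumes a reasonably recent PyTorch release): the checkpointed block does not store its activations during the forward pass and recomputes them during backward, trading compute for memory:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        # activations of self.block are recomputed in backward instead of stored
        x = checkpoint(self.block, x, use_reentrant=False)
        return self.head(x)

model = CheckpointedNet().cuda()
out = model(torch.randn(2, 3, 256, 256, device='cuda'))
out.mean().backward()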

I don’t care how PyTorch decides to allocate the data across the GPUs. I want that to be done automatically and to avoid doing anything like model.layer.to('cuda:X'). That is fulfilled if I do model = nn.DataParallel(model) (for any model/NN architecture), right?

Yes, nn.DataParallel will automatically create model copies on the passed device_ids and will scatter the input batch in dim0 to each device. The output will be on the default device.
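Roughly like this (the tiny model and batch are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                  # any model works the same way
model = nn.DataParallel(model).cuda()     # replicas on all visible GPUs

batch = torch.randn(64, 10).cuda()        # chunked along dim 0 across the GPUs
out = model(batch)
print(out.device)                         # gathered on the default device (cuda:0)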


That is great! Thanks! I see one needs to do:

net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])

BUT I was wondering, how do we get device_ids in a robust way? Sometimes I run jobs with 1 GPU, other times with 2, etc. I want to be able to do:

net = torch.nn.DataParallel(model, device_ids=torch.get_gpu_ids())

and have my code always work. How can this be done? @ptrblck

Thanks for the help! You are always so amazingly helpful!


If you always want to use all available GPUs, you could use

device_ids=list(range(torch.cuda.device_count()))

If I understand correctly, to run on all available GPUs you should use:

net = torch.nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))

and if you want to use a specific set of GPUs:

net = torch.nn.DataParallel(model, device_ids=[0,1,2,5,10,...])

Note: you actually need to go through the tutorial setup for the following to work,

but according to the tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) you should be using DistributedDataParallel, e.g.:

net = torch.nn.parallel.DistributedDataParallel(model, device_ids=list(range(torch.cuda.device_count())))
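For completeness, here is a condensed sketch of the pattern from the linked tutorial: DistributedDataParallel is typically launched with one process per GPU and device_ids=[rank], rather than a single process holding all devices (the Linear model below is just a placeholder):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).to(rank)      # placeholder model
    ddp_model = DDP(model, device_ids=[rank])

    out = ddp_model(torch.randn(20, 10, device=f"cuda:{rank}"))
    out.sum().backward()                          # gradients are all-reduced

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)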