Split a single model across multiple GPUs

This makes sure that the outputs will be gathered on GPU0, which will calculate the loss and scatter it to the replicas again.
The general method is beautifully explained in this blog post.
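As a minimal sketch of that flow (a toy model of my own, not the poster's code): with multiple GPUs, nn.DataParallel scatters the input batch across replicas, runs the forward passes in parallel, and gathers the outputs on the first device, where the loss is computed. Without any GPU, the wrapper simply runs the underlying module, so the snippet also works on CPU.

```python
import torch
import torch.nn as nn

# Wrap a small module in nn.DataParallel. On a multi-GPU machine the batch
# is scattered across replicas and the outputs are gathered on the first
# device (GPU0 by default); on a CPU-only machine the wrapper falls back to
# calling the wrapped module directly.
model = nn.DataParallel(nn.Linear(10, 2))

x = torch.randn(16, 10)
out = model(x)        # on a multi-GPU box this tensor lives on the first device
out.sum().backward()  # populates gradients on the wrapped module's parameters
print(out.shape)      # torch.Size([16, 2])
```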


Thanks for providing this solution.

I want to ask whether large_submodule1 and large_submodule2 should have different optimizers.

Thank you!

Not necessarily, but it depends on your use case of course.
You can pass all model parameters to a single optimizer, define per-parameter options or use different optimizers. The split in submodules does not limit your options and you could handle the submodules as single layers.
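For illustration, here is a sketch of the three options using two stand-in layers (the names are made up for the example, not taken from the original code):

```python
import torch
import torch.nn as nn

sub1 = nn.Linear(10, 10)  # stands in for large_submodule1
sub2 = nn.Linear(10, 10)  # stands in for large_submodule2

# Option 1: a single optimizer over all parameters
opt_all = torch.optim.SGD(
    list(sub1.parameters()) + list(sub2.parameters()), lr=1e-2)

# Option 2: one optimizer with per-parameter options,
# e.g. a lower learning rate for the second submodule
opt_groups = torch.optim.SGD([
    {'params': sub1.parameters()},
    {'params': sub2.parameters(), 'lr': 1e-3},
], lr=1e-2)

# Option 3: separate optimizers, stepped independently
opt1 = torch.optim.SGD(sub1.parameters(), lr=1e-2)
opt2 = torch.optim.Adam(sub2.parameters(), lr=1e-3)
```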


Got it, thank you!

BTW, I'm facing a similar problem to the one in this discussion: the GPU memory usage on the first GPU is much higher than it should be.

I can't figure out what's wrong with it. Do you know where the problem might be?
Thank you!

Have a look at my previous post with the link to the blog post of @Thomas_Wolf:

He explains why this imbalanced memory usage is happening and also gives some workarounds.


Thanks for your contribution, @ptrblck.

I think @Ashwin_Raju would like to use model sharding.

Could you elaborate further on, or give some reference for, what model sharding is? I'm interested in it. Thanks!

Thank you for your reply!

However, my question is not about the 'unbalanced GPU usage'; it is about the GPU memory allocation.

"
I notice that when I split the whole model across 4 GPUs and do forward/backward, the GPU memory on the first GPU costs much more than it should. For example, if the whole model costs 12GB on a single GPU, when split across four GPUs, the first GPU costs 11GB and the others together cost about 11GB.
Is there an explanation for how GPU memory is allocated when using multiple GPUs for model parallelism?
"

I still don't know how to deal with this problem; I hope you can help me!

Thank you!

Thanks for the clarification. I now understand you are sharding your model, i.e. some layers are on GPU0, the next ones on GPU1, etc. With this approach the combined memory usage of all GPUs is approx. 2x the memory usage of the same model on a single GPU using the same setup (e.g. the same batch size).
If I understood your issue correctly, could you post a (small) reproducible code snippet so that we could have a look?
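To see the per-device numbers for yourself, here is a sketch (my own toy layers, not your model) that inspects how much memory each device holds after a forward/backward pass through a sharded model. It is guarded, since it is only meaningful with at least two GPUs:

```python
import torch
import torch.nn as nn

# Shard two layers across two devices, run forward/backward, and print how
# much memory each device has allocated. On a machine with fewer than two
# GPUs, just report that the check cannot run.
if torch.cuda.device_count() >= 2:
    layer0 = nn.Linear(1024, 1024).to('cuda:0')
    layer1 = nn.Linear(1024, 1024).to('cuda:1')

    x = torch.randn(64, 1024, device='cuda:0')
    out = layer1(layer0(x).to('cuda:1'))  # move activations between shards
    out.sum().backward()

    for dev in range(2):
        print('cuda:{}: {} bytes allocated'.format(
            dev, torch.cuda.memory_allocated(dev)))
else:
    print('Need at least two GPUs to inspect sharded memory usage.')
```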


Hi Paulo,

yeah, I just realized that I misunderstood @Ashwin_Raju so thanks for pointing it out.
Model sharding or model parallelism refers to splitting the model onto several GPUs, so the forward and backward passes go through all devices.
You could use it, e.g. if you have a huge model and can use multiple GPUs.
A simple dummy case is given in this previous post.


I think there is a typo in the if statement in the __init__ function.
I guess the second line is wrong and the submodule number should be 2:
self.large_submodule2.cuda(1)

Yes, thanks for catching it! :slight_smile:

Hi thank you for your example!
But it does not work; it fails with this error:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)

with code

import torch
import torch.nn as nn

class SubModule(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(SubModule, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, 1, 1)
        
    def forward(self, x):
        print('SubModule, device: {}, shape: {}\n'.format(x.device, x.shape))
        x = self.conv1(x)
        return x


class MyModel(nn.Module):
    def __init__(self, split_gpus, parallel):
        super(MyModel, self).__init__()
        self.module1 = SubModule(3, 6)
        self.module2 = SubModule(6, 1)
        
        self.split_gpus = split_gpus
        self.parallel = parallel
        if self.split_gpus and self.parallel:
            self.module1 = self.module1.to('cuda:0')
            self.module2 = self.module2.to('cuda:1')
        
    def forward(self, x):
        print('Input: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module1(x)
        print('After module1: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module2(x)
        print('After module2: device {}, shape {}\n'.format(x.device, x.shape))
        return x


model = MyModel(split_gpus=True, parallel=True)
x = torch.randn(16, 3, 24, 24).to('cuda:0')
output = model(x)

Should I add x.to(‘cuda:1’)
before Submodule2?

Since you removed the nn.DataParallel wrappers, you would need to manually push the tensors to the expected device via:

    def forward(self, x):
        print('Input: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module1(x)
        print('After module1: device {}, shape {}\n'.format(x.device, x.shape))
        x = x.to('cuda:1')
        x = self.module2(x)
        print('After module2: device {}, shape {}\n'.format(x.device, x.shape))
        return x
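Putting it all together, here is a self-contained sketch of the sharded model with that fix applied (a rewritten toy version, with a CPU fallback so the control flow can be checked on any machine):

```python
import torch
import torch.nn as nn

# A model split over two devices, with the explicit transfer in forward.
# When fewer than two GPUs are available, both "devices" fall back to CPU.
class ShardedModel(nn.Module):
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev1 = dev1
        self.module1 = nn.Conv2d(3, 6, 3, 1, 1).to(dev0)
        self.module2 = nn.Conv2d(6, 1, 3, 1, 1).to(dev1)

    def forward(self, x):
        x = self.module1(x)
        x = x.to(self.dev1)  # push activations to the second device
        return self.module2(x)

if torch.cuda.device_count() >= 2:
    dev0, dev1 = 'cuda:0', 'cuda:1'
else:
    dev0 = dev1 = 'cpu'

model = ShardedModel(dev0, dev1)
x = torch.randn(16, 3, 24, 24, device=dev0)
out = model(x)
out.sum().backward()  # autograd routes gradients back across the transfer
print(out.shape)      # torch.Size([16, 1, 24, 24])
```

Note that no extra work is needed in the training loop: autograd records the cross-device `.to()` call and moves the gradients back automatically during backward.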

Hi Guys,
I wonder how everything you discussed in relation to model sharding and data parallelism will be affected when using NVLink (for example, connecting two GPUs with an NVLink bridge). Will it just make things faster, or will it open new scenarios?

It will speed up your workloads.

Hi Peter,

I am also trying to shard my model across different GPUs. But I am not sure it's really efficient, since I am using over 6000 custom layers and execute them in a for loop. All layers are independent of each other and could run in parallel. Here is my original forum question. Maybe you can help?