Split a single model across multiple GPUs

Since my entire module is computationally intensive, I would like to run half of the work on one GPU and the other half on another GPU. I will look into these corrections now.

Sounds logical, but note that you won’t see any performance gain when your model is built in a sequential manner like in your example. The GPUs will have to wait for each other, since they cannot start working without the output of the preceding stage.

In that case, you might follow @voxmenthe 's suggestion of using DataParallel.


Yeah, I agree. Your solution works.

Can we use both DataParallel and splitting the model across different GPUs?

If I do something like this:

# GPU 0:
loss_c = a * b 

# GPU 1:
loss_f = d * e 

And then I add them together (converting one output to the other’s GPU):

total_loss = loss_c + loss_f.cuda(0)

Then when I run backward() on total_loss, will the backward pass be split across both GPUs, or will it take place only on GPU 0?
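A minimal sketch of the situation in the question, assuming two CUDA devices are visible (it falls back to the CPU otherwise, so it still runs). The tensor names follow the question; the shapes and the `.sum()` reduction are assumptions, added so that `backward()` starts from a scalar. Cross-device copies are recorded by autograd, so each gradient should end up on the device where its leaf tensor lives:

```python
import torch

# Use two GPUs if available; otherwise fall back to the CPU so the sketch still runs.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device('cuda:0'), torch.device('cuda:1')
else:
    dev0 = dev1 = torch.device('cpu')

a = torch.randn(4, device=dev0, requires_grad=True)
b = torch.randn(4, device=dev0)
d = torch.randn(4, device=dev1, requires_grad=True)
e = torch.randn(4, device=dev1)

loss_c = (a * b).sum()  # computed on dev0
loss_f = (d * e).sum()  # computed on dev1

# The device copy is differentiable, so the graph spans both devices.
total_loss = loss_c + loss_f.to(dev0)
total_loss.backward()

# Each gradient is stored on the same device as its leaf tensor.
print(a.grad.device, d.grad.device)
```

The part of the backward pass that belongs to `loss_f` runs on the second device, and only the copy itself crosses between the two.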

This is a bit tricky, but is possible.
I’ve created a small code example, which uses model sharding and DataParallel.
It’s using 4 GPUs, where each submodule is wrapped in DataParallel and runs on 2 GPUs:

import torch
import torch.nn as nn

class SubModule(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(SubModule, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, 1, 1)
    def forward(self, x):
        print('SubModule, device: {}, shape: {}\n'.format(x.device, x.shape))
        x = self.conv1(x)
        return x

class MyModel(nn.Module):
    def __init__(self, split_gpus, parallel):
        super(MyModel, self).__init__()
        self.module1 = SubModule(3, 6)
        self.module2 = SubModule(6, 1)
        self.split_gpus = split_gpus
        self.parallel = parallel
        if self.split_gpus and self.parallel:
            self.module1 = nn.DataParallel(self.module1, device_ids=[0, 1]).to('cuda:0')
            self.module2 = nn.DataParallel(self.module2, device_ids=[2, 3]).to('cuda:2')
    def forward(self, x):
        print('Input: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module1(x)
        print('After module1: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module2(x)
        print('After module2: device {}, shape {}\n'.format(x.device, x.shape))
        return x

model = MyModel(split_gpus=True, parallel=True)
x = torch.randn(16, 3, 24, 24).to('cuda:0')
output = model(x)

The script will output:

Input: device cuda:0, shape torch.Size([16, 3, 24, 24])

SubModule, device: cuda:0, shape: torch.Size([8, 3, 24, 24])

SubModule, device: cuda:1, shape: torch.Size([8, 3, 24, 24])

After module1: device cuda:0, shape torch.Size([16, 6, 24, 24])

SubModule, device: cuda:2, shape: torch.Size([8, 6, 24, 24])

SubModule, device: cuda:3, shape: torch.Size([8, 6, 24, 24])

After module2: device cuda:2, shape torch.Size([16, 1, 24, 24])

EDIT: As you can see, I only implemented this one use case, so the conditions on self.split_gpus and self.parallel are a bit useless. However, this should give you a starting point for your code.


Thanks to @ptrblck for the great solution.

I just want to add a reminder that you also need to move the predicted output tensor and the ground-truth target tensor onto the same GPU for the loss function.

When the loss function calculates the difference between target and prediction tensor, it needs to have both of them on the same GPU.
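A small sketch of that reminder. The shapes match the output of the example above, but `out_dev`, the random tensors, and the choice of `nn.MSELoss` are assumptions for illustration only (in the 4-GPU example the output would be on cuda:2; the sketch falls back to the CPU if no GPU is present):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the device on which the model's output ends up.
out_dev = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

output = torch.randn(16, 1, 24, 24, device=out_dev, requires_grad=True)
target = torch.randn(16, 1, 24, 24)  # e.g. freshly loaded, still on the CPU

criterion = nn.MSELoss()
# Move the target to wherever the prediction lives before computing the loss.
loss = criterion(output, target.to(output.device))
loss.backward()
```

Using `output.device` instead of a hard-coded device name keeps the loss code unchanged when the placement of the last submodule changes.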

Hi voxmenthe, I think your solution is the answer that most people are looking for!

Thank you very much! I solved my problem simply by using this code.


Hi, thanks for your solution, but when I do this I get an error in my loss function:
buffer[torch.eq(target, -1.)] = 0
RuntimeError: invalid argument 2: sizes do not match at /opt/conda/conda-bld/pytorch_1512946747676/work/torch/lib/THC/generated/…/generic/THCTensorMasked.cu:13
This is not an error in my code but one that appears after enabling data parallelism (I ran my less intensive code both with and without data parallelism, and it throws the same error when data parallelism is used).

My model is intensive and I have 2 GPUs with 12206MiB each. I just need to split my model to use both GPUs during training as well as testing.


BTW, my model is an FCN and its batch size is 1.

Hi Rao,

I guess the problem is that your model is too large to fit on a single GPU.

You need to split the model directly across different GPUs when writing it from scratch; splitting the training data won’t help. Check the thread below.

Sorry, I don’t have experience writing a multi-GPU training model.


Can you explain what this means? You are using GPUs 0 and 1 for DataParallel, so what does .to('cuda:0') do?

This makes sure that the outputs will be gathered on GPU0, which will calculate the loss and scatter it to the replicas again.
The general method is beautifully explained in this blog post.


Thanks for providing this solution.

I want to ask whether large_submodule1 and large_submodule2 should have different optimizers.

Thank you!

Not necessarily, but it depends on your use case of course.
You can pass all model parameters to a single optimizer, define per-parameter options or use different optimizers. The split in submodules does not limit your options and you could handle the submodules as single layers.
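The three options could look roughly like this. It is only a sketch: `module1` and `module2` are hypothetical stand-ins for the two submodules, and the learning rates and optimizer types are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical two-part model; in a sharded setup each part could live on its own GPU.
module1 = nn.Conv2d(3, 6, 3, 1, 1)
module2 = nn.Conv2d(6, 1, 3, 1, 1)

# Option 1: one optimizer over all parameters.
opt_all = torch.optim.SGD(
    list(module1.parameters()) + list(module2.parameters()), lr=0.01)

# Option 2: one optimizer with per-parameter options (parameter groups).
opt_groups = torch.optim.SGD([
    {'params': module1.parameters()},               # uses the default lr
    {'params': module2.parameters(), 'lr': 0.001},  # overrides the default lr
], lr=0.01)

# Option 3: a separate optimizer per submodule.
opt1 = torch.optim.SGD(module1.parameters(), lr=0.01)
opt2 = torch.optim.Adam(module2.parameters(), lr=0.001)
```

With option 3 you would call `zero_grad()` and `step()` on both optimizers in every iteration.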

Got it, thank you!

BTW, I’m facing a similar problem to the one in this discussion: the first GPU uses much more memory than it should.

I can’t figure out what’s wrong. Do you know where the problem might be?
Thank you!

Have a look at my previous post with the link to the blog post of @Thomas_Wolf:

He explains why this imbalanced memory usage is happening and also gives some workarounds.

Thanks for your contribution, @ptrblck.

I think @Ashwin_Raju would like to use model sharding

Could you elaborate on, or give some reference for, what model sharding is? I’m interested in it. Thanks!

Thank you for your reply!

However, my question is not about the ‘unbalanced GPU usage’; it is about the GPU memory allocated.

I noticed that when I split the whole model across 4 GPUs and run forward/backward, the first GPU uses much more memory than it should. For example, if the whole model takes 12GB on a single GPU, then after splitting it across four GPUs the first GPU takes 11GB and the others together take about 11GB.
Is there an explanation of how GPU memory is allocated when using multiple GPUs for model parallelism?

I still don’t know how to deal with the problem and hope you can help me!

Thank you!

Thanks for the clarification. I now understand you are sharding your model, i.e. some layers are on GPU0, the next ones on GPU1, etc. Using this approach, the combined memory usage of all GPUs is approx. 2x the memory usage of the same model on a single GPU using the same setup (e.g. same batch size etc.).
If I understood your issue correctly, could you post a (small) reproducible code snippet so that we could have a look?
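For such a snippet, a small helper like the one below might make the per-device numbers easier to compare. It is a sketch only: `report_gpu_memory` is a made-up name, and it reports only memory occupied by tensors through PyTorch's allocator, so the figures will be lower than what nvidia-smi shows (which also includes the CUDA context and cached blocks):

```python
import torch

def report_gpu_memory():
    """Print and return current and peak allocated memory (MiB) per visible GPU."""
    stats = []
    for i in range(torch.cuda.device_count()):
        current = torch.cuda.memory_allocated(i) / 1024 ** 2
        peak = torch.cuda.max_memory_allocated(i) / 1024 ** 2
        stats.append((i, current, peak))
        print('cuda:{}: current {:.1f} MiB, peak {:.1f} MiB'.format(i, current, peak))
    return stats

# Call this after the forward pass and again after backward() to see where memory grows.
stats = report_gpu_memory()
```

On a machine without GPUs the loop simply does not run and an empty list is returned.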

Hi Paulo,

yeah, I just realized that I misunderstood @Ashwin_Raju so thanks for pointing it out.
Model sharding or model parallelism refers to splitting the model onto several GPUs, so that the forward and backward passes go through all devices.
You could use it, e.g. if you have a huge model and can use multiple GPUs.
A simple dummy case is given in this previous post.
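As a rough sketch of what such a dummy case might look like (not necessarily the code from the linked post): `ShardedModel` is a hypothetical name, the layer sizes are borrowed from the DataParallel example earlier in the thread, and the devices fall back to the CPU when fewer than two GPUs are visible:

```python
import torch
import torch.nn as nn

class ShardedModel(nn.Module):
    """Hypothetical two-layer model with each layer placed on its own device."""
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Conv2d(3, 6, 3, 1, 1).to(dev0)
        self.part2 = nn.Conv2d(6, 1, 3, 1, 1).to(dev1)

    def forward(self, x):
        x = self.part1(x.to(self.dev0))
        # Hand the activation over to the second device.
        x = self.part2(x.to(self.dev1))
        return x

if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device('cuda:0'), torch.device('cuda:1')
else:
    dev0 = dev1 = torch.device('cpu')

model = ShardedModel(dev0, dev1)
output = model(torch.randn(16, 3, 24, 24))
output.sum().backward()  # gradients flow back through both devices
print(output.shape, output.device)
```

Since the second part needs the output of the first, the two devices work one after the other here, so this helps with memory rather than speed.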