Split single model in multiple gpus

Thanks to @ptrblck for the great solution.

I just want to add a reminder that you also need to move the predicted output tensor and the ground-truth target tensor onto the same GPU because of the loss function.

When the loss function calculates the difference between the target and prediction tensors, both of them must be on the same GPU.
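
A minimal sketch of that reminder follows. The layer, the tensor shapes, and the fallback to CPU when no GPU is visible are assumptions for illustration and do not come from the thread. The idea is to move the target to whatever device the prediction ended up on before computing the loss:

```python
import torch
import torch.nn as nn

# Assumption: the end of a split model sits on the last visible GPU (CPU if none).
n_gpus = torch.cuda.device_count()
out_device = torch.device('cuda:{}'.format(n_gpus - 1)) if n_gpus > 0 else torch.device('cpu')

model = nn.Linear(10, 2).to(out_device)   # stand-in for the last part of a split model
criterion = nn.MSELoss()

x = torch.randn(4, 10).to(out_device)
target = torch.randn(4, 2)                # created on the CPU, e.g. by a DataLoader

output = model(x)
target = target.to(output.device)         # put the target where the prediction lives
loss = criterion(output, target)
loss.backward()
```

Reading the device from `output.device` avoids hard-coding a GPU index in the training loop.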


Hi voxmenthe, I think your solution is the answer that most people are looking for!

Thank you very much! I solved my problem simply by using this code.


Hi, thanks for your solution, but when I do this I get an error in my loss function:
buffer[torch.eq(target, -1.)] = 0
RuntimeError: invalid argument 2: sizes do not match at /opt/conda/conda-bld/pytorch_1512946747676/work/torch/lib/THC/generated/…/generic/THCTensorMasked.cu:13
This is not an error in my code but one that appears after enabling data parallelism (I ran my less intensive code both with and without data parallelism, and it throws the same error when data parallelism is used).

My model is memory-intensive and I have 2 GPUs with 12206MiB each. I just need to split my model so that it uses both GPUs during training as well as testing.


By the way, my model is an FCN and its batch size is 1.

Hi Rao,

I guess the problem is that your model is too large to fit on a single GPU.

You need to split the model directly across different GPUs when writing it from scratch; splitting the training data cannot help. Check the thread below.

Sorry, I don’t have experience writing a multi-GPU training model.


Can you explain what this means? So you are using GPUs 0 and 1 for DataParallel, and then what does .to('cuda:0') mean/do?

This makes sure that the outputs will be gathered on GPU0, which will calculate the loss and scatter it to the replicas again.
The general method is beautifully explained in this blog post.
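
As a rough sketch of that setup (the convolution layer, the input shape, and the CPU fallback are assumptions made for illustration, not the code from the earlier post), wrapping a module in nn.DataParallel with device_ids=[0, 1] and pushing it to 'cuda:0' could look like this:

```python
import torch
import torch.nn as nn

net = nn.Conv2d(3, 6, 3, 1, 1)
n_gpus = torch.cuda.device_count()

if n_gpus >= 2:
    # The batch is split across GPUs 0 and 1. The parameters must live on
    # device_ids[0], and the outputs are gathered on output_device, which
    # defaults to device_ids[0], i.e. cuda:0 here.
    net = nn.DataParallel(net, device_ids=[0, 1]).to('cuda:0')
    x = torch.randn(16, 3, 24, 24).to('cuda:0')
else:
    # Assumption for illustration: without two GPUs, run the plain module on the CPU.
    x = torch.randn(16, 3, 24, 24)

out = net(x)
print(out.device, out.shape)
```

With two GPUs the printed device should be cuda:0, so the loss can be computed there together with a target that was also moved to cuda:0.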


Thanks for providing this solution.

I want to ask whether large_submodule1 and large_submodule2 should have different optimizers?

Thank you!

Not necessarily, but it depends on your use case of course.
You can pass all model parameters to a single optimizer, define per-parameter options or use different optimizers. The split in submodules does not limit your options and you could handle the submodules as single layers.
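
The three options can be sketched as follows. The submodule names come from the question, while the tiny linear layers, the learning rates, and the choice of SGD and Adam are placeholder assumptions. Everything stays on the CPU here to keep the sketch short; the optimizers are built from parameter lists in the same way once the submodules have been moved to their GPUs.

```python
import torch
import torch.nn as nn

class TwoPartModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.large_submodule1 = nn.Linear(8, 16)   # placeholder for a large block
        self.large_submodule2 = nn.Linear(16, 4)   # placeholder for a large block

    def forward(self, x):
        return self.large_submodule2(torch.relu(self.large_submodule1(x)))

model = TwoPartModel()

# Option 1: one optimizer over all parameters.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Option 2: one optimizer with per-parameter options (a lower lr for submodule 2).
optimizer_groups = torch.optim.SGD([
    {'params': model.large_submodule1.parameters()},
    {'params': model.large_submodule2.parameters(), 'lr': 0.001},
], lr=0.01, momentum=0.9)

# Option 3: a separate optimizer per submodule.
opt1 = torch.optim.SGD(model.large_submodule1.parameters(), lr=0.01)
opt2 = torch.optim.Adam(model.large_submodule2.parameters(), lr=1e-3)
```

In practice only one of the three options would be used for a given training run.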


Got it, thank you!

BTW, I'm facing a problem similar to the one in this discussion: the memory usage on the first GPU is much higher than it should be.

I can’t figure out what is wrong. Do you know where the problem might be?
Thank you!

Have a look at my previous post with the link to the blog post of @Thomas_Wolf:

He explains why this imbalanced memory usage is happening and also gives some workarounds.


Thanks for your contribution, @ptrblck.

I think @Ashwin_Raju would like to use model sharding

Could you elaborate further on, or give some references for, what model sharding is? I’m interested in it. Thanks!

Thank you for your reply!

However, my question is not about the ‘unbalanced GPU usage’; it is about the amount of GPU memory allocated.

I notice that when I split the whole model across 4 GPUs and run forward/backward, the memory used on the first GPU is much higher than it should be. For example, if the whole model uses 12GB on a single GPU, then after splitting it across four GPUs the first GPU uses 11GB and the others together use about 11GB.
Is there an explanation of how GPU memory is allocated when using multiple GPUs for model parallelism?

I still don't know how to deal with the problem and hope you can help me!

Thank you!

Thanks for the clarification. I now understand you are sharding your model, i.e. some layers are on GPU0, the next ones on GPU1, etc. With this approach you are seeing that the combined memory usage of all GPUs is approx. 2x the memory usage of the same model on a single GPU using the same setup (e.g. same batch size etc.).
If I understood your issue correctly, could you post a (small) reproducible code snippet so that we could have a look?
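
When putting such a snippet together, per-device numbers can help. The helper below is a hypothetical sketch (the name report_gpu_memory and the output format are made up for illustration) that reads the memory PyTorch has allocated for tensors on each visible GPU:

```python
import torch

def report_gpu_memory(tag=''):
    """Print and return (allocated, peak) tensor memory in bytes for each visible GPU."""
    stats = []
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i)     # currently held by tensors
        peak = torch.cuda.max_memory_allocated(i)      # highest value seen so far
        stats.append((allocated, peak))
        print('{} cuda:{}: allocated {:.1f} MiB, peak {:.1f} MiB'.format(
            tag, i, allocated / 2**20, peak / 2**20))
    return stats

# Call it e.g. after building the model and again after a forward/backward pass.
stats = report_gpu_memory('after forward/backward')
```

These counters only cover tensors managed by PyTorch's allocator, so tools such as nvidia-smi, which also include the CUDA context and cached blocks, will usually show larger values.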


Hi Paulo,

yeah, I just realized that I misunderstood @Ashwin_Raju so thanks for pointing it out.
Model sharding or model parallelism refers to splitting the model onto several GPUs, so the forward and backward graph are going through all devices.
You could use it, e.g. if you have a huge model and can use multiple GPUs.
A simple dummy case is given in this previous post.


I think there is a typo in the if statement in the __init__ function.
I guess the second one is wrong and the submodule number should be 2:

Yes, thanks for catching it!

Hi, thank you for your example!
But it does not work; it fails with this error:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)

with this code:

import torch
import torch.nn as nn

class SubModule(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(SubModule, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, 1, 1)
    def forward(self, x):
        print('SubModule, device: {}, shape: {}\n'.format(x.device, x.shape))
        x = self.conv1(x)
        return x

class MyModel(nn.Module):
    def __init__(self, split_gpus, parallel):
        super(MyModel, self).__init__()
        self.module1 = SubModule(3, 6)
        self.module2 = SubModule(6, 1)
        self.split_gpus = split_gpus
        self.parallel = parallel
        if self.split_gpus and self.parallel:
            self.module1 = self.module1.to('cuda:0')
            self.module2 = self.module2.to('cuda:1')
    def forward(self, x):
        print('Input: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module1(x)
        print('After module1: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module2(x)
        print('After module2: device {}, shape {}\n'.format(x.device, x.shape))
        return x

model = MyModel(split_gpus=True, parallel=True)
x = torch.randn(16, 3, 24, 24).to('cuda:0')
output = model(x)

Should I add x.to('cuda:1') before the second submodule?

Since you removed the nn.DataParallel wrappers, you would need to manually push the tensors to the expected device via:

    def forward(self, x):
        print('Input: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module1(x)
        print('After module1: device {}, shape {}\n'.format(x.device, x.shape))
        x = x.to('cuda:1')
        x = self.module2(x)
        print('After module2: device {}, shape {}\n'.format(x.device, x.shape))
        return x
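
Putting the pieces together, a self-contained sketch of the sharded model might look like the following. The class name ShardedModel, the stored dev0/dev1 attributes, and the fallback to a single device when fewer than two GPUs are visible are assumptions added so the example runs anywhere; the layer sizes and input shape follow the code above.

```python
import torch
import torch.nn as nn

# Assumption: fall back to one device when fewer than two GPUs are visible.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device('cuda:0'), torch.device('cuda:1')
elif torch.cuda.is_available():
    dev0 = dev1 = torch.device('cuda:0')
else:
    dev0 = dev1 = torch.device('cpu')

class ShardedModel(nn.Module):
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.module1 = nn.Conv2d(3, 6, 3, 1, 1).to(dev0)   # first shard
        self.module2 = nn.Conv2d(6, 1, 3, 1, 1).to(dev1)   # second shard

    def forward(self, x):
        x = self.module1(x.to(self.dev0))
        x = self.module2(x.to(self.dev1))   # move the activation to the next shard
        return x

model = ShardedModel(dev0, dev1)
x = torch.randn(16, 3, 24, 24)
output = model(x)            # lives on dev1
output.sum().backward()      # gradients flow back through both devices
print(output.device, output.shape)
```

Keeping the device of each shard as an attribute means the forward pass does not depend on hard-coded strings such as 'cuda:1'.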

Hi Guys,
I wonder how everything you discussed in relation to model sharding and data parallelism will be affected when using NVLink (for example, connecting two GPUs with an NVLink bridge). Will it just make things faster, or will it open new scenarios?

It will speed up your workloads.