Split a single model across multiple GPUs

I would like to train a model that contains 2 sub-modules. I would like to train sub-model 1 on one GPU and sub-model 2 on another GPU. How would I do this in PyTorch? I tried specifying the CUDA device separately for each sub-module, but it throws an error.
Error: RuntimeError: tensors are on different GPUs

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5).cuda(device=1)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = x.cuda(device=1)
        conv2_in_gpu1 = self.conv2(x)
        x = x.cuda(device=0)
        x = F.relu(F.max_pool2d(self.conv2_drop(conv2_in_gpu1), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net()
if args.cuda:
    model.cuda(device=0)

This is just an example of what I was trying to achieve. I would like self.conv2 to run on GPU1 and the rest on GPU0.


What kind of error do you get?

This should work:

class MyModel(nn.Module):
    def __init__(self, split_gpus):
        super(MyModel, self).__init__()
        self.large_submodule1 = ...
        self.large_submodule2 = ...

        self.split_gpus = split_gpus
        if split_gpus:
            # Place each submodule on its own GPU
            self.large_submodule1.cuda(0)
            self.large_submodule2.cuda(1)

    def forward(self, x):
        x = self.large_submodule1(x)
        if self.split_gpus:
            x = x.cuda(1)  # P2P GPU transfer
        return self.large_submodule2(x)

I have updated the post with the error that I got. I tried your approach, which was similar to mine except that I did not use the P2P GPU transfer. I still get the same error: RuntimeError: tensors are on different GPUs

Could you please post the code throwing this error?

Is it absolutely necessary to specify? I have had excellent results with
something like:

model = <specify model here>
model = torch.nn.DataParallel(model, device_ids=range(torch.cuda.device_count()))

This just replicates your model on as many GPUs as you have and
splits each input batch across them.
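
A minimal, self-contained sketch of the data-parallel approach described above, for illustration only. `TinyNet` and `wrap_data_parallel` are hypothetical names that do not come from this thread, and the wrapper assumes it should fall back to the plain model when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the real model."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(x)


def wrap_data_parallel(model):
    """Wrap the model in DataParallel when at least two GPUs are visible."""
    n_gpus = torch.cuda.device_count()
    if n_gpus < 2:
        # Assumption: nothing to parallelize over, so use one GPU or the CPU.
        return model.to("cuda:0" if n_gpus == 1 else "cpu")
    # The parameters have to live on device_ids[0] before the forward pass.
    model = model.to("cuda:0")
    return nn.DataParallel(model, device_ids=list(range(n_gpus)))


model = wrap_data_parallel(TinyNet())
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# The whole batch goes to the first device; DataParallel splits it along dim 0.
x = torch.randn(16, 8, device=device)
out = model(x)
print(out.shape)
```

Note that this splits the batch, not the model, so every GPU still has to hold a full copy of the weights.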

Your approach will use data parallelism, which splits the data/batch onto different GPUs.
I think @Ashwin_Raju would like to use model sharding. At least the error sounds like this to me. :wink:


I have updated the post with the example of what I was trying to achieve: basically, running self.conv2 on a separate GPU.

It seems that in this line

x = F.relu(F.max_pool2d(self.conv2_drop(conv2_in_gpu1), 2))

conv2_in_gpu1 is still on GPU1, while self.conv2_drop etc. are on GPU0. You only transferred x back to GPU0.
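
A sketch of how the forward pass could look with the conv2 output moved back, assuming the placement is done in `__init__`. `ShardedNet`, `DEV0` and `DEV1` are illustrative names, and the single-device fallback is only there so the snippet also runs on a machine without two GPUs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumption: fall back to one device when two GPUs are not available.
if torch.cuda.device_count() >= 2:
    DEV0, DEV1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    DEV0 = DEV1 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class ShardedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5).to(DEV0)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5).to(DEV1)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50).to(DEV0)
        self.fc2 = nn.Linear(50, 10).to(DEV0)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = self.conv2(x.to(DEV1))  # run conv2 on the second device
        x = x.to(DEV0)              # move the conv2 *output* back
        x = F.relu(F.max_pool2d(self.conv2_drop(x), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


model = ShardedNet()
out = model(torch.randn(4, 1, 28, 28, device=DEV0))
print(out.device, out.shape)
```

The only change in `forward` is that the tensor returned by `self.conv2` is the one that gets transferred, instead of the old `x`.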

Btw, what is your use case?

Also, I think another error is in calling model.cuda(device=0).
This will move all model parameters and buffers to the GPU specified with the id.
So your self.conv2 layer will also be moved to GPU0, which might cause the same error again.

If you really need model sharding, I think you should define the splits in the __init__ function of your model.
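
To illustrate the point about `model.cuda(device=0)`, here is a small sketch that checks where the parameters live before and after moving the whole model. `param_devices` is a hypothetical helper, and `model.to(dev0)` is used in place of `model.cuda(0)` so the snippet also runs without a GPU.

```python
import torch
import torch.nn as nn

if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    # Assumption: single-device fallback so the sketch still runs.
    dev0 = dev1 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


def param_devices(module):
    """Hypothetical helper: the set of devices holding the module's parameters."""
    return {p.device for p in module.parameters()}


model = nn.Sequential(
    nn.Linear(4, 4).to(dev0),
    nn.Linear(4, 2).to(dev1),  # intended to stay on the second device
)
before = param_devices(model)  # two devices on a two-GPU machine

model.to(dev0)                 # moves every parameter, like model.cuda(0)
after = param_devices(model)   # only dev0 is left
print(before, after)
```

So if the per-layer placement is done in `__init__`, the model should not be moved as a whole afterwards.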


Since my entire module is computationally intensive, I would like to run half of the work on one GPU and the other half on another GPU. I will look into these corrections now.

Sounds logical, but note that you won’t see any performance gains when your model is built in a sequential manner like in your example. The GPUs will have to wait for each other, since each one cannot start working until it receives the output of the preceding layers.

In that case, you might follow @voxmenthe's suggestion of using DataParallel.


Yeah, I agree. Your solution works :slight_smile:

Can we combine DataParallel with splitting the model across different GPUs?

If I do something like this:

# GPU 0:
loss_c = a * b 

# GPU 1:
loss_f = d * e 

And then I add them together (converting one output to the other’s GPU):

total_loss = loss_c + loss_f.cuda(0)

Then when I run backward() on total_loss, will the backward pass be split across both GPUs? Or will it just take place on GPU 0?
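
A small experiment along these lines can be used to check this. It is only a sketch: the tensor shapes and the `.sum()` reductions are assumptions added to get scalar losses, and it falls back to a single device when two GPUs are not available. Since the device copy is recorded by autograd like any other operation, the gradient of each leaf tensor should end up on the device where that tensor lives.

```python
import torch

if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    # Assumption: single-device fallback so the sketch still runs.
    dev0 = dev1 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Leaf tensors created directly on their devices.
a = torch.randn(3, device=dev0, requires_grad=True)
b = torch.randn(3, device=dev0)
d = torch.randn(3, device=dev1, requires_grad=True)
e = torch.randn(3, device=dev1)

loss_c = (a * b).sum()  # computed on dev0
loss_f = (d * e).sum()  # computed on dev1
total_loss = loss_c + loss_f.to(dev0)  # the copy is part of the autograd graph

total_loss.backward()
print(a.grad.device, d.grad.device)
```

On a two-GPU machine the second printed device is the one holding `d`, which indicates that the backward pass for `loss_f` runs on GPU 1 rather than on GPU 0.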

This is a bit tricky, but is possible.
I’ve created a small code example which uses model sharding and DataParallel.
It uses 4 GPUs, where each submodule is wrapped in DataParallel and replicated on 2 GPUs:

import torch
import torch.nn as nn

class SubModule(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(SubModule, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, 1, 1)
    def forward(self, x):
        print('SubModule, device: {}, shape: {}\n'.format(x.device, x.shape))
        x = self.conv1(x)
        return x

class MyModel(nn.Module):
    def __init__(self, split_gpus, parallel):
        super(MyModel, self).__init__()
        self.module1 = SubModule(3, 6)
        self.module2 = SubModule(6, 1)
        self.split_gpus = split_gpus
        self.parallel = parallel
        if self.split_gpus and self.parallel:
            self.module1 = nn.DataParallel(self.module1, device_ids=[0, 1]).to('cuda:0')
            self.module2 = nn.DataParallel(self.module2, device_ids=[2, 3]).to('cuda:2')
    def forward(self, x):
        print('Input: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module1(x)
        print('After module1: device {}, shape {}\n'.format(x.device, x.shape))
        x = self.module2(x)
        print('After module2: device {}, shape {}\n'.format(x.device, x.shape))
        return x

model = MyModel(split_gpus=True, parallel=True)
x = torch.randn(16, 3, 24, 24).to('cuda:0')
output = model(x)

The script will output:

Input: device cuda:0, shape torch.Size([16, 3, 24, 24])

SubModule, device: cuda:0, shape: torch.Size([8, 3, 24, 24])

SubModule, device: cuda:1, shape: torch.Size([8, 3, 24, 24])

After module1: device cuda:0, shape torch.Size([16, 6, 24, 24])

SubModule, device: cuda:2, shape: torch.Size([8, 6, 24, 24])
SubModule, device: cuda:3, shape: torch.Size([8, 6, 24, 24])

After module2: device cuda:2, shape torch.Size([16, 1, 24, 24])

EDIT: As you can see, I just implemented this one use case, so the conditions on self.split_gpus and self.parallel are a bit useless. However, this should give you a starting point for your code.


Thanks, @ptrblck, for your great answer.

I just want to add that you also need to move the predicted output tensor and the ground-truth target tensor onto the same GPU before computing the loss.

When the loss function calculates the difference between the target and the prediction, it needs both tensors to be on the same GPU.
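
A minimal sketch of that step, assuming a hypothetical two-stage model (`stage1`, `stage2`) whose output ends up on a different device than the labels. The single-device fallback is only there so the snippet runs anywhere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

if torch.cuda.device_count() >= 2:
    dev_in, dev_out = torch.device("cuda:0"), torch.device("cuda:1")
else:
    # Assumption: single-device fallback so the sketch still runs.
    dev_in = dev_out = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hypothetical two-stage model; the prediction is produced on dev_out.
stage1 = nn.Linear(8, 8).to(dev_in)
stage2 = nn.Linear(8, 3).to(dev_out)

x = torch.randn(5, 8, device=dev_in)
target = torch.randint(0, 3, (5,), device=dev_in)  # labels start on dev_in

output = stage2(stage1(x).to(dev_out))
# Move the target to wherever the prediction lives before computing the loss.
loss = F.cross_entropy(output, target.to(output.device))
loss.backward()
print(loss.item())
```

Using `output.device` instead of a hard-coded GPU id keeps the loss code independent of how the model is split.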

Hi voxmenthe, I think your solution is the answer that most people are looking for!

Thank you very much! I solved my problem simply by using this code.


Hi, thanks for your solution, but when I do this I get an error in my loss function:
buffer[torch.eq(target, -1.)] = 0
RuntimeError: invalid argument 2: sizes do not match at /opt/conda/conda-bld/pytorch_1512946747676/work/torch/lib/THC/generated/…/generic/THCTensorMasked.cu:13
This is not an error in my code; it only pops up when I use data parallelism. I ran my less intensive code both with and without data parallelism, and it throws the same error only when data parallelism is used.

My model is intensive and I have 2 GPUs with 12206MiB each. I just need to split my model so that it uses both GPUs during training as well as testing.


Btw, my model is an FCN and its batch size is 1.

Hi Rao,

I guess the problem is that your model is too large to fit on a single GPU.

You need to split the model itself across different GPUs when you write it; splitting the training data will not help. Check the thread below.

Sorry, I don’t have experience writing a multi-GPU training model.


Can you explain what this means? You are using GPUs 0 and 1 for DataParallel, so what does .to('cuda:0') do there?
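
As far as I understand it, `device_ids=[0, 1]` tells DataParallel which GPUs get a replica in each forward pass, while `.to('cuda:0')` moves the actual parameters to the first of those devices, which is where DataParallel expects them and where the outputs are gathered by default. The sketch below checks this. It assumes at least two GPUs for the DataParallel branch and otherwise runs the plain module.

```python
import torch
import torch.nn as nn

module = nn.Conv2d(3, 6, 3, 1, 1)

if torch.cuda.device_count() >= 2:
    # Replicas run on GPUs 0 and 1; the parameters themselves are kept on
    # device_ids[0], which is why the wrapper is moved to 'cuda:0'.
    module = nn.DataParallel(module, device_ids=[0, 1]).to('cuda:0')
    x = torch.randn(16, 3, 24, 24, device='cuda:0')
else:
    # Assumption: fewer than two GPUs, so run the plain module instead.
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    module = module.to(device)
    x = torch.randn(16, 3, 24, 24, device=device)

out = module(x)
param_device = next(module.parameters()).device
print(param_device, out.device, out.shape)
```

With two GPUs, both printed devices are cuda:0, which matches the "After module1" line in the output posted above.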