How to split an nn.Sequential VGG network over multiple GPUs?

This function is supposed to take an nn.Sequential network and place different layers on different GPUs according to a user-specified “strategy”:

import argparse
import torch.nn as nn
parser = argparse.ArgumentParser()
parser.add_argument("-multigpu_strategy", default='4,9,14')
params = parser.parse_args()

def setup_multi_gpu(net):
    gpu_splits = params.multigpu_strategy.split(',')
    gpu = 0
    new_net = nn.Sequential()
    for i, layer in enumerate(net):
        if i == 0:
           new_layer = layer.cuda(0)
        else: 
           if i in gpu_splits:
               gpu+=1
           new_layer = layer.cuda(gpu)
        new_net.add_module(str(i), new_layer)
    return new_net.cuda()

However, the above function doesn’t seem to work: the first GPU shows the same memory usage regardless of the “strategy”, and the other GPUs show only a few hundred MiB of usage.

The function is meant for conv nets. I did try using nn.DataParallel, but that didn’t seem to work, which is why I tried to create a solution with the function above. My input has a batch size of 1, and I can’t seem to do net(input) after doing net = nn.DataParallel(net).

What am I doing wrong here?
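
Two details in the function above may explain the symptoms. params.multigpu_strategy.split(',') yields strings, so the test `i in gpu_splits` with an integer i never matches and gpu stays at 0; and the final new_net.cuda() moves every parameter to the current default GPU. Below is a minimal sketch of the index-to-GPU mapping alone, written in plain Python so it runs without a GPU. `assign_gpus` is a hypothetical helper name, not part of PyTorch.

```python
def assign_gpus(num_layers, strategy):
    """Return one GPU index per layer; the index advances at each split point."""
    # Convert the comma-separated strategy to integers so that the
    # membership test below compares int with int.
    split_points = set(int(s) for s in strategy.split(',') if s.strip())
    gpu = 0
    assignment = []
    for i in range(num_layers):
        if i > 0 and i in split_points:
            gpu += 1
        assignment.append(gpu)
    return assignment

# The original comparison never matches, because 4 is not equal to '4'.
print(4 in '4,9,14'.split(','))   # False
print(assign_gpus(6, '2,4'))      # [0, 0, 1, 1, 2, 2]
```

Each layer could then be placed with layer.cuda(assignment[i]), without a final .cuda() call on the container.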

Edit:

The docs say: “The batch size should be larger than the number of GPUs used.”

Source: https://pytorch.org/docs/master/nn.html?highlight=dataparallel#torch.nn.DataParallel

So I can’t use DataParallel because my code uses a batch size of 1.
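
To illustrate why a batch size of 1 leaves the extra GPUs idle: DataParallel splits the input along the batch dimension and sends one piece to each device. Here is a rough sketch of the piece sizes, assuming the split follows torch.chunk semantics (pieces of size ceil(batch / num_gpus)). `scatter_sizes` is a hypothetical helper in plain Python, not a PyTorch function.

```python
import math

def scatter_sizes(batch_size, num_gpus):
    """Sizes of the per-GPU pieces when a batch is split into at most num_gpus chunks."""
    chunk = int(math.ceil(batch_size / float(num_gpus)))
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

print(scatter_sizes(1, 4))  # [1]
print(scatter_sizes(8, 4))  # [2, 2, 2, 2]
```

With a batch of 1 there is only one piece, so under this assumption only one GPU receives any work.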

Using this:

def setup_multi_gpu(net):
    gpu_splits = params.multigpu_strategy.split(',')
    gpu = 0
    new_net = nn.Sequential()
    new_net.cuda() 
    for i, layer in enumerate(net):
        if i == 0:
           new_layer = layer.cuda(gpu)
        else: 
           if i in gpu_splits:
               gpu+=1
           new_layer = layer.cuda(1)
        new_net.add_module(str(i), new_layer)

    return new_net

Results in:

    net(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)

Do I need to add some sort of device conversion layer when switching GPUs?
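
One way to express such a conversion step, offered as a hedged sketch rather than an established PyTorch layer: a small module that moves its input to a given device, inserted into the nn.Sequential wherever the GPU changes. `ToDevice` is a hypothetical name. The example uses CPU devices so it runs anywhere, and assumes PyTorch 0.4 or later for torch.device and Tensor.to.

```python
import torch
import torch.nn as nn

class ToDevice(nn.Module):
    """Moves the incoming tensor to `device` and passes it on unchanged."""
    def __init__(self, device):
        super(ToDevice, self).__init__()
        self.device = torch.device(device)

    def forward(self, x):
        return x.to(self.device)

# On a multi-GPU machine dev0/dev1 would be 'cuda:0' and 'cuda:1'.
dev0, dev1 = 'cpu', 'cpu'
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1).to(dev0),
    nn.ReLU(),
    ToDevice(dev1),  # hand the activation over to the next device
    nn.Conv2d(8, 4, kernel_size=3, padding=1).to(dev1),
)
out = net(torch.randn(1, 3, 16, 16).to(dev0))
print(out.shape)
```

Calling .cuda() on the whole container afterwards would undo this placement, since it moves every parameter to a single device.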

If you would like to use model sharding, you have to create the modules on the right GPUs and move the tensors to the appropriate GPU in the forward pass.
Have a look at @apaszke’s code sample:

class MyModel(nn.Module):
    def __init__(self, split_gpus):
        super(MyModel, self).__init__()  # required before assigning submodules
        self.large_submodule1 = ...
        self.large_submodule2 = ...

        self.split_gpus = split_gpus
        if split_gpus:
            self.large_submodule1.cuda(0)
            self.large_submodule2.cuda(1)

    def forward(self, x):
        x = self.large_submodule1(x)
        if self.split_gpus:
            x = x.cuda(1)  # P2P GPU transfer
        return self.large_submodule2(x)

nn.DataParallel doesn’t work with a batch size of 1, but what about the functional version of DataParallel, data_parallel? It seems like it might be the PyTorch equivalent of Lua/Torch7’s nn.GPU().

The functions themselves even seem to have basically the same inputs:

def data_parallel(module, inputs, device_ids=None, output_device=None, dim=0, module_kwargs=None):
function GPU:__init(module, device, outdevice)

Would something like this (inside a loop) work better than putting if statements everywhere for multiple GPUs? Each nn.data_parallel module would have one “input GPU” and one “output GPU”.

net.add_module(layer_name, nn.data_parallel(layers[i], gpus[i], out_device))

This is what I have so far:

def setup_multi_gpu(net):
    gpu_splits = params.multigpu_strategy.split(',')
    gpus = [0,1,2,3]
    cur_chunk = nn.Sequential()
    chunks = []
    for i, l in enumerate(net):
         cur_chunk.add_module(str(i), net[i])
         if str(i) in gpu_splits and gpu_splits != '':
             del gpu_splits[0]
             chunks.append(cur_chunk)
             cur_chunk = nn.Sequential()
    chunks.append(cur_chunk)

    new_net = nn.Sequential()
    for i, chunk in enumerate(chunks):
         out_device = gpus[i]
         if i == len(chunks):
             out_device = gpus[0]
         new_net.add_module(str(i), nn.DataParallel(chunks[i], [gpus[i]], out_device))

    return new_net

But I am getting this error in my closure function:

RuntimeError: arguments are located on different GPUs at /home/ubuntu/pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:233
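
This message comes from a pointwise operation whose operands live on different GPUs. One possible cause, stated as an assumption since the closure is not shown: loss terms computed on different devices are combined without first being moved to a common device. Below is a minimal sketch of gathering them before summing. `total_loss` is a hypothetical helper, and CPU tensors stand in for the per-GPU losses so the example runs anywhere.

```python
import torch

def total_loss(losses, device):
    """Sum scalar loss tensors after moving each one to a common device."""
    total = torch.zeros((), device=device)
    for loss in losses:
        total = total + loss.to(device)
    return total

# On a multi-GPU machine these would live on e.g. 'cuda:0', 'cuda:1', ...
losses = [torch.tensor(1.5), torch.tensor(2.0), torch.tensor(0.5)]
loss = total_loss(losses, torch.device('cpu'))
print(loss.item())  # 4.0
```

The same idea applies to any target tensor compared against a layer’s output: moving it with target.to(output.device) before the comparison keeps both operands on one GPU.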