How to split an nn.Sequential VGG network over multiple GPUs?


#1

This function is supposed to take an nn.Sequential network and put different layers on different GPUs, depending on a user-specified “strategy”:

import argparse
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument("-multigpu_strategy", default='4,9,14')
params = parser.parse_args()

def setup_multi_gpu(net):
    gpu_splits = params.multigpu_strategy.split(',')
    gpu = 0
    new_net = nn.Sequential()
    for i, layer in enumerate(net):
        if i == 0:
            new_layer = layer.cuda(0)
        else:
            if i in gpu_splits:
                gpu += 1
            new_layer = layer.cuda(gpu)
        new_net.add_module(str(i), new_layer)
    return new_net.cuda()

However, the function above doesn’t seem to work: the first GPU shows the same amount of usage regardless of the “strategy”, and the other GPUs only show a few hundred MiB of usage.

The function is meant for conv nets. I did try using nn.DataParallel, but that didn’t seem to work, which is why I tried to create a solution with the function above. My input has a batch size of 1, and I can’t seem to do net(input) after doing net = nn.DataParallel(net).

What am I doing wrong here?

Edit:

The batch size should be larger than the number of GPUs used.

Source: https://pytorch.org/docs/master/nn.html?highlight=dataparallel#torch.nn.DataParallel

So I can’t use DataParallel because my code uses a batch size of 1.
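For context, DataParallel scatters the input along the batch dimension (dim 0) across device_ids, so with a batch of 1 only one device ends up doing any work. A toy illustration, assuming two GPUs and an arbitrary conv layer chosen just for this sketch:

import torch
import torch.nn as nn

# Toy illustration (assumes 2 GPUs): DataParallel splits the input along dim 0,
# so a batch of size 1 cannot be divided across multiple devices.
layer = nn.DataParallel(nn.Conv2d(3, 64, 3, padding=1).cuda(), device_ids=[0, 1])
x = torch.randn(1, 3, 64, 64).cuda()  # batch size 1
y = layer(x)  # only one of the two GPUs receives a slice of the batch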


#2

Using this:

def setup_multi_gpu(net):
    gpu_splits = params.multigpu_strategy.split(',')
    gpu = 0
    new_net = nn.Sequential()
    new_net.cuda()
    for i, layer in enumerate(net):
        if i == 0:
            new_layer = layer.cuda(gpu)
        else:
            if i in gpu_splits:
                gpu += 1
            new_layer = layer.cuda(1)
        new_net.add_module(str(i), new_layer)

    return new_net

Results in:

    net(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)

Do I need to add some sort of device conversion layer when switching GPUs?


#3

If you would like to use model sharding, you have to create the modules on the right GPUs and push the tensors to the appropriate GPU in the forward pass.
Have a look at @apaszke’s code sample:

class MyModel(nn.Module):
    def __init__(self, split_gpus):
        super(MyModel, self).__init__()
        self.large_submodule1 = ...
        self.large_submodule2 = ...

        self.split_gpus = split_gpus
        if split_gpus:
            # place each submodule on its own GPU
            self.large_submodule1.cuda(0)
            self.large_submodule2.cuda(1)

    def forward(self, x):
        x = self.large_submodule1(x)
        if self.split_gpus:
            x = x.cuda(1) # P2P GPU transfer
        return self.large_submodule2(x)
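A rough two-GPU sketch of the same idea applied to an nn.Sequential like the one in the first post (the ShardedSequential name and the single split index are assumptions, not code from this thread):

import torch.nn as nn

class ShardedSequential(nn.Module):
    """Hypothetical helper: splits an nn.Sequential at split_idx and runs the
    two halves on cuda:0 and cuda:1, transferring the activation in between."""
    def __init__(self, net, split_idx):
        super(ShardedSequential, self).__init__()
        layers = list(net)
        self.part1 = nn.Sequential(*layers[:split_idx]).cuda(0)
        self.part2 = nn.Sequential(*layers[split_idx:]).cuda(1)

    def forward(self, x):
        x = self.part1(x.cuda(0))
        x = x.cuda(1)  # explicit device transfer, as in the sample above
        return self.part2(x)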

#4

nn.DataParallel doesn’t work with a batch size of 1, but what about the functional version of DataParallel, data_parallel? data_parallel seems like it might be the PyTorch equivalent of Lua/Torch7’s nn.GPU().

The functions themselves even seem to have basically the same inputs:

def data_parallel(module, inputs, device_ids=None, output_device=None, dim=0, module_kwargs=None):
function GPU:__init(module, device, outdevice)

Would something like this (put inside a loop) work better than just putting if statements everywhere for multiple GPUs? Each nn.data_parallel module would have one “input GPU” and one “output GPU”.

net.add_module(layer_name, nn.data_parallel(layers[i], gpus[i], out_device))
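For reference, data_parallel is a function that runs the module on the input each time it is called (typically from inside a forward pass), rather than a module that can be registered with add_module. A minimal sketch, with the toy layer and device IDs being assumptions:

import torch
import torch.nn as nn
from torch.nn.parallel import data_parallel

conv = nn.Conv2d(3, 64, 3, padding=1).cuda(0)  # toy layer, for illustration
x = torch.randn(2, 3, 64, 64).cuda(0)

# Replicates conv onto devices 0 and 1, scatters x along dim 0,
# and gathers the result back onto device 0.
y = data_parallel(conv, x, device_ids=[0, 1], output_device=0)

Note that data_parallel still scatters the input along dim 0, so the batch-size caveat that applies to nn.DataParallel applies here as well.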


#5

This is what I have so far:

def setup_multi_gpu(net):
    gpu_splits = params.multigpu_strategy.split(',')
    gpus = [0, 1, 2, 3]
    cur_chunk = nn.Sequential()
    chunks = []
    for i, l in enumerate(net):
        cur_chunk.add_module(str(i), net[i])
        if str(i) in gpu_splits and gpu_splits != '':
            del gpu_splits[0]
            chunks.append(cur_chunk)
            cur_chunk = nn.Sequential()
    chunks.append(cur_chunk)

    new_net = nn.Sequential()
    for i, chunk in enumerate(chunks):
        out_device = gpus[i]
        if i == len(chunks):
            out_device = gpus[0]
        new_net.add_module(str(i), nn.DataParallel(chunks[i], [gpus[i]], out_device))

    return new_net

But I am getting this error in my closure function:

RuntimeError: arguments are located on different GPUs at /home/ubuntu/pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:233
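That error usually means a tensor produced on one GPU is being combined with a tensor or parameter living on another, so each chunk’s output still needs an explicit transfer before it reaches the next chunk (or the loss computed in the closure). A minimal sketch of such a transfer layer, where the ToDevice name and the two-GPU placement are assumptions:

import torch.nn as nn

class ToDevice(nn.Module):
    """Hypothetical helper layer: moves its input onto a fixed GPU so that
    consecutive chunks living on different devices can be chained."""
    def __init__(self, device):
        super(ToDevice, self).__init__()
        self.device = device

    def forward(self, x):
        return x.cuda(self.device)

# e.g. inserted between chunk 0 (on GPU 0) and chunk 1 (on GPU 1):
# new_net.add_module('to_gpu_1', ToDevice(1))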