How to split an nn.Sequential VGG network over multiple GPUs?

ProGamerGov · May 15, 2018, 7:56pm

This function is supposed to take an nn.Sequential network and put different layers on different GPUs, depending on a user-specified “strategy”:

import argparse
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument("-multigpu_strategy", default='4,9,14')
params = parser.parse_args()

def setup_multi_gpu(net):
    gpu_splits = params.multigpu_strategy.split(',')
    gpu = 0
    new_net = nn.Sequential()
    for i, layer in enumerate(net):
        if i == 0:
            new_layer = layer.cuda(0)
        else:
            if i in gpu_splits:
                gpu += 1
            new_layer = layer.cuda(gpu)
        new_net.add_module(str(i), new_layer)
    return new_net.cuda()

However, the function above doesn’t seem to work: the first GPU shows the same memory usage regardless of the “strategy”, and the other GPUs only show a few hundred MiB of usage.
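Two details in the function above are worth checking, sketched here in plain Python. First, split(',') returns strings, so the test `i in gpu_splits` with an integer `i` never matches and `gpu` stays at 0. Second, the final `new_net.cuda()` moves every parameter to the current default GPU, which would undo any per-layer placement. The helper name `assign_gpus` is hypothetical; it only computes which GPU index each layer would receive once the split points are parsed as integers.

```python
def assign_gpus(num_layers, strategy):
    """Return the GPU index for each layer, given split points like '4,9,14'."""
    # Parse the comma-separated strategy into integers so that membership
    # tests against the integer layer index can succeed.
    splits = {int(s) for s in strategy.split(',') if s.strip()}
    gpu = 0
    placement = []
    for i in range(num_layers):
        # Advance to the next GPU when a split point is reached (never at layer 0).
        if i > 0 and i in splits:
            gpu += 1
        placement.append(gpu)
    return placement


# The original comparison: an int is never found in a list of strings.
string_splits = '4,9,14'.split(',')
never_matches = 4 in string_splits

placement = assign_gpus(16, '4,9,14')
print(never_matches)
print(placement)
```

With 16 layers and the default strategy, layers 0 to 3 map to GPU 0, 4 to 8 to GPU 1, 9 to 13 to GPU 2, and 14 to 15 to GPU 3.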

The function is meant for conv nets. I did try using nn.DataParallel, but that didn’t seem to work, which is why I tried to create a solution with the function above. My input has a batch size of 1, and I can’t seem to do net(input) after doing net = nn.DataParallel(net).

What am I doing wrong here?

Edit:

The batch size should be larger than the number of GPUs used.

Source: torch.nn — PyTorch master documentation

So I can’t use DataParallel because my code uses a batch size of 1.
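For context, DataParallel parallelises by splitting the input along the batch dimension and sending one slice to each GPU, so a batch of one leaves only a single slice to distribute. The sketch below mimics that split in plain Python, assuming ceil-division chunking in the style of torch.chunk; `chunk_sizes` is a hypothetical helper, not a PyTorch function.

```python
import math


def chunk_sizes(batch_size, num_gpus):
    """Sizes of the slices a batch would be cut into (torch.chunk-style, assumed)."""
    if batch_size == 0:
        return []
    # Each slice holds ceil(batch_size / num_gpus) samples, except possibly the last.
    per_chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        size = min(per_chunk, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


single = chunk_sizes(1, 4)   # one slice only, so one GPU does all the work
even = chunk_sizes(8, 4)     # one slice per GPU
print(single, even)
```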

ProGamerGov · May 15, 2018, 11:03pm

Using this:

def setup_multi_gpu(net):
    gpu_splits = params.multigpu_strategy.split(',')
    gpu = 0
    new_net = nn.Sequential()
    new_net.cuda()
    for i, layer in enumerate(net):
        if i == 0:
            new_layer = layer.cuda(gpu)
        else:
            if i in gpu_splits:
                gpu += 1
            new_layer = layer.cuda(1)
        new_net.add_module(str(i), new_layer)

    return new_net

Results in:

    net(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)

Do I need to add some sort of device conversion layer when switching GPUs?

ptrblck · May 15, 2018, 11:35pm

If you would like to use model sharding, you have to create the modules on the right GPUs and move the tensors to the appropriate GPU in the forward pass.
Have a look at @apaszke’s code sample:

class MyModel(nn.Module):
    def __init__(self, split_gpus):
        super(MyModel, self).__init__()
        self.large_submodule1 = ...
        self.large_submodule2 = ...

        self.split_gpus = split_gpus
        if split_gpus:
            # Place each submodule on its own GPU
            self.large_submodule1.cuda(0)
            self.large_submodule2.cuda(1)

    def forward(self, x):
        x = self.large_submodule1(x)
        if self.split_gpus:
            x = x.cuda(1)  # P2P GPU transfer
        return self.large_submodule2(x)
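The same idea can be generalised to an nn.Sequential, as a rough sketch rather than a definitive implementation. The class `ShardedSequential`, the helper `pick_devices`, and the small conv net are all hypothetical names introduced for illustration. Each layer is placed on a device chosen from integer split points, and the forward pass moves the activation to that layer's device before calling it. The example falls back to CPU when fewer GPUs are available than requested, so it also runs on a machine without CUDA.

```python
import torch
import torch.nn as nn


class ShardedSequential(nn.Module):
    """Hypothetical wrapper that runs each layer on its assigned device."""

    def __init__(self, net, split_points, devices):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layer_devices = []
        splits = set(split_points)
        d = 0
        for i, layer in enumerate(net):
            # Switch to the next device at each split point, if one is left.
            if i in splits and d < len(devices) - 1:
                d += 1
            self.layers.append(layer.to(devices[d]))
            self.layer_devices.append(devices[d])

    def forward(self, x):
        for layer, device in zip(self.layers, self.layer_devices):
            # .to() is a no-op when x already lives on the target device.
            x = layer(x.to(device))
        return x


def pick_devices(n):
    """Return n CUDA devices if available, otherwise n CPU placeholders."""
    if torch.cuda.device_count() >= n:
        return [torch.device("cuda", i) for i in range(n)]
    return [torch.device("cpu")] * n


net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
)
devices = pick_devices(2)
model = ShardedSequential(net, split_points=[2], devices=devices)
out = model(torch.randn(1, 3, 16, 16))  # batch size of 1 is fine here
print(out.shape, out.device)
```

Note that the wrapper is returned without a final .cuda() call, since that would move every layer back onto a single device.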

ProGamerGov · May 16, 2018, 6:39am

nn.DataParallel doesn’t work with a batch size of 1, but what about data_parallel, the functional version of DataParallel? It looks like it might be the PyTorch counterpart of Lua/Torch7’s nn.GPU().

The functions themselves even seem to have basically the same inputs:

def data_parallel(module, inputs, device_ids=None, output_device=None, dim=0, module_kwargs=None):

function GPU:__init(module, device, outdevice)

Would something like this (placed inside a loop) work better than scattering if statements everywhere for multiple GPUs? Each nn.data_parallel module would have one “input GPU” and one “output GPU”.

net.add_module(layer_name, nn.data_parallel(layers[i], gpus[i], out_device))

Source:

github.com/torch/nn/blob/master/GPU.lua

------------------------------------------------------------------------
--[[ GPU ]]--
-- Decorates a module such that its parameters are
-- hosted on a specified GPU device.
-- The operations are also executed on that device.
-- Arguments input and gradOutput are converted to the specified device 
-- before being fed to the decorated module. 
-- Returned output is on the specified outdevice (defaults to device). 
-- Returned gradInput is allocated on the same device as the input.
-- The unit test is located in cunn.
------------------------------------------------------------------------
local GPU, parent = torch.class("nn.GPU", "nn.Container")

function GPU:__init(module, device, outdevice)
   parent.__init(self)
   assert(torch.type(device) == 'number')
   self.device = device
   self.outdevice = outdevice or device
   
   assert(torch.isTypeOf(module, 'nn.Module'))

This file has been truncated.

github.com/pytorch/pytorch/blob/master/torch/nn/parallel/data_parallel.py#L133-L163


def data_parallel(module, inputs, device_ids=None, output_device=None, dim=0, module_kwargs=None):
    r"""Evaluates module(input) in parallel across the GPUs given in device_ids.

    This is the functional version of the DataParallel module.

    Args:
        module: the module to evaluate in parallel
        inputs: inputs to the module
        device_ids: GPU ids on which to replicate module
        output_device: GPU location of the output  Use -1 to indicate the CPU.
            (default: device_ids[0])
    Returns:
        a Tensor containing the result of module(input) located on
        output_device
    """
    if not isinstance(inputs, tuple):
        inputs = (inputs,)

    if device_ids is None:
        device_ids = list(range(torch.cuda.device_count()))

This file has been truncated.

ProGamerGov · May 17, 2018, 12:24am

This is what I have so far:

def setup_multi_gpu(net):
    gpu_splits = params.multigpu_strategy.split(',')
    gpus = [0,1,2,3]
    cur_chunk = nn.Sequential()
    chunks = []
    for i, l in enumerate(net):
        cur_chunk.add_module(str(i), net[i])
        if str(i) in gpu_splits and gpu_splits != '':
            del gpu_splits[0]
            chunks.append(cur_chunk)
            cur_chunk = nn.Sequential()
    chunks.append(cur_chunk)

    new_net = nn.Sequential()
    for i, chunk in enumerate(chunks):
        out_device = gpus[i]
        if i == len(chunks):
            out_device = gpus[0]
        new_net.add_module(str(i), nn.DataParallel(chunks[i], [gpus[i]], out_device))

    return new_net

But I am getting this error in my closure function:

RuntimeError: arguments are located on different GPUs at /home/ubuntu/pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:233
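The closure itself is not shown, so the cause of this error is not known. One common way to hit it is combining the network's output with a tensor that lives on a different GPU, for example a target inside a loss. The following is a minimal sketch under that assumption; `loss_on_output_device` is a hypothetical helper, the MSE loss is only a stand-in, and the example falls back to CPU when fewer than two GPUs are present.

```python
import torch
import torch.nn.functional as F


def loss_on_output_device(output, target):
    """Compute an MSE loss after moving the target to the output's device."""
    target = target.to(output.device)
    return F.mse_loss(output, target)


# Simulate an output on the last GPU and a target left on the first one.
two_gpus = torch.cuda.device_count() >= 2
out_dev = torch.device("cuda", 1) if two_gpus else torch.device("cpu")
tgt_dev = torch.device("cuda", 0) if two_gpus else torch.device("cpu")

output = torch.ones(1, 3, 4, 4, device=out_dev)
target = torch.ones(1, 3, 4, 4, device=tgt_dev)
loss = loss_on_output_device(output, target)
print(loss.item(), loss.device)
```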