Help with running a sequential model across multiple GPUs, in order to make use of more GPU memory

ProGamerGov · September 22, 2018, 8:01pm

I’m looking for a way to replicate some behavior from Lua/Torch7’s nn.GPU. With it, one could split a single model across multiple GPUs and still run the backward pass without having to modify the closure function (feval in the second excerpt below):

github.com/jcjohnson/neural-style/blob/master/neural_style.lua#L360-L405

          function setup_multi_gpu(net, params)
            local DEFAULT_STRATEGIES = {
              [2] = {3},
            }
            local gpu_splits = nil
            if params.multigpu_strategy == '' then
              -- Use a default strategy
              gpu_splits = DEFAULT_STRATEGIES[#params.gpu]
              -- Offset the default strategy by one if we are using TV
              if params.tv_weight > 0 then
                for i = 1, #gpu_splits do gpu_splits[i] = gpu_splits[i] + 1 end
              end
            else
              -- Use the user-specified multigpu strategy
              gpu_splits = params.multigpu_strategy:split(',')
              for i = 1, #gpu_splits do
                gpu_splits[i] = tonumber(gpu_splits[i])
              end
            end
            assert(gpu_splits ~= nil, 'Must specify -multigpu_strategy')

This file has been truncated.

github.com/jcjohnson/neural-style/blob/master/neural_style.lua#L275-L298

          -- Function to evaluate loss and gradient. We run the net forward and
          -- backward to get the gradient, and sum up losses from the loss modules.
          -- optim.lbfgs internally handles iteration and calls this function many
          -- times, so we manually count the number of iterations to handle printing
          -- and saving intermediate results.
          local num_calls = 0
          local function feval(x)
            num_calls = num_calls + 1
            net:forward(x)
            local grad = net:updateGradInput(x, dy)
            local loss = 0
            for _, mod in ipairs(content_losses) do
              loss = loss + mod.loss
            end
            for _, mod in ipairs(style_losses) do
              loss = loss + mod.loss
            end
            maybe_print(num_calls, loss)
            maybe_save(num_calls)

This file has been truncated.

The key to this seems to have been the nn.GPU module in Torch7:

github.com/torch/nn/blob/master/doc/simple.md#gpu

<a name="nn.simplelayers.dok"></a>
# Simple layers #
Simple Modules are used for various tasks like adapting Tensor methods and providing affine transformations :

  * Parameterized Modules :
    * [Linear](#nn.Linear) : a linear transformation ;
    * [LinearWeightNorm](#nn.LinearWeightNorm) : a weight normalized linear transformation ;
    * [SparseLinear](#nn.SparseLinear) : a linear transformation with sparse inputs ;
    * [IndexLinear](#nn.IndexLinear) : an alternative linear transformation with for sparse inputs and max normalization ;
    * [Bilinear](#nn.Bilinear) : a bilinear transformation with sparse inputs ;
    * [PartialLinear](#nn.PartialLinear) : a linear transformation with sparse inputs with the option of only computing a subset ;
    * [Add](#nn.Add) : adds a bias term to the incoming data ;
    * [CAdd](#nn.CAdd) : a component-wise addition to the incoming data ;
    * [Mul](#nn.Mul) : multiply a single scalar factor to the incoming data ;
    * [CMul](#nn.CMul) : a component-wise multiplication to the incoming data ;
    * [Euclidean](#nn.Euclidean) : the euclidean distance of the input to `k` mean centers ;
    * [WeightedEuclidean](#nn.WeightedEuclidean) : similar to [Euclidean](#nn.Euclidean), but additionally learns a diagonal covariance matrix ;
    * [Cosine](#nn.Cosine) : the cosine similarity of the input to `k` mean centers ;
    * [Kmeans](#nn.Kmeans) : [Kmeans](https://en.wikipedia.org/wiki/K-means_clustering) clustering layer;
  * Modules that adapt basic Tensor methods :

This file has been truncated.

The intended use-case is not for model-parallelism where the models are executed in parallel on multiple devices, but for sequential models where a single GPU doesn’t have enough memory.

To replicate this in PyTorch, I started with nn.DataParallel:

github.com/ProGamerGov/neural-style-pt/blob/multi-gpu/neural_style.py#L292-L315

          multidevice = False
          if "," in str(params.gpu): 
              params.gpu = params.gpu.split(',')
              multidevice = True
          
              if 'c' in str(params.gpu[0]).lower():
                  backward_device = "cpu"
                  setup_cuda()
                  setup_cpu()
              else:
                  backward_device = "cuda:" + params.gpu[0]
                  setup_cuda()
              dtype = torch.FloatTensor
          
          elif "c" not in str(params.gpu).lower():
              setup_cuda()
              dtype = torch.cuda.FloatTensor
              backward_device = "cuda:" + str(params.gpu)
          else:

This file has been truncated.

However, I still seem to have to manually move the output of each set of model layers to a single GPU. That GPU then ends up with very high memory usage, which negates what I am trying to do. As I understand it, nn.DataParallel is meant for batch sizes larger than 1, but I am only using a batch size of 1 (style transfer). The issue of nn.DataParallel using a lot of memory on a single GPU is documented here, here, and in many other posts.

Basically, I am trying to split a sequential model into a set of smaller models placed on different GPUs, so that more total GPU memory is available and larger inputs/outputs can be used. I am not sure that nn.DataParallel is the best option for this, but I am not aware of any alternatives that would work.
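To make the idea concrete, here is a minimal sketch of the splitting step only, in plain Python with no framework involved. The function name `partition_layers` and the layer names are hypothetical. The convention that each split point is the 1-based index of the last layer in a chunk is an assumption of this sketch, loosely modelled on the `gpu_splits` table in the Lua excerpt above, and may not match the original exactly.

```python
def partition_layers(layers, split_points):
    """Split an ordered list of layers into consecutive chunks.

    split_points holds 1-based indices; each one marks the last layer
    of a chunk (an assumed convention for this sketch).
    """
    split_points = list(split_points)
    if sorted(set(split_points)) != split_points:
        raise ValueError("split points must be strictly increasing")
    if split_points and not (1 <= split_points[0] and split_points[-1] < len(layers)):
        raise ValueError("split points must fall inside the layer list")

    chunks, start = [], 0
    for end in split_points:
        chunks.append(layers[start:end])
        start = end
    chunks.append(layers[start:])  # remaining layers form the final chunk
    return chunks


# Placeholder layer names; in practice these would be the model's modules.
layers = ["conv1", "relu1", "conv2", "relu2", "pool1", "conv3"]
chunks = partition_layers(layers, [2, 5])
print(chunks)
```

In PyTorch, each chunk could presumably be wrapped with `nn.Sequential(*chunk).to(device)` and the activation moved with `x.to(device)` before it enters the next chunk. Since `.to()` is tracked by autograd, a single backward call should then cover all chunks, but I have not verified the memory behavior of that approach.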

Here’s a basic MSPaint diagram of what I am currently doing in my code, with an example of 4 GPUs. The total number of GPUs, and how many layers to give each GPU, is meant to be entirely user controlled.
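As a rough sketch of the user-controlled part, the two command-line values could be parsed and checked against each other as shown here. The helper names `parse_devices` and `parse_strategy` are hypothetical and the strategy string "3,7,12" is only an example value. Mapping a token that starts with "c" to the CPU loosely mirrors the check in the Python excerpt above. The rule that N devices need N - 1 split points is an assumption of this sketch.

```python
def parse_devices(gpu_arg):
    """Turn a string such as "0,1,c" into device names."""
    devices = []
    for token in str(gpu_arg).split(","):
        token = token.strip().lower()
        if token.startswith("c"):
            devices.append("cpu")
        else:
            devices.append("cuda:" + str(int(token)))
    return devices


def parse_strategy(strategy, num_devices):
    """Turn a string such as "3,7,12" into split points and check the count."""
    split_points = [int(s) for s in strategy.split(",") if s.strip()]
    if len(split_points) + 1 != num_devices:
        raise ValueError(
            f"{num_devices} devices need {num_devices - 1} split points, "
            f"got {len(split_points)}"
        )
    return split_points


# Example with 4 GPUs, as in the diagram: three split points give four chunks.
devices = parse_devices("0,1,2,3")
split_points = parse_strategy("3,7,12", len(devices))
print(devices, split_points)
```

Failing early when the number of split points does not match the number of devices avoids ending up with a device that silently receives no layers.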