[Error] Parallelizing networks over multiple GPUs

I am trying to parallelize the following network - https://pastebin.com/raw/FgehWHw0
by using DataParallel, as suggested on the forums: network = torch.nn.DataParallel(network, device_ids=args.gpus), where network is an RNN encoder and args.gpus is a list of available GPU device IDs.
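
For context, here is a minimal, self-contained version of the wrapping (the toy encoder, sizes, and device list below are stand-ins for my actual code in the pastebin):

    import torch
    import torch.nn as nn

    gpus = [0, 1, 2]                                        # stand-in for args.gpus
    encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # stand-in for my RNN encoder
    network = nn.DataParallel(encoder, device_ids=gpus)
    network.cuda(gpus[0])                # parameters must live on device_ids[0]

    x = torch.randn(128, 32).cuda(gpus[0])   # the batch is split across the GPUs
    out = network(x)                         # outputs are gathered back on gpus[0]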

I keep running into the following error in doing so - https://pastebin.com/raw/0nfmmhkz

When I was running this on a single GPU (the default PyTorch setting) I was getting an out-of-memory error, which made me resort to parallelizing my code.

Is there a reasonable fix?

DataParallel splits the batch between different GPUs. How many GPUs are you using and what is your batch size?
Also, since you are running out of memory, I doubt DataParallel will help, because it replicates the whole model onto every device.
Have you thought about model sharding?

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, split_gpus):
        super(MyModel, self).__init__()
        self.large_submodule1 = ...  # placeholders for your actual submodules
        self.large_submodule2 = ...

        self.split_gpus = split_gpus
        if self.split_gpus:
            # place each large submodule on its own GPU
            self.large_submodule1.cuda(0)
            self.large_submodule2.cuda(1)

    def forward(self, x):
        x = self.large_submodule1(x)
        if self.split_gpus:
            x = x.cuda(1)  # P2P GPU transfer
        return self.large_submodule2(x)
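
As a concrete, runnable version of the same pattern (the ShardedNet class, layer sizes, and batch below are made up for illustration; assumes two visible GPUs):

    import torch
    import torch.nn as nn

    class ShardedNet(nn.Module):
        def __init__(self):
            super(ShardedNet, self).__init__()
            self.part1 = nn.Linear(100, 50).cuda(0)  # lives on GPU 0
            self.part2 = nn.Linear(50, 10).cuda(1)   # lives on GPU 1

        def forward(self, x):
            x = self.part1(x)     # computed on GPU 0
            x = x.cuda(1)         # transfer the activation to GPU 1
            return self.part2(x)  # computed on GPU 1

    model = ShardedNet()
    x = torch.randn(16, 100).cuda(0)  # input starts on GPU 0
    out = model(x)                    # output ends up on GPU 1

Note that the output, and therefore the loss, lives on cuda:1, so the target tensor has to be moved there as well.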

Thanks for your reply. I have 6 GPUs (I was using 3 of them as my device IDs when running this code), and my batch size is 128. I haven’t used model sharding. Will try it out. Thanks for the pointer!

Hi @ptrblck,
I was trying out your solution of model sharding. However, I am having trouble calling loss.backward(), since my loss is the sum of the losses from the two decoders, which sit on different GPUs. Would you have pointers about a workaround for this issue?

My error trace specifically is:

Train epch 1, 0.00 s - (Done 1 of 73) Traceback (most recent call last):
  File "main_interlingua.py", line 432, in <module>
    loss, norm_e, norm_d = train(args, train_batch, encoder, decoder, decoder2, encoder_optimizer, decoder_optimizer, decoder_optimizer2, criterion)
  File "main_interlingua.py", line 181, in train
    loss.backward()
  File "/usr0/home/spoddar2/anaconda/lib/python2.7/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/usr0/home/spoddar2/anaconda/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1512378422383/work/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:269

How did you calculate your loss?
It should be something like loss = loss0 + loss1.cuda(0).
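For example, with two toy decoders sharded across the GPUs (the Linear stand-ins, shapes, and MSE criterion are made up for illustration; assumes two visible GPUs):

    import torch
    import torch.nn as nn

    criterion = nn.MSELoss()
    decoder1 = nn.Linear(10, 10).cuda(0)  # stand-in for the first decoder
    decoder2 = nn.Linear(10, 10).cuda(1)  # stand-in for the second decoder

    out0 = decoder1(torch.randn(4, 10).cuda(0))
    out1 = decoder2(torch.randn(4, 10).cuda(1))
    tgt0 = torch.randn(4, 10).cuda(0)
    tgt1 = torch.randn(4, 10).cuda(1)

    loss0 = criterion(out0, tgt0)  # lives on cuda:0
    loss1 = criterion(out1, tgt1)  # lives on cuda:1
    loss = loss0 + loss1.cuda(0)   # move loss1 to cuda:0 before summing
    loss.backward()                # gradients flow to parameters on both GPUs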
Could you post the code snippet right before the error was thrown?