DataParallel imbalanced memory usage

You should be able to use, maybe with minor modifications, the one I posted above. It works, at least in 0.4.0

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import torch
import torch.nn as nn

class FullModel(nn.Module):
  def __init__(self, model, loss):
    super(FullModel, self).__init__()
    self.model = model
    self.loss = loss

  def forward(self, targets, *inputs):
    outputs = self.model(*inputs)
    loss = self.loss(outputs, targets)
    return torch.unsqueeze(loss,0),outputs

def DataParallel_withLoss(model,loss,**kwargs):
    model=FullModel(model, loss)
    if 'device_ids' in kwargs.keys():
    if 'output_device' in kwargs.keys():
    if 'cuda' in kwargs.keys():
        model=torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).cuda(cudaID)
        model=torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).cuda()
    return model
class toy(nn.Module):
    def __init__(self):
        super(toy, self).__init__()
        self.conv2d = torch.nn.Conv2d(1,3,1)
    def forward(self,x):
        return self.conv2d(x)
model = toy()
optimizer = torch.optim.SGD(model.parameters(),lr=1)
loss = torch.nn.L1Loss()
model = DataParallel_withLoss(model,loss)
gt = torch.rand(2,3,10,10)
input = torch.rand(2,1,10,10)
loss,_ = model(gt,input)
loss = loss.sum()

toy example


The toy example is brevity.
According to your example, I reproduced it in my codes . However, there is a weird bug. Do you know what cause this ?

Traceback (most recent call last):
  File "", line 363, in <module>
  File "", line 97, in main
    train(args, trainer, task, epoch_itr)
  File "", line 135, in train
    log_output = trainer.train_step(sample, update_params=True)
  File "/data/mmyin/tf-datapallelism/fairseq/", line 120, in train_step
    loss, sample_size, logging_output, oom_fwd = self._forward(sample)
  File "/data/mmyin/tf-datapallelism/fairseq/", line 212, in _forward
    oss, sample_size, logging_output_ = self.full_model(sample)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/", line 124, in forward
    return self.gather(outputs, self.output_device)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/", line 136, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/", line 67, in gather
    return gather_map(outputs)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
 File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Well that error means that zip requires itarable inputs, such as list or tuples. Therefore it is taking as input something which is non iterable. What exactly? I don’t know since I don’t have your implementation

Thanks for your advice.
I have a question about final loss. Because I found the final loss is not sum of a batch, but a list containing mini batch loss.
For example, a big batch is divided into four parts for computing on 4 GPUs, while loss,_ = model(gt,input) will return a list containing four partial losses.
Is that correct and does need to sum them up manually?

Yeh sorry, you should sum them before applying loss backward.
I rewrote the toy example with a 2dconv

We had the same issue, in that we could only train with a much smaller batch size when parallelizing.
Using DistributedDataParallel in both model and loss got us much better results. You have to use DistributedSampler and init_process_group, but it’s all in this example:

However, we have not seen massive improvements in speed, probably due to our slow dataloader/data transfer as our input size is quite large…
Both methods, DistributedDataParallel or DataParallel, running on a AWS P3 with 8 GPUs barely improved at all compared to a single GPU (perhaps the variation on the time required for an epoch is reduced, but the average time is about the same). That doesn’t make much sense, has anyone seen the same problem?

hi, I found your toy code solution for the dataparallel problem.Your work is fantastic.
But when I immitated it on my own code, things went wrong.
it gave me RuntimeError: all tensors must be on devices[0]
Here is my Model_with_parallel:

class G_FullModel(nn.Module):
    def __init__(self, actor, discriminator,g_loss):
        super(G_FullModel, self).__init__()
        self.G = actor
        self.D = discriminator
        self.loss = g_loss

    def forward(self, targets, inputs, corpus):
        fake_reply, word_probabilities, hiddens = G.sample(inputs, targets, TF=0)
        num_samples = 3
        rewards = G.monte_carlo(D, inputs, fake_reply, hiddens, num_samples,
        pg_loss = loss(rewards, word_probabilities)
        return torch.unsqueeze(pg_loss, 0), outputs

Here is how I used it:

G_model_parallel = G_FullModel(actor, discriminator, g_loss)
G_model_parallel = torch.nn.DataParallel(G_model_parallel, device_ids=[1, 2, 3],output_device=[1]).cuda()

context =
reply =
loss, _ = G_model_parallel(reply, context, corpus)

Beg for your help!
I have tried to dataparallel my model and loss partly, the code could run.But still the GPU-Util on other gpus except device1 is almostly zero.
thank you very much!!

I’ve had the same imbalanced problem due to a very complex regularization method in CNN. The method in this post is kind of complicated and requires a lot of code changing. If you are also using a layer wise regularization method, you can try gradient accumulation as a quick fix. The idea is to calculate the regularizer gradient of a single layer at a time, and let the graph be freed before calculating the next one.

I am having the same imbalance issue but the problem is that my gpu 1 not gpu 0 is going out of memory. Both gpus have 32GB of memory. With NVIDIA-SMI i see that gpu 0 is only using 6GB of memory whereas, gpu 1 goes to 32.
I could have understood if it was other way around with gpu 0 going out of memory but this is weird.
I only pass my model to the DataParallel so it’s using the default values.
Also, if I use only 1 GPU, i don’t get any out of memory issues. This is also strange for me.
Any help would be appreciated.
p.s. I was getting warning about rnn parameters not being in contiguous memory so i added the flatten_parameters() call as well in forward of lstm

 cudaID = str(torch.cuda.current_device())
 device = torch.device("cuda:" + cudaID) 
 print('device = ', device) // this prints cuda:0

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    encoder = torch.nn.DataParallel(encoder)
    lstm_model = torch.nn.DataParallel(lstm_model)

// forward method of lstm

def forward(self, inputs, mode='train'):
    packed = tn.pack_sequence(inputs, enforce_sorted=False)
    self.hidden = self.init_hidden(len(inputs), // sending device of packed so both packed and self.hidden are on same device, as self.hidden is created in every call and im using multiple gpus
    if mode == 'eval' or mode == 'test':
        with torch.no_grad():
            packed_out, self.hidden = self.lstm(packed, self.hidden)
        packed_out, self.hidden = self.lstm(packed, self.hidden)

    outputs, lens = tn.pad_packed_sequence(packed_out, batch_first=True)

    return outputs