DataParallel imbalanced memory usage

Hi,

You could do that, but transferring a few GB of data many times between the CPU and GPU will be quite slow.
And in practice, as I said above, what is going to take a lot of space are the intermediate results, and that is going to be even worse.

That being said, if your problem is that your net is too big even for a batch size of 1, then you don’t want data parallelism but model parallelism.
This can be done very easily by adding a few .cuda(1), .cuda(2)… calls in your forward pass, sending different parts of your net to different GPUs according to that split.
There is no built-in method to do this, as it needs to be done directly in your forward method, and you need to make sure that each part of the model and its input are on the right GPU.
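For example, something along these lines (an untested sketch assuming 2 GPUs; where you cut the network and the layer sizes are just for illustration):

import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super(TwoGPUNet, self).__init__()
        # first half of the net lives on GPU 0, second half on GPU 1
        self.part1 = nn.Sequential(nn.Linear(1000, 1000), nn.ReLU()).cuda(0)
        self.part2 = nn.Linear(1000, 10).cuda(1)

    def forward(self, x):
        x = self.part1(x.cuda(0))   # run the first half on GPU 0
        x = x.cuda(1)               # move the intermediate activations to GPU 1
        return self.part2(x)        # run the second half on GPU 1

net = TwoGPUNet()
out = net(torch.randn(8, 1000))     # the output ends up on GPU 1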

That makes sense to me! It would be great if you could write a post/blog/sample code about what we are discussing here. It would be very helpful!

Thanks a lot,
Best,
Yuzhou

Hi there @albanD, @Yuzhou_Song,
I noticed a small mistake in the code you provided:
it’s necessary to unsqueeze the loss inside the forward pass so that DataParallel can gather the losses back together. The loss returned by PyTorch loss functions is a 0-dim tensor, and DataParallel gathers the per-GPU outputs along dim 0 by default. The class also needs to call the superclass constructor so it doesn’t raise an error. Anyway, thank you very much. Awesome help.

import torch
import torch.nn as nn


class FullModel(nn.Module):
    """Wraps a model and its loss so the loss is computed on each GPU."""

    def __init__(self, model, loss):
        super(FullModel, self).__init__()
        self.model = model
        self.loss = loss

    def forward(self, targets, *inputs):
        outputs = self.model(*inputs)
        loss = self.loss(outputs, targets)
        # unsqueeze the 0-dim loss so DataParallel can gather it along dim 0
        return torch.unsqueeze(loss, 0), outputs


def DataParallel_withLoss(model, loss, **kwargs):
    model = FullModel(model, loss)
    device_ids = kwargs.get('device_ids', None)
    output_device = kwargs.get('output_device', None)
    model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device)
    # optionally pin the wrapper to a specific default GPU
    if 'cuda' in kwargs:
        model = model.cuda(kwargs['cuda'])
    else:
        model = model.cuda()
    return model

I was trying this implementation and it works.
However, the imbalance keeps happening.

So, as far as I could check by printing parameter.device, the model weights are stored on the default GPU. In the run I did, we can roughly say each mini-batch requires ~5 GB and the model weights ~4 GB (8500 - 4500).
There is a gain from computing the loss in a distributed way.

Isn’t it possible to share the model weights among all the GPUs? Because in this example I can increase the batch size until GPU 0 uses 12 GB, but that would mean GPU 1 and GPU 2 would only be using 8 GB.

Even worse, if I were using Adam (you mentioned it keeps copies of all the model weights), memory usage on GPU 0 would be 12 GB while GPU 1 and GPU 2 would only be using 4 GB.

If this is not possible due to PyTorch requirements (I guess PyTorch requires all the tensors involved in an operation to be on the same GPU), is there a way of rebalancing the workload, i.e. assigning an unequal number of samples per GPU to reduce the default GPU’s memory requirements?

Toy example:
BS = 30
gpu0 -> 4 samples
gpu1 -> 13 samples
gpu2 -> 13 samples


I am by no means a specialist on DataParallel, and I have a CPU-only install so I couldn’t test the code.
The doc states explicitly that if the module returns a 0-dim tensor, you will get back a 1-D tensor whose size is the number of GPUs. Does that not work? Maybe it’s only supported in the latest version.
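Something like this is what I would expect to work on a recent version (an untested sketch, since I can’t run it here; it needs at least 2 GPUs to see the effect):

import torch
import torch.nn as nn

class LossOnly(nn.Module):
    def __init__(self):
        super(LossOnly, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x, y):
        # a 0-dim tensor; gather should stack one entry per GPU into a 1-D tensor
        return nn.functional.mse_loss(self.fc(x), y)

model = nn.DataParallel(LossOnly()).cuda()
x = torch.randn(8, 10).cuda()
y = torch.randn(8, 1).cuda()
loss_per_gpu = model(x, y)      # shape (num_gpus,) if the documented behaviour applies
loss_per_gpu.mean().backward()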

Sharing the weights among GPUs is not possible, as they need to be on each device to perform the computations.

I am not sure at the moment what creates this imbalance…
Maybe @smth knows? :slight_smile:

I am able to reach relatively balanced memory usage by incorporating the loss calculation into the forward pass. Thanks for the help.


Perfect!

That means the imbalance @JuanFMontesinos sees is caused by something in his own code, not by DataParallel itself.


Well, glad to hear that; then I will find a way to solve it ^^ Thanks @albanD for your help.
Just to rule things out, how many parameters does your net have, @soulless?
Because I’m using a very deep multimodal net.

I am using a CNN autoencoder with around 8 million trainable parameters. How many parameters do you have?

My guess, not confirmed yet, is that if we save/load a model that was saved directly from the GPU, the memory is not freed.

I mean, I’ve been working on several clusters, saving the model directly from the DataParallel module, and I’ve had all sorts of trouble loading pretrained weights from that DataParallel state dict. However, I found that saving the model prepared for the CPU, loading it again, and then moving it to CUDA uses less memory.
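Something like this is what I mean (a rough sketch; the checkpoint file name is just an example and MyNet stands in for your architecture):

import torch

# save: take the raw module out of DataParallel and keep a CPU copy of the weights
state = {k: v.cpu() for k, v in model.module.state_dict().items()}
torch.save(state, 'checkpoint_cpu.pth')

# load: build a fresh model on the CPU, load the weights, then move it to CUDA
net = MyNet()
net.load_state_dict(torch.load('checkpoint_cpu.pth', map_location='cpu'))
net = torch.nn.DataParallel(net).cuda()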

Do you mean it would save memory to move the DataParallel model back to a CPU model before saving it?

Dear soulless,

Do you by any chance still have the script around that you made this work with?

I am trying to do what albanD suggested above, but I just can’t get it to work. It would be really nice to have an example to look at.

Regards,
Hjules

You should be able to use the one I posted above, maybe with minor modifications. It works, at least in PyTorch 0.4.0.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import torch
import torch.nn as nn


class FullModel(nn.Module):
    """Wraps a model and its loss so the loss is computed on each GPU."""

    def __init__(self, model, loss):
        super(FullModel, self).__init__()
        self.model = model
        self.loss = loss

    def forward(self, targets, *inputs):
        outputs = self.model(*inputs)
        loss = self.loss(outputs, targets)
        # unsqueeze the 0-dim loss so DataParallel can gather it along dim 0
        return torch.unsqueeze(loss, 0), outputs


def DataParallel_withLoss(model, loss, **kwargs):
    model = FullModel(model, loss)
    device_ids = kwargs.get('device_ids', None)
    output_device = kwargs.get('output_device', None)
    model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device)
    # optionally pin the wrapper to a specific default GPU
    if 'cuda' in kwargs:
        model = model.cuda(kwargs['cuda'])
    else:
        model = model.cuda()
    return model


class toy(nn.Module):
    """Tiny 1-channel -> 3-channel conv net, just to exercise the wrapper."""

    def __init__(self):
        super(toy, self).__init__()
        self.conv2d = torch.nn.Conv2d(1, 3, 1)

    def forward(self, x):
        return self.conv2d(x)


model = toy()
optimizer = torch.optim.SGD(model.parameters(), lr=1)
loss = torch.nn.L1Loss()
model = DataParallel_withLoss(model, loss)

gt = torch.rand(2, 3, 10, 10)      # targets
input = torch.rand(2, 1, 10, 10)   # inputs
loss, _ = model(gt, input)         # one loss entry per GPU
loss = loss.sum()                  # reduce the per-GPU losses to a scalar
optimizer.zero_grad()
loss.backward()
optimizer.step()

toy example


The toy example is nice and concise.
Following your example, I reproduced it in my code. However, there is a weird bug. Do you know what causes it?

Traceback (most recent call last):
  File "train.py", line 363, in <module>
    main(args)
  File "train.py", line 97, in main
    train(args, trainer, task, epoch_itr)
  File "train.py", line 135, in train
    log_output = trainer.train_step(sample, update_params=True)
  File "/data/mmyin/tf-datapallelism/fairseq/trainer.py", line 120, in train_step
    loss, sample_size, logging_output, oom_fwd = self._forward(sample)
  File "/data/mmyin/tf-datapallelism/fairseq/trainer.py", line 212, in _forward
    loss, sample_size, logging_output_ = self.full_model(sample)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
    return self.gather(outputs, self.output_device)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
 File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Well, that error means that zip requires iterable inputs, such as lists or tuples, so it is being given something that is not iterable. What exactly? I don’t know, since I don’t have your implementation.
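Just a guess: DataParallel’s gather recursively zips whatever each replica returns, so every element you return from forward has to be a tensor (or a tuple/list/dict of tensors). Returning a plain Python number reproduces exactly that traceback. A small sketch of the difference (Bad/Good are made-up names; it needs at least 2 GPUs to trigger):

import torch
import torch.nn as nn

class Bad(nn.Module):
    def forward(self, x):
        return x.sum().unsqueeze(0), 3.14   # the plain float cannot be gathered

class Good(nn.Module):
    def forward(self, x):
        # wrap plain numbers in tensors so gather can stack them per GPU
        return x.sum().unsqueeze(0), torch.tensor([3.14], device=x.device)

model = nn.DataParallel(Good()).cuda()      # with Bad() this raises the TypeError above
out, extra = model(torch.randn(8, 4).cuda())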

Thanks for your advice.
I have a question about the final loss, because I found that the final loss is not the sum over the batch, but a list containing the mini-batch losses.
For example, if a big batch is divided into four parts to be computed on 4 GPUs, loss, _ = model(gt, input) will return a list containing four partial losses.
Is that correct, and do I need to sum them up manually?

Yes, sorry, you should sum them before calling backward, as shown below.
I rewrote the toy example with a 2D conv.
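Using the names from the toy example above, that is:

loss, _ = model(gt, input)   # 1-D tensor with one entry per GPU
loss = loss.sum()            # or .mean(), depending on how your loss is normalized
loss.backward()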

We had the same issue, in that we could only train with a much smaller batch size when parallelizing.
Using DistributedDataParallel for both the model and the loss got us much better results. You have to use DistributedSampler and init_process_group, but it’s all in this example: https://github.com/pytorch/examples/blob/master/imagenet/main.py
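Roughly, the skeleton we followed based on that example looks like this (a simplified sketch, not our actual code; the tiny Linear model and random TensorDataset are stand-ins, and it assumes one process per GPU on a single node launched with something like torch.distributed.launch):

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT come from the launcher
dist.init_process_group(backend='nccl', init_method='env://')
local_rank = dist.get_rank() % torch.cuda.device_count()   # single-node assumption
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 1).cuda(local_rank)                  # stand-in for the real model
model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))  # stand-in data
sampler = DistributedSampler(dataset)                      # each process gets its own shard
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

criterion = nn.MSELoss().cuda(local_rank)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2):
    sampler.set_epoch(epoch)                               # reshuffle the shards each epoch
    for x, y in loader:
        x = x.cuda(local_rank, non_blocking=True)
        y = y.cuda(local_rank, non_blocking=True)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()                                    # gradients are all-reduced here
        optimizer.step()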

However, we have not seen massive improvements in speed, probably due to our slow dataloader/data transfer, as our input size is quite large…
With both methods, DistributedDataParallel and DataParallel, running on an AWS P3 with 8 GPUs barely improved over a single GPU (perhaps the variation in the time required for an epoch is reduced, but the average time is about the same). That doesn’t make much sense; has anyone seen the same problem?

Hi, I found your toy-code solution for the DataParallel problem. Your work is fantastic.
But when I imitated it in my own code, things went wrong:
it gave me RuntimeError: all tensors must be on devices[0].
Here is my Model_with_parallel:

class G_FullModel(nn.Module):
    def __init__(self, actor, discriminator, g_loss):
        super(G_FullModel, self).__init__()
        self.G = actor
        self.D = discriminator
        self.loss = g_loss

    def forward(self, targets, inputs, corpus):
        fake_reply, word_probabilities, hiddens = self.G.sample(inputs, targets, TF=0)
        num_samples = 3
        rewards = self.G.monte_carlo(self.D, inputs, fake_reply, hiddens, num_samples,
                                     corpus).detach()

        pg_loss = self.loss(rewards, word_probabilities)
        return torch.unsqueeze(pg_loss, 0), fake_reply

Here is how I used it:

G_model_parallel = G_FullModel(actor, discriminator, g_loss)
G_model_parallel = torch.nn.DataParallel(G_model_parallel, device_ids=[1, 2, 3],output_device=[1]).cuda()

context = context.to(device)
reply = reply.to(device)
loss, _ = G_model_parallel(reply, context, corpus)

Begging for your help!
I have tried to data-parallelize my model and my loss separately, and the code runs, but the GPU utilization on every GPU except device 1 is almost zero.
Thank you very much!!

I’ve had the same imbalance problem, due to a very complex regularization method in a CNN. The method in this post is kind of complicated and requires a lot of code changes. If you are also using a layer-wise regularization method, you can try gradient accumulation as a quick fix: compute the regularizer gradient of a single layer at a time, and let that layer’s graph be freed before computing the next one, roughly as sketched below.
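Roughly what I mean (a sketch only; the squared-L2 regularizer and its weight are just placeholders for the real method):

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 10), torch.randn(4, 1)
reg_weight = 1e-4

optimizer.zero_grad()
F.mse_loss(model(x), y).backward()        # gradients of the data loss

for layer in model:                       # accumulate regularizer gradients layer by layer
    params = list(layer.parameters())
    if not params:
        continue
    reg = reg_weight * sum(p.pow(2).sum() for p in params)
    reg.backward()                        # this layer's regularizer graph is freed right here
optimizer.step()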

I am having the same imbalance issue, but the problem is that my GPU 1, not GPU 0, is running out of memory. Both GPUs have 32 GB of memory. With nvidia-smi I see that GPU 0 is only using 6 GB of memory, whereas GPU 1 goes up to 32 GB.
I could have understood it if it were the other way around, with GPU 0 going out of memory, but this is weird.
I only pass my model to DataParallel, so it’s using the default values.
Also, if I use only 1 GPU, I don’t get any out-of-memory issues. This is also strange to me.
Any help would be appreciated.
P.S. I was getting a warning about RNN parameters not being in contiguous memory, so I added the flatten_parameters() call in the forward of the LSTM as well.

cudaID = str(torch.cuda.current_device())
device = torch.device("cuda:" + cudaID)
print('device = ', device)  # this prints cuda:0

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    encoder = torch.nn.DataParallel(encoder)
    lstm_model = torch.nn.DataParallel(lstm_model)


encoder.to(device)
lstm_model.to(device)

# forward method of the LSTM module (tn is torch.nn.utils.rnn)

def forward(self, inputs, mode='train'):
    packed = tn.pack_sequence(inputs, enforce_sorted=False)
    # pass packed's device so that packed and self.hidden end up on the same device;
    # self.hidden is recreated on every call since I'm using multiple GPUs
    self.hidden = self.init_hidden(len(inputs), packed.data.device)
    self.lstm.flatten_parameters()
    if mode == 'eval' or mode == 'test':
        with torch.no_grad():
            packed_out, self.hidden = self.lstm(packed, self.hidden)
    else:
        packed_out, self.hidden = self.lstm(packed, self.hidden)

    outputs, lens = tn.pad_packed_sequence(packed_out, batch_first=True)

    return outputs