DataParallel imbalanced memory usage


(Juan F Montesinos) #1

Hi there,
I’m going to re-edit the whole thread to describe an unexpected behavior with DataParallel.
There are several recent posts about this topic, and I would like to summarize the problem.



Right now it seems there is an imbalanced usage of GPUs when calling DataParallel.

From my experience and other users’ explanations I will explain why this happens:
Using DataParallel there are 4 types of data to consider:

  1. Input
  2. Output
  3. Ground-truth
  4. Optimizer Parameters

DataParallel has a main GPU, which is the GPU where the model is stored.
As @Yuzhou_Song indicates in another post, DataParallel splits the batch across the chosen GPUs, copies the model to each of them, computes the forward pass independently, and then gathers the outputs of every GPU back on one GPU to calculate the loss, instead of computing the loss independently on each GPU. This is the main cause of the imbalanced memory usage, since the ground truth and the output (the inputs to the loss) must be on the same GPU.

I discovered by myself that some optimizers require a lot of memory to store their parameters. All of these parameters are located on the main GPU already mentioned, which makes the problem worse.

There is one last reason. Model inputs are usually moved to the GPU with .cuda(), which usually points to the main GPU and adds to the imbalance.

There are some ways to minimize this:
DataParallel has 2 arguments: device_ids, which selects the GPUs the model will be trained on out of all the available ones (CUDA_VISIBLE_DEVICES), and output_device, which selects the GPU where the output will be stored.

Let’s imagine we have 3 devices, [0,1,2].
Calling model = DataParallel(model).cuda() sets device 0 as the main GPU, so the optimizer’s parameters will be stored there.
Calling model = DataParallel(model,output_device=1).cuda() and groundtruth.cuda(1) gathers all the outputs and computes the loss on cuda:1.
Lastly, you can allocate the inputs on cuda:2.
This way the memory usage is distributed as much as possible.
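
To make the effect of output_device concrete, here is a rough back-of-the-envelope sketch in plain Python. It is only an accounting model under simplifying assumptions, not a measurement: the function name estimate_gpu_memory and all the numbers (in MB) are hypothetical, gradients and framework overhead are ignored, and the batch is assumed to divide evenly across GPUs.

```python
def estimate_gpu_memory(n_gpus, weights, act_per_sample, out_per_sample,
                        optimizer_state, batch_size, main=0, output_device=0):
    """Rough per-GPU memory estimate (same unit as the inputs, e.g. MB).

    Simplified model of the explanation above:
      - every GPU holds a copy of the weights and the activations of its chunk,
      - the main GPU additionally holds the optimizer state,
      - the output device additionally holds the gathered outputs of the whole batch.
    """
    per_gpu = batch_size // n_gpus  # assumption: batch divides evenly
    usage = []
    for gpu in range(n_gpus):
        mem = weights + act_per_sample * per_gpu
        if gpu == main:
            mem += optimizer_state
        if gpu == output_device:
            mem += out_per_sample * batch_size
        usage.append(mem)
    return usage


# Hypothetical numbers, chosen only for illustration.
common = dict(n_gpus=3, weights=1000, act_per_sample=100, out_per_sample=20,
              optimizer_state=2000, batch_size=30)

default_layout = estimate_gpu_memory(**common)                 # everything on GPU 0
moved_output = estimate_gpu_memory(**common, output_device=1)  # outputs gathered on GPU 1

print(default_layout)  # [4600, 2000, 2000]
print(moved_output)    # [4000, 2600, 2000]
```

In this toy model, moving the gathered outputs only shifts part of the surplus to another GPU; the optimizer state still sits on the main device.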

Is there a way of solving this problem? @smth @ptrblck @albanD @soulless
I guess this behavior is very inconvenient.


#2

I am having a similar issue. Could this be because the loss calculation is not done in the forward function?


(Juan F Montesinos) #3

Well, the post you cited is right: in the end PyTorch gathers the whole batch output on one GPU to calculate the loss. That is what generates the imbalanced usage. At the same time, the optimizer’s parameters are stored on the default GPU, which makes the problem worse. There is no apparent solution, since DataParallel lets you choose the output GPU but the problem remains. There are several open threads asking about the same issue. I tried playing with the DataParallel arguments device_ids and output_device, but it did not help.


(Alban D) #4

If you want the loss to be split among GPUs, just make your loss layer part of the DataParallel module and add a sum or mean operation on what you get out of it. That way, if you use DataParallel on 4 devices, only 4 extra numbers will be allocated on the output_device. Is that a good solution for you?


(Juan F Montesinos) #5

Thanks, it seems a good solution.
If I call loss=DataParallel(loss).cuda(), would I get an array of losses?

Apart from that, is it possible to split the optimizer among all GPUs?


(Alban D) #6

You would need to have the following:

class FullModel(nn.Module):
  def __init__(self, model, loss):
    self.model = model
    self.loss = loss

  def forward(self, inputs, targets):
    outputs = self.model(inputs)
    loss = self.loss(outputs, targets)
    return loss  # each replica returns the loss of its own chunk of the batch

full_model = DataParallel(FullModel(model, loss), device_ids=[0, 1, 2], ***)
full_model.cuda()

loss = full_model(fake_input, fake_target)
print(loss.size()) # returns [3] (if your loss returns a 0-dimensional tensor containing the loss value)
final_loss = loss.sum()

For the optimizer, that would depend on the optimizer I guess.
For Adam, for example, it needs to store one (if I remember correctly?) extra copy of all the weights, and it needs this copy to update the weights. That means that if all the gradients are accumulated on a single GPU, then this state should be there as well.
That being said, the size of the weights of your network should not be a very large part of the memory consumption for classic nets. Intermediate states are the most demanding. So hopefully this extra memory, as big as the set of weights, should not be too big a problem.
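
As a rough check on the size of that optimizer state, here is a small arithmetic sketch in plain Python. For reference, torch.optim.Adam keeps two running-average tensors per parameter (exp_avg and exp_avg_sq), so its state is closer to two extra copies of the weights than one. The helper name training_state_mb and the 8-million-parameter figure are only illustrative assumptions, and float32 storage (4 bytes per value) is assumed.

```python
def training_state_mb(n_params, bytes_per_param=4, optimizer_slots=2):
    """Estimate memory (in MB, 1 MB = 1e6 bytes) of weights, gradients and optimizer state.

    optimizer_slots is the number of per-parameter buffers the optimizer keeps:
    2 for Adam (exp_avg, exp_avg_sq), 1 for SGD with momentum, 0 for plain SGD.
    """
    mb = n_params * bytes_per_param / 1e6
    return {
        "weights": mb,
        "gradients": mb,
        "optimizer_state": optimizer_slots * mb,
    }


# Illustrative network size (an assumption, not a measurement).
budget = training_state_mb(8_000_000)
print(budget)  # {'weights': 32.0, 'gradients': 32.0, 'optimizer_state': 64.0}
```

For a network of this size the optimizer state is tens of MB, which supports the point that intermediate activations usually dominate; for very large nets the same arithmetic gives GB-scale state on the main GPU.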


(Yuzhou Song) #7

I tried putting the loss inside forward() before. It did mitigate the imbalance but did not fully resolve it. BTW, I wonder if PyTorch has plans to support storing model parameters on the CPU instead of on one main GPU? I’m training on production-scale data, which is very large. If 99% of the GPU memory is used after loading the model, there is little room left to load data, which defeats the scalability of multi-GPU training.


(Alban D) #8

Well you need to store model parameters on the gpu to be able to do the forward.
The backward computes gradients on the gpu so they will be there.
Not sure what you mean by “store model parameters in CPU instead of one main GPU” ?


(Yuzhou Song) #9

According to my understanding, one or more CPUs could serve as a parameter server storing the model parameters. In the forward step, each GPU could load a batch of data and part of the model parameters (let’s say the first 10 layers) and do the calculation. Then it would discard the first 10 layers, load layers 11-20, and do the calculation again. We repeat this until we get the final output of forward(). The benefit is that the model size is no longer limited by GPU memory. I guess typical distributed training frameworks are implemented this way, right?

If my understanding is incorrect, what’s the best way to train a model whose parameter size exceeds GPU memory?


(Alban D) #10

Hi,

You could do that, but transferring a few GB of data many times between the CPU and GPU will be quite slow.
But in practice, as I said above, what is going to take a lot of space is the intermediate results, and that is going to be even worse.

That being said, if your problem is that your net is too big, even for a batch size of 1, then you don’t want data parallelism but model parallelism.
This can be done very easily by adding some .cuda(1) then .cuda(2)… during your forward pass, and sending parts of your net to different GPUs according to this split.
There is no built-in method to do this, as it needs to be done in your forward method directly and you need to ensure that the right part of the model is moved to the right GPU.
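
A minimal sketch of that kind of manual model parallelism could look like the following. The class name TwoStageNet, the layer sizes, and the two-way split are hypothetical, and .to(device) is used instead of .cuda(1) only so that the sketch falls back to CPU when fewer than two GPUs are present.

```python
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise fall back to CPU so the sketch still runs.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")


class TwoStageNet(nn.Module):
    """Toy network whose two halves live on different devices."""

    def __init__(self):
        super().__init__()
        # Each part of the net is placed on its own device once, at construction.
        self.stage1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to(dev0)
        self.stage2 = nn.Linear(64, 10).to(dev1)

    def forward(self, x):
        # Move the activations to wherever the next part of the net lives.
        h = self.stage1(x.to(dev0))
        return self.stage2(h.to(dev1))


model = TwoStageNet()
x = torch.randn(8, 32)
targets = torch.randint(0, 10, (8,), device=dev1)  # targets go where the output is

out = model(x)
loss = nn.functional.cross_entropy(out, targets)
loss.backward()  # autograd follows the device transfers back through both stages
```

Only one stage is busy at a time in this simple form, so it trades speed for the ability to fit a larger model.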


(Yuzhou Song) #11

That makes sense to me! It would be great if you could make a post/blog/sample code about what we are discussing here. It would be very helpful!

Thanks a lot,
Best,
Yuzhou


(Juan F Montesinos) #12

Hi there @albanD, @Yuzhou_Song
I noticed there is a small mistake in the code you provided:
It’s necessary to unsqueeze the loss inside the forward pass so that DataParallel is able to gather the losses back. The loss returned by PyTorch loss functions seems to have no dimensions, and DataParallel concatenates the batch back along dim 0 by default. It is also necessary to call the superclass constructor to avoid an error. Anyway, thank you very much. Awesome help.

import torch
import torch.nn as nn

class FullModel(nn.Module):
  def __init__(self, model, loss):
    super(FullModel, self).__init__()
    self.model = model
    self.loss = loss

  def forward(self, targets, *inputs):
    outputs = self.model(*inputs)
    loss = self.loss(outputs, targets)
    # Add a leading dimension so DataParallel can concatenate the per-GPU losses along dim 0.
    return torch.unsqueeze(loss, 0), outputs


def DataParallel_withLoss(model, loss, **kwargs):
    # Wrap model and loss together, then replicate the pair across GPUs.
    model = FullModel(model, loss)
    device_ids = kwargs.get('device_ids')        # None: use all visible GPUs
    output_device = kwargs.get('output_device')  # None: use device_ids[0]
    cuda_id = kwargs.get('cuda')                 # None: use the current CUDA device
    model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device)
    return model.cuda(cuda_id)
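
For completeness, a minimal sketch of one training step with a wrapper of this kind is shown below. The toy nn.Linear model, the tensor shapes, and the SGD settings are hypothetical stand-ins, the wrapper class is repeated in compact form so the snippet is self-contained, and the device falls back to CPU (where DataParallel simply calls the module) when no GPU is present.

```python
import torch
import torch.nn as nn


class FullModel(nn.Module):
    """Wraps a model and its loss so that each replica computes its own loss."""

    def __init__(self, model, loss):
        super().__init__()
        self.model = model
        self.loss = loss

    def forward(self, targets, *inputs):
        outputs = self.model(*inputs)
        loss = self.loss(outputs, targets)
        return torch.unsqueeze(loss, 0), outputs


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

net = nn.Linear(16, 4)             # hypothetical toy model
criterion = nn.CrossEntropyLoss()
parallel = nn.DataParallel(FullModel(net, criterion)).to(device)
optimizer = torch.optim.SGD(parallel.parameters(), lr=0.1)

inputs = torch.randn(12, 16, device=device)
targets = torch.randint(0, 4, (12,), device=device)

# One entry per replica comes back; only these few numbers are gathered for the loss.
per_replica_loss, outputs = parallel(targets, inputs)
loss = per_replica_loss.mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that the mean of per-replica means equals the global batch mean only when every replica receives the same number of samples. The outputs are still gathered on the output device here because forward returns them; dropping them from the return value would avoid that transfer.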

I tried this implementation and it works.
However, the imbalance keeps happening.

So, as far as I could check by printing parameter.device, the model weights are stored on the default GPU. In the case I ran, we can roughly say each mini-batch requires ~5 GB and the model weights ~4 GB (8500-4500).
There is a gain from computing the loss in a distributed way.

Is it not possible to share the model weights among all GPUs? Because in this example I can increase the batch size until the GPU uses 12 GB, but that would mean gpu1 and gpu2 would be using only 8.

Even worse, if I were using Adam (you mentioned it copies all the model weights), memory usage on GPU0 would be 12 GB while gpu1 and gpu2 would be using 4.

If this is not possible due to PyTorch requirements (I guess PyTorch requires all tensors to be on the same GPU to be able to operate on them),
is there a way of weighting the workload, that is, assigning an uneven number of samples per GPU to reduce the memory requirements of the default GPU?

Toy example
BS=30
gpu0 -> 4
gpu1 -> 13
gpu2 -> 13
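
The sizes in this toy example can be computed with a small helper like the one below. This is only a sketch of the bookkeeping: the function weighted_split and the weights (3, 10, 10) are hypothetical, and nn.DataParallel itself splits the batch into equal chunks, so using such sizes would require a custom scatter step.

```python
def weighted_split(batch_size, weights):
    """Split batch_size into integer chunk sizes proportional to weights.

    Uses largest-remainder rounding so the chunk sizes always sum to batch_size.
    """
    total = sum(weights)
    exact = [batch_size * w / total for w in weights]
    sizes = [int(e) for e in exact]          # round everything down first
    leftover = batch_size - sum(sizes)       # samples still unassigned
    # Hand the leftover samples to the chunks with the largest fractional part.
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


# Give the default GPU a smaller share because it also holds extra state.
sizes = weighted_split(30, [3, 10, 10])
print(sizes)  # [4, 13, 13]
```

A list of sizes like this can be passed to torch.split to cut a batch tensor into uneven chunks along dim 0.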


(Alban D) #13

I am by no mean a specialist on DataParallel and I have a cpu-only install so I couldn’t test the code.
The doc states explicitly that if the module returns a 0dim tensor, then you will get a 1D tensor with size the number of gpus. That does not work? Maybe that’s only supported in the latest version.

Sharing the weights among gpus is not possible as they need to be on the device to perform computations.

I am not sure at the moment what creates this inbalance…
Maybe @smth knows? :slight_smile:


#14

I am able to reach a relatively balanced memory usage by incorporating the loss calculation in the forward pass. Thanks for the help


(Alban D) #15

Perfect !

That means that the imbalance that @JuanFMontesinos sees is due to something in his code, not a problem with DataParallel.


(Juan F Montesinos) #16

Well, glad to hear that, then I will find a way to solve it ^^ Thanks @albanD for your help.
Just to rule it out, how many parameters does your net have, @soulless?
Because I’m using a very deep multimodal net.


#17

I am using a CNN autoencoder with around 8 million trainable parameters. How many parameters do you have?