I had tried putting the loss inside forward() before. It mitigated the imbalance but did not fully resolve it. BTW, I wonder whether PyTorch has plans to support storing model parameters on the CPU instead of on one main GPU? I’m training on production-scale data, which is very large. If 99% of the GPU memory is already used after loading the model, there is little room left for the data, which defeats the scalability of multi-GPU training.
Well, you need to store the model parameters on the GPU to be able to do the forward pass.
The backward pass computes the gradients on the GPU, so they will be there as well.
Not sure what you mean by “store model parameters in CPU instead of one main GPU”?
According to my understanding, one or more CPUs could serve as a parameter server that stores the model parameters. In the forward step, each GPU could load a batch of data and part of the model parameters (let’s say those of the first 10 layers) and do the calculation. Then it would discard the first 10 layers’ parameters, load those of layers 11-20, and do the calculation again. We repeat this until we get the final output of forward(). The benefit is that model size is no longer limited by GPU memory. I guess a typical distributed training framework is implemented this way, right?
If my understanding is incorrect, what’s the best way to train a model whose parameter size exceeds GPU memory?
You could do that, but transferring a few GB of data many times between the CPU and the GPU will be quite slow.
But in practice, as I said above, what is going to take a lot of space are the intermediate results, and for those it would be even worse.
That being said, if your problem is that your net is too big even for a batch size of 1, then you don’t want data parallelism but model parallelism.
This can be done very easily by adding calls such as .cuda(2) during your forward pass and sending each part of your net to a different GPU according to this split.
There is no built-in method to do this, as it needs to be done directly in your forward method, and you need to ensure that the right part of the model is moved to the right GPU.
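A rough sketch of the split described above, in case it helps. The class name `TwoDeviceNet`, the helper `pick_devices`, and the layer sizes are made up for illustration, and the fallback to CPU is only there so the snippet runs on a machine without two GPUs:

```python
import torch
import torch.nn as nn


def pick_devices():
    # Assumption: use two GPUs when present, otherwise fall back to the CPU
    # so that the sketch still runs on any machine.
    if torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")


class TwoDeviceNet(nn.Module):
    """Hypothetical net whose two halves live on different devices."""

    def __init__(self, dev_a, dev_b):
        super().__init__()
        self.dev_a, self.dev_b = dev_a, dev_b
        self.first_half = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(dev_a)
        self.second_half = nn.Linear(32, 4).to(dev_b)

    def forward(self, x):
        h = self.first_half(x.to(self.dev_a))
        # Only the activation crosses devices; the weights stay where they are.
        return self.second_half(h.to(self.dev_b))


dev_a, dev_b = pick_devices()
net = TwoDeviceNet(dev_a, dev_b)
out = net(torch.rand(8, 16))
out.sum().backward()  # autograd routes gradients back across the devices
```

Each GPU then only holds the parameters and activations of its own part, at the cost of one activation transfer per boundary.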
That makes sense to me! It would be great if you guys could write a post, blog entry, or some sample code about what we are discussing here. It would be very helpful!
Thanks a lot,
Hi there @albanD, @Yuzhou_Song
I noticed there is a small mistake in the code you provided:
The loss has to be unsqueezed inside the forward pass so that DataParallel is able to gather it back. The loss returned by PyTorch loss functions seems to have no dimensions, and DataParallel gathers the batch along dim 0 by default. The wrapper class also has to call the superclass constructor, otherwise an error is raised. Anyway, thank you very much. Awesome help.
```python
class FullModel(nn.Module):
    def __init__(self, model, loss):
        super(FullModel, self).__init__()
        self.model = model
        self.loss = loss

    def forward(self, targets, *inputs):
        outputs = self.model(*inputs)
        loss = self.loss(outputs, targets)
        return torch.unsqueeze(loss, 0), outputs


def DataParallel_withLoss(model, loss, **kwargs):
    model = FullModel(model, loss)
    if 'device_ids' in kwargs.keys():
        device_ids = kwargs['device_ids']
    else:
        device_ids = None
    if 'output_device' in kwargs.keys():
        output_device = kwargs['output_device']
    else:
        output_device = None
    if 'cuda' in kwargs.keys():
        cudaID = kwargs['cuda']
        model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).cuda(cudaID)
    else:
        model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).cuda()
    return model
```
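To see why the unsqueeze matters, here is a small CPU-only illustration. It does not call DataParallel; it only mimics a gather along dim 0 with `torch.cat`, and the loss values are made up. Versions of DataParallel that handle 0-dim outputs themselves would not need this:

```python
import torch

# Stand-ins for the scalar losses produced by 4 replicas (made-up values).
per_replica_losses = [torch.tensor(0.5), torch.tensor(0.7),
                      torch.tensor(0.6), torch.tensor(0.4)]

try:
    torch.cat(per_replica_losses, dim=0)
    cat_of_scalars_failed = False
except RuntimeError:
    # 0-dim tensors have no dim 0 to concatenate along.
    cat_of_scalars_failed = True

# After unsqueezing, each loss has shape (1,) and they concatenate fine.
gathered = torch.cat([torch.unsqueeze(l, 0) for l in per_replica_losses], dim=0)
```

The result `gathered` holds one entry per replica, which is the shape the wrapper's caller ends up with.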
I tried this implementation and it works.
However, the imbalance keeps happening.
As far as I could check by printing parameter.device, the model weights are stored on the default GPU. In the case I ran, we can roughly say each mini-batch requires ~5 GB and the model weights ~4 GB (8500-4500).
There is a gain in computing the loss in a distributed way.
Is it not possible to share the model weights among all GPUs? In this example I can increase the batch size until the default GPU uses 12 GB, but that would mean gpu1 and gpu2 would be using only 8.
In an even worse case, if I were using Adam (you mentioned it copies all the model weights), memory usage on GPU0 would be 12 GB while gpu1 and gpu2 would be using 4.
If this is not possible due to PyTorch requirements (I guess PyTorch requires all tensors to be on the same GPU to be able to operate on them), is there a way of weighting the workload, that is, assigning an unequal number of samples per GPU, to reduce the memory requirements of the default GPU?
I am by no means a specialist on DataParallel, and I have a CPU-only install, so I couldn’t test the code.
The doc states explicitly that if the module returns a 0-dim tensor, then you will get a 1D tensor whose size is the number of GPUs. Does that not work? Maybe that’s only supported in the latest version.
Sharing the weights among GPUs is not possible, as they need to be on the device to perform computations.
I am not sure at the moment what creates this imbalance…
Maybe @smth knows?
I am able to reach a relatively balanced memory usage by incorporating the loss calculation in the forward pass. Thanks for the help
That means that the imbalance that @JuanFMontesinos sees is due to something in his code, not a problem with DataParallel.
I am using a CNN autoencoder with around 8 million trainable parameters. How many parameters do you have?
My guess, not confirmed yet, is that if we save and load a model that was stored directly from the GPU, the memory is not freed.
I mean, I’ve been working on several clusters and saving the model directly from the DataParallel module. I’ve had all these troubles loading pretrained weights from this DataParallel state dict. However, I found that saving the model prepared for the CPU, loading the model again, and then moving it to CUDA uses less memory.
Do you mean it would save more memory to move the DataParallel model back to a CPU model before saving it?
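For reference, a minimal sketch of what "saving the model prepared for the CPU" could look like. The `nn.Linear` model and the in-memory buffer are placeholders, and whether this really lowers GPU memory use is the poster's unconfirmed observation, not something this snippet demonstrates:

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(4, 2)            # placeholder for the real network
parallel = nn.DataParallel(model)  # on a CPU-only machine this just wraps the module

# Save the inner module's weights, copied to the CPU, so the checkpoint
# carries neither the "module." key prefix nor a GPU device tag.
cpu_state = {k: v.detach().cpu() for k, v in parallel.module.state_dict().items()}
buffer = io.BytesIO()  # stand-in for a checkpoint file on disk
torch.save(cpu_state, buffer)

# Load on the CPU first, then move to the GPU if there is one.
buffer.seek(0)
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
if torch.cuda.is_available():
    restored = restored.cuda()
```

Saving `parallel.module.state_dict()` rather than `parallel.state_dict()` also means the weights load into a plain model without renaming keys.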
Do you by any chance still have the script around for which you made this work?
I am trying to do what albanD suggested above, but it just won’t work. It’d be really nice to have an example to look at.
You should be able to use, maybe with minor modifications, the one I posted above. It works, at least in 0.4.0
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import torch
import torch.nn as nn


class FullModel(nn.Module):
    def __init__(self, model, loss):
        super(FullModel, self).__init__()
        self.model = model
        self.loss = loss

    def forward(self, targets, *inputs):
        outputs = self.model(*inputs)
        loss = self.loss(outputs, targets)
        return torch.unsqueeze(loss, 0), outputs


def DataParallel_withLoss(model, loss, **kwargs):
    model = FullModel(model, loss)
    if 'device_ids' in kwargs.keys():
        device_ids = kwargs['device_ids']
    else:
        device_ids = None
    if 'output_device' in kwargs.keys():
        output_device = kwargs['output_device']
    else:
        output_device = None
    if 'cuda' in kwargs.keys():
        cudaID = kwargs['cuda']
        model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).cuda(cudaID)
    else:
        model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).cuda()
    return model


class toy(nn.Module):
    def __init__(self):
        super(toy, self).__init__()
        self.conv2d = torch.nn.Conv2d(1, 3, 1)

    def forward(self, x):
        return self.conv2d(x)


model = toy()
optimizer = torch.optim.SGD(model.parameters(), lr=1)
loss = torch.nn.L1Loss()
model = DataParallel_withLoss(model, loss)

gt = torch.rand(2, 3, 10, 10)
input = torch.rand(2, 1, 10, 10)
loss, _ = model(gt, input)
loss = loss.sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
The toy example is kept short for brevity.
Following your example, I reproduced it in my code. However, there is a weird bug. Do you know what causes this?
```
Traceback (most recent call last):
  File "train.py", line 363, in <module>
    main(args)
  File "train.py", line 97, in main
    train(args, trainer, task, epoch_itr)
  File "train.py", line 135, in train
    log_output = trainer.train_step(sample, update_params=True)
  File "/data/mmyin/tf-datapallelism/fairseq/trainer.py", line 120, in train_step
    loss, sample_size, logging_output, oom_fwd = self._forward(sample)
  File "/data/mmyin/tf-datapallelism/fairseq/trainer.py", line 212, in _forward
    oss, sample_size, logging_output_ = self.full_model(sample)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
    return self.gather(outputs, self.output_device)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration
```
Well, that error means that zip requires iterable inputs, such as lists or tuples, so it is being given something that is not iterable. What exactly? I don’t know, since I don’t have your implementation.
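One possible cause, and this is only a guess since the poster's model code isn't shown: the traceback line unpacks `sample_size` and a logging output from the wrapped model, and if one of those is a plain Python number, the gather step ends up calling zip on numbers. The snippet below reproduces that TypeError with plain `zip` and shows a hypothetical helper, `tensorize_outputs`, that returns tensors instead:

```python
import torch


def tensorize_outputs(loss, sample_size, device="cpu"):
    """Hypothetical helper: make every returned value a tensor with a dim 0."""
    loss = loss.reshape(1)
    sample_size = torch.tensor([sample_size], dtype=torch.float32, device=device)
    return loss, sample_size


try:
    # Mimics gathering a plain int returned by two replicas.
    list(zip(*[3, 5]))
    zip_failed = False
except TypeError:
    zip_failed = True

# Made-up values standing in for one replica's outputs.
loss, sample_size = tensorize_outputs(torch.tensor(0.25), 3)
```

If this is indeed the cause, returning only tensors (and nested tuples or lists of tensors) from forward() should make the gather succeed.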
Thanks for your advice.
I have a question about the final loss. I found that the final loss is not the sum over the batch but a list containing the loss of each mini-batch.
For example, when a big batch is divided into four parts for computing on 4 GPUs, loss,_ = model(gt,input) returns a list containing four partial losses.
Is that correct, and do I need to sum them up manually?
Yeah, sorry, you should sum them before calling backward on the loss.
I rewrote the toy example with a 2D conv.
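A small CPU-only sketch of that reduction step. The per-GPU losses here are simulated with made-up scale factors, not produced by DataParallel. It uses mean() rather than sum(): if each partial loss is already an average over its chunk, summing multiplies the gradient scale by the number of GPUs, while averaging keeps it comparable to a single-GPU run with equal chunks. Either works as long as the learning rate matches.

```python
import torch

# Stand-in for what the DataParallel wrapper returns: one loss per GPU.
weights = torch.ones(3, requires_grad=True)
per_gpu_losses = torch.stack([(weights * s).sum() for s in (1.0, 2.0, 3.0, 4.0)])

# Reduce the 1D tensor of partial losses to a scalar before backward().
loss = per_gpu_losses.mean()
loss.backward()
```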
We had the same issue, in that we could only train with a much smaller batch size when parallelizing.
Using DistributedDataParallel in both model and loss got us much better results. You have to use DistributedSampler and init_process_group, but it’s all in this example: https://github.com/pytorch/examples/blob/master/imagenet/main.py
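To give an idea of what DistributedSampler does, here is a CPU-only sketch that builds the sampler for each rank by hand. Passing `num_replicas` and `rank` explicitly is an assumption made so the snippet runs without a process group; in real training each process creates one sampler after init_process_group and the values come from there. The dataset and `world_size` are made up.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8).float().unsqueeze(1))
world_size = 4  # assumed number of processes, one per GPU

shards = []
for rank in range(world_size):
    # shuffle=False keeps the index assignment easy to read.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
    shards.append(list(sampler))
```

Each rank sees a disjoint slice of the indices, so every process only loads and holds its own share of the batch instead of one GPU scattering everything.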
However, we have not seen massive improvements in speed, probably due to our slow dataloader/data transfer as our input size is quite large…
Both methods, DistributedDataParallel and DataParallel, running on an AWS P3 with 8 GPUs barely improved at all compared to a single GPU (perhaps the variation in the time required for an epoch is reduced, but the average time is about the same). That doesn’t make much sense. Has anyone seen the same problem?