DDP taking up too much memory on rank 0

PCerles · November 28, 2019, 4:28pm

Edit: Mistaken!
This was my issue:

github.com/pytorch/pytorch

DistributedDataParallel: resume training from a checkpoint results in additional processes on GPU 0

opened 01:23AM - 21 Jul 19 UTC

closed 03:00AM - 21 Jul 19 UTC

qchenclaire

oncall: distributed

Hi,

When I was trying the [imagenet example](https://github.com/pytorch/examp…les/tree/master/imagenet) with DistributedDataParallel, using single node with 4 gpus, I found that when I add `--resume /path/to/checkpoint` to the command, gpu 0 has additional processes like below, the pid of each process is exactly the same as running on every other gpu, and each process consumes ~725M. These additional processes keep lingering on gpu 0 instead of disappearing when loading finished. Is there any way to solve this? If there are 8 gpus, that means ~5G on gpu 0 are wasted.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     62250      C   /usr/bin/python3.6                          4903MiB |
|    0     62251      C   /usr/bin/python3.6                           725MiB |
|    0     62252      C   /usr/bin/python3.6                           725MiB |
|    0     62253      C   /usr/bin/python3.6                           725MiB |
|    1     62251      C   /usr/bin/python3.6                          4693MiB |
|    2     62252      C   /usr/bin/python3.6                          4693MiB |
|    3     62253      C   /usr/bin/python3.6                          4693MiB |

If I train from scratch, there is no such issue.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2080      C   /usr/bin/python3.6                          4687MiB |
|    1      2081      C   /usr/bin/python3.6                          4687MiB |
|    2      2082      C   /usr/bin/python3.6                          4687MiB |
|    3      2083      C   /usr/bin/python3.6                          4687MiB |

Here’s a screenshot of distributed training in PyTorch when I launch the training script like this:
CUDA_VISIBLE_DEVICES=1,2,3,4 python -m torch.distributed.launch --nproc_per_node=4 train_new.py
You can see that the first rank’s GPU also shows 3 separate processes, one for each of the other GPUs. When I use 10 GPUs on a box, this severely limits the batch size, since the rank 0 GPU has so much less free capacity. What is it storing? I thought gradients in DDP were all-reduced. I’ve also tried setting broadcast_buffers to False, to no avail.
The model is stacked modules of 1D conv, ReLU, batch norm, and LSTM, followed by a large softmax layer and CTC loss. The backend is NCCL.
PyTorch 1.3.0, CUDA 10.1, Titan RTX, Ubuntu 18.04. I can provide more code upon request.

songzw · June 20, 2020, 5:13am

Any solution?
The solution from the GitHub issue does not work for me.

mrshenli · June 22, 2020, 3:07pm

Discussion here might be helpful.

This is likely because some tensors or a CUDA context are unintentionally created on the first GPU, e.g., when calling torch.cuda.empty_cache() without a device guard. The solution is either 1) to carefully walk through the libraries and code to make sure no state leaks to cuda:0, or 2) to set CUDA_VISIBLE_DEVICES so that each process sees only one GPU. The second approach might be easier.
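The second approach could look roughly like the sketch below. It is only an illustration: the helper name restrict_to_one_gpu is hypothetical, and it assumes the script already knows its local rank (for example from the --local_rank argument that torch.distributed.launch passes to the script) and that the helper runs before anything initializes CUDA in the process.

```python
import os


def restrict_to_one_gpu(local_rank, environ=None):
    """Narrow CUDA_VISIBLE_DEVICES to the single GPU meant for this process.

    Hypothetical helper, not part of PyTorch. `local_rank` is this process's
    index on the node. `environ` defaults to os.environ; passing a plain dict
    makes the function easy to try out without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    visible = environ.get("CUDA_VISIBLE_DEVICES")
    if visible:
        # Respect an outer restriction such as CUDA_VISIBLE_DEVICES=1,2,3,4:
        # local rank i gets the i-th entry of that list.
        ids = [d.strip() for d in visible.split(",") if d.strip()]
        chosen = ids[local_rank]
    else:
        # No outer restriction: local rank i maps to physical GPU i.
        chosen = str(local_rank)
    environ["CUDA_VISIBLE_DEVICES"] = chosen
    return chosen


# Intended use at the very top of the training script, before any CUDA call:
#     restrict_to_one_gpu(local_rank)
# From then on the process sees exactly one GPU and addresses it as cuda:0.
```

With only one device visible, a stray allocation or an unguarded torch.cuda.empty_cache() can only land on the process's own GPU, so nothing can leak onto the rank 0 device. The trade-off is that every process then refers to its GPU as cuda:0, so any code that indexes devices by local rank has to be adjusted.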

Felix_Kreuk · September 15, 2020, 12:58pm

@PCerles I’m having a similar issue. Were you able to resolve your problem? Thanks.

seungjun · September 16, 2020, 5:47am

@PCerles @Felix_Kreuk

What @mrshenli mentioned can easily happen without you noticing when you load saved parameters without specifying map_location.
By default, torch.load restores parameters to the device they were saved from, usually the rank 0 device.
load_state_dict then copies the loaded values from that device to the target device.
After this intermediate use, torch still holds that GPU memory as cached memory.
I had a similar issue and solved it by loading the parameters directly onto the target device.

For example:

# Load straight onto this process's own device, not the device the checkpoint was saved from.
state_dict = torch.load(model_name, map_location=self.args.device)
self.load_state_dict(state_dict)

Full code here.