DDP taking up too much memory on rank 0

Edit: Mistaken!
This was my issue:

Here’s a screenshot of distributed training in PyTorch when I call the train function like:
`CUDA_VISIBLE_DEVICES=1,2,3,4 python -m torch.distributed.launch --nproc_per_node=4 train_new.py`. You can see that rank 0 has also initialized three separate processes, one for each of the other GPUs. When I use 10 GPUs on a box this severely limits the batch size, since the rank 0 GPU has so much less free memory. What is it storing? I thought gradients in DDP were all-reduced. I’ve also tried setting broadcast_buffers=False, to no avail.
The model is stacked modules of 1D conv, ReLU, batch norm, and LSTM, followed by a large softmax layer and CTC loss. The backend is NCCL.
PyTorch 1.3.0, CUDA 10.1, Titan RTX, Ubuntu 18.04. I can provide more code upon request.
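For context, a typical per-process DDP setup looks roughly like the sketch below. This is illustrative, not the poster’s actual code; it uses the gloo backend and a CPU module so it runs anywhere, whereas with NCCL each process would also call `torch.cuda.set_device(local_rank)` before wrapping the model.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process sketch (world_size=1, gloo/CPU). With NCCL you would call
# torch.cuda.set_device(local_rank) here and pass device_ids=[local_rank].
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(10, 10))
out = model(torch.randn(2, 10))

dist.destroy_process_group()
```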


Any solutions?
The solution from the GitHub issue does not work for me.

Discussion here might be helpful.

This is likely due to some tensors/context being unintentionally created on the first GPU, e.g., when calling torch.cuda.empty_cache() without a device guard. Solutions would be either 1) carefully walking through libs/code to make sure no state leaks to cuda:0, or 2) setting CUDA_VISIBLE_DEVICES so that each process only sees one GPU. The second approach might be easier.
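The second approach can be sketched as below. This assumes the workers were started by `torch.distributed.launch`, which passes `--local_rank` to each process; the mapping from local rank to physical GPU is illustrative.

```python
import argparse
import os

# Sketch: restrict each worker to a single GPU *before* torch initializes
# CUDA. Inside the process, cuda:0 then refers to that worker's own physical
# GPU, so stray allocations cannot land on the shared rank-0 device.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args, _ = parser.parse_known_args()

os.environ["CUDA_VISIBLE_DEVICES"] = str(args.local_rank)
# ... only import torch / build the model after this point ...
```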

@PCerles I’m having a similar issue. Were you able to resolve your problem? Thanks.

@PCerles @Felix_Kreuk

What @mrshenli mentioned can easily happen when you load saved parameters without specifying map_location.
torch.load by default loads parameters to the device where they were saved, usually the rank 0 device.
load_state_dict then copies the loaded values from that device to the target device.
After this intermediate use, torch still holds that GPU memory as cached memory.
I had a similar issue and solved it by loading parameters directly onto the target device.

For example:

state_dict = torch.load(model_name, map_location=self.args.device)

Full code here.
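A slightly fuller sketch of this fix, assuming `torch.save`/`torch.load` checkpoints; `"cpu"` is used here so the snippet runs anywhere, whereas in DDP you would pass each rank’s own device, e.g. `f"cuda:{local_rank}"`:

```python
import torch

# Save a toy checkpoint (stands in for a real state_dict saved from cuda:0).
torch.save({"weight": torch.randn(3, 3)}, "ckpt.pt")

# Without map_location, tensors saved from cuda:0 are first materialized on
# cuda:0 on every rank, leaving cached memory behind on that GPU.
# Passing map_location restores them directly onto the target device.
state_dict = torch.load("ckpt.pt", map_location=torch.device("cpu"))
```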