CUDA OOM while using torch.distributed.launch and no OOM training without it

Hello folks!

I’m stuck on a very strange problem. I’m working with the recently released Scene Graph Benchmark and got it to train on GQA, but I have one issue. Training works as expected when I use the following command (with batch_size = 2):

CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/ --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml"

But when I launch it through torch.distributed.launch instead, I get RuntimeError: CUDA out of memory, still with batch_size = 2:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=1 tools/ --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml"

Initially I wanted to train on 4 GPUs with batch_size = 8, but then I ran into this problem. What could be causing it? And what should I do to train properly on 4 GPUs?

My setup includes four 2080 Ti GPUs, so it has plenty of memory.

Hi Leon,

Which version of PyTorch are you using?

Hello Alexander,

1.6.0, the one that gets installed with conda install torch.

Could you try running the DDP command on a single node with a single GPU and check the memory usage?
My guess is that the code creates unnecessary CUDA contexts on other devices, but since the repository contains a lot of files, I haven’t looked through all of them.
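
In case it helps with the check above, here is a rough sketch of a script that looks for extra CUDA contexts by parsing nvidia-smi output. It is not taken from the repository. The helper names (parse_compute_apps, pids_on_multiple_gpus, report) are made up for illustration, and it assumes nvidia-smi is on the PATH and supports the gpu_uuid, pid and used_memory query fields.

```python
import subprocess
from collections import defaultdict

# Query one CSV row per (GPU, process) pair; memory is reported in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=gpu_uuid,pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Map each PID to the list of (gpu_uuid, used_MiB) rows it appears in."""
    per_pid = defaultdict(list)
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip lines that do not look like data rows
        gpu_uuid, pid, used = parts
        try:
            pid = int(pid)
        except ValueError:
            continue
        try:
            used_mib = int(used)
        except ValueError:
            used_mib = None  # the driver may report a non-numeric value
        per_pid[pid].append((gpu_uuid, used_mib))
    return dict(per_pid)


def pids_on_multiple_gpus(per_pid):
    """Keep only processes that hold memory on more than one GPU."""
    return {
        pid: rows
        for pid, rows in per_pid.items()
        if len({gpu for gpu, _ in rows}) > 1
    }


def report():
    """Run nvidia-smi and print processes that touch several GPUs."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    for pid, rows in pids_on_multiple_gpus(parse_compute_apps(out)).items():
        print(f"PID {pid} has contexts on {len(rows)} GPUs: {rows}")
```

Calling report() in a second terminal while the training job is running should show whether a single training process holds memory on more than one card. If it does, a common remedy (an assumption here, since I have not checked how this repository handles it) is to call torch.cuda.set_device with the process's local rank before any other CUDA work.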