CUDA OOM when using torch.distributed.launch, but no OOM when training without it

Hello folks!

I’m stuck on a very strange problem: I’m working with the recently released Scene Graph Benchmark and got it to train on GQA, but there is one issue. It trains as expected when I use the following command (with batch_size = 2):

CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/relation_train_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml"

But when I launch it through torch.distributed.launch (still with batch_size = 2):

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=1 tools/relation_train_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml"

I get RuntimeError: CUDA out of memory. My original plan was to train on 4 GPUs with batch_size = 8, but I ran into this problem first. What could be causing it, and what should I do to train it properly on 4 GPUs?
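In case it helps with debugging, here is a minimal sketch (my own helper, not part of the repo) that prints the memory usage on every visible GPU; calling it from the training script in both runs should show where the extra memory goes:

import torch

def log_gpu_memory(tag=""):
    # Print allocated/reserved memory for every GPU this process can see.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024 ** 2
        reserved = torch.cuda.memory_reserved(i) / 1024 ** 2
        print(f"[{tag}] cuda:{i} allocated={alloc:.0f} MiB reserved={reserved:.0f} MiB")

log_gpu_memory("after model init")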

My setup includes four 2080 Ti cards, so there is plenty of memory.

Hi Leon,

Which PyTorch version are you using?

Hello Alexander,

1.6.0, the one that installs with conda install pytorch.

Could you try running the DDP command on a single node with a single GPU and check the memory usage?
My guess is that the code might be creating unnecessary CUDA contexts on other devices, but since the repository contains a lot of files, I haven’t looked through all of them.
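For reference, this is roughly the device setup I’d expect in a script launched via torch.distributed.launch (a generic sketch, not the repo’s actual code): each process should call torch.cuda.set_device with its local rank before touching CUDA, otherwise tensors and extra contexts can silently end up on cuda:0.

import argparse
import torch
import torch.distributed as dist

# torch.distributed.launch passes --local_rank to every process it spawns.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Pin this process to its own GPU before any .cuda() call, so no
# additional CUDA context is created on device 0.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(10, 10).cuda()  # lands on cuda:{local_rank}
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank
)

If the repository skips the set_device call somewhere, that could explain the extra memory usage when going through the launcher.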