I’m stuck on one very strange problem. I work with the recently released Scene Graph Benchmark and got it to train on GQA, but I have one issue. It trains as expected when I use the following command (it uses batch_size = 2):

CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/relation_train_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml"

But when I launch it through torch.distributed.launch instead:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=1 tools/relation_train_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml"

I get RuntimeError: CUDA out of memory (still with batch_size = 2). My original goal was to train on 4 GPUs with batch_size = 8, but I ran into this problem first. What could be the cause, and what should I do to train it properly on 4 GPUs?
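For completeness, the 4-GPU run I am ultimately aiming for would look roughly like this (a sketch: I am assuming the repository keeps maskrcnn-benchmark's SOLVER.IMS_PER_BATCH override for the global batch size, since it is built on that codebase; adjust to whatever key the config actually uses):

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=4 tools/relation_train_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 8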
My setup includes four 2080 Ti cards, so there should be plenty of memory.
Could you try running the DDP command on a single node with a single visible GPU and check the memory usage?
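If it helps, here is a minimal sketch you could drop into relation_train_net.py to log what the training process itself allocates on each device (the insertion point is up to you; I'm only assuming a recent PyTorch with torch.cuda.memory_reserved):

import torch

# Log per-device allocator stats for this process. Note that a stray
# CUDA context created by another process will NOT show up here, only
# in nvidia-smi, so it is worth checking both.
for i in range(torch.cuda.device_count()):
    alloc_mb = torch.cuda.memory_allocated(i) / 1024**2
    reserved_mb = torch.cuda.memory_reserved(i) / 1024**2
    print(f"cuda:{i}: allocated={alloc_mb:.0f} MiB, reserved={reserved_mb:.0f} MiB")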
I suspect the code might be creating unnecessary CUDA contexts on other devices, but since the repository contains a lot of files, I haven’t looked through all of them.
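If stray contexts are the culprit, the usual pattern to look for is a process touching cuda:0 before pinning itself to its own rank. A minimal sketch of the order of operations that avoids this (local_rank is the argument torch.distributed.launch passes in; the dummy model is just for illustration, not the repository's actual setup code):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Pin this process to its own GPU *before* init_process_group or any
# other CUDA call; otherwise every worker opens a context on cuda:0,
# and those extra contexts alone can push GPU 0 out of memory.
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

# Dummy model, just to show the wrapping order.
model = torch.nn.Linear(10, 10).cuda()  # lands on the device set above
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank
)

It might be worth grepping the repository for set_device to check that it runs before the first .cuda() call.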