Hi,
I am running into a memory error while trying to re-train an FPN+FasterRCNN network, following the approach shown here: https://github.com/pytorch/vision/blob/master/references/detection/train.py. My GPUs are 3 x Nvidia 2080 Ti; Torchvision version: 0.4.0.
To get the above code working for my dataset and machine configuration (2 GPUs), I referred to these comments: [1] FasterRCNN and MaskRCNN doesn't work with DataParallel or DistributedDataParallel · Issue #25627 · pytorch/pytorch · GitHub. My distributed setup is roughly as sketched below.
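For context, this is a simplified sketch of what my train.py does, following the torchvision reference script; the variable names and the placeholder `num_classes` are illustrative, not my exact code:

```python
import os
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Per-process distributed init; --use_env makes torch.distributed.launch
# export RANK / WORLD_SIZE / LOCAL_RANK for each worker process.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")
device = torch.device("cuda", local_rank)

# Pretrained FPN + Faster R-CNN, re-headed for my dataset's classes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = 2  # placeholder value, not my real class count
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
model.to(device)

# DistributedDataParallel instead of DataParallel, per issue #25627.
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```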
My training code completes the first epoch successfully, but fails with a memory error in the second epoch. Complete traceback:
(base) rsundara@igrida-abacus6:~/Code/fpnfrcnn_det$ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --world-size 2 --batch_size 2 --lr 1e-4
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
| distributed init (rank 1): env://
| distributed init (rank 0): env://
ARGS GPU is 0
Multi GPU training
Epoch: [1] [ 0/5944] eta: 23:42:17 lr: 0.000100 loss: 1.9594 (1.9594) loss_box_reg: 0.0522 (0.0522) loss_classifier: 0.7208 (0.7208) loss_objectness: 0.6930 (0.6930) loss_rpn_box_reg: 0.4934 (0.4934) time: 14.3570 data: 0.6149 max mem: 9659
Traceback (most recent call last):
  File "train.py", line 111, in <module>
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=200)
  File "/udd/rsundara/Code/fpnfrcnn_det/vision/engine.py", line 46, in train_one_epoch
    losses.backward()
  File "/udd/rsundara/.local/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/udd/rsundara/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 8.51 GiB (GPU 0; 10.73 GiB total capacity; 1.19 GiB already allocated; 8.51 GiB free; 109.16 MiB cached)
Traceback (most recent call last):
  File "/soft/igrida/spack/opt/spack/linux-debian8-x86_64/gcc-9.1.0/python-3.6.5-xu2dmz5rdvjfmiizbt65hyy2hsqsn3ri/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/soft/igrida/spack/opt/spack/linux-debian8-x86_64/gcc-9.1.0/python-3.6.5-xu2dmz5rdvjfmiizbt65hyy2hsqsn3ri/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/udd/rsundara/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/udd/rsundara/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/soft/igrida/spack/opt/spack/linux-debian8-x86_64/gcc-9.1.0/python-3.6.5-xu2dmz5rdvjfmiizbt65hyy2hsqsn3ri/bin/python', '-u', 'train.py', '--world-size', '2', '--batch_size', '2', '--lr', '1e-4']' returned non-zero exit status 1.
The training, however, progresses successfully on a single GPU, i.e. without using torch.utils.data.distributed.DistributedSampler. In the distributed case, the sampler and data loader are set up roughly as sketched below.
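Again a rough sketch; `dataset`, `collate_fn`, `model`, `optimizer`, `device`, and `num_epochs` stand in for my actual objects:

```python
import torch
from engine import train_one_epoch  # engine.py from the torchvision detection references

# Shard the dataset across processes so each GPU sees a different subset.
train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
train_batch_sampler = torch.utils.data.BatchSampler(
    train_sampler, batch_size=2, drop_last=True)

data_loader = torch.utils.data.DataLoader(
    dataset, batch_sampler=train_batch_sampler,
    num_workers=4, collate_fn=collate_fn)

for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=200)
```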
What could be going wrong with the multi-GPU training?
Regards,