Memory Error: FPN+FasterRCNN in parallel

Hi,

I am running into a memory error while trying to re-train an FPN+FasterRCNN network in a similar fashion to the one shown here: https://github.com/pytorch/vision/blob/master/references/detection/train.py. My GPUs are 3 x Nvidia 2080 Ti; Torchvision version: 0.4.0.

To get the above code working for my dataset and machine configuration (2 GPUs), I referred to these comments: [1] https://github.com/pytorch/pytorch/issues/25627#issuecomment-527992106

My training code completes the first epoch successfully but fails with a memory error in the second epoch. Complete traceback:

(base) rsundara@igrida-abacus6:~/Code/fpnfrcnn_det$ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --world-size 2 --batch_size 2 --lr 1e-4


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


| distributed init (rank 1): env://
| distributed init (rank 0): env://
ARGS GPU is 0
Multi GPU training
Epoch: [1] [ 0/5944] eta: 23:42:17 lr: 0.000100 loss: 1.9594 (1.9594) loss_box_reg: 0.0522 (0.0522) loss_classifier: 0.7208 (0.7208) loss_objectness: 0.6930 (0.6930) loss_rpn_box_reg: 0.4934 (0.4934) time: 14.3570 data: 0.6149 max mem: 9659
Traceback (most recent call last):
  File "train.py", line 111, in <module>
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=200)
  File "/udd/rsundara/Code/fpnfrcnn_det/vision/engine.py", line 46, in train_one_epoch
    losses.backward()
  File "/udd/rsundara/.local/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/udd/rsundara/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 8.51 GiB (GPU 0; 10.73 GiB total capacity; 1.19 GiB already allocated; 8.51 GiB free; 109.16 MiB cached)
Traceback (most recent call last):
  File "/soft/igrida/spack/opt/spack/linux-debian8-x86_64/gcc-9.1.0/python-3.6.5-xu2dmz5rdvjfmiizbt65hyy2hsqsn3ri/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/soft/igrida/spack/opt/spack/linux-debian8-x86_64/gcc-9.1.0/python-3.6.5-xu2dmz5rdvjfmiizbt65hyy2hsqsn3ri/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/udd/rsundara/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/udd/rsundara/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/soft/igrida/spack/opt/spack/linux-debian8-x86_64/gcc-9.1.0/python-3.6.5-xu2dmz5rdvjfmiizbt65hyy2hsqsn3ri/bin/python', '-u', 'train.py', '--world-size', '2', '--batch_size', '2', '--lr', '1e-4']' returned non-zero exit status 1.

Training does, however, progress successfully on a single GPU, i.e. without torch.utils.data.distributed.DistributedSampler.

What could be going wrong with the multi-GPU training?

Regards,

Here’s the second link I followed : [2] : Size mismatch when running FasterRCNN in parallel.

Due to the restriction on the number of links a new user can post, I couldn’t share it in my original post.

Based on the error message it seems you might be close to the memory limit in the first epoch, and creating some additional tensors might push you over the edge.
Did you already wrap the training and validation loops in separate methods?
This could save some memory, as Python uses function scoping and would free all unused tensors after the function scope is left.
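The scoping point can be illustrated with a stand-in object rather than a real tensor. This is only a minimal sketch: `FakeTensor` and `run_epoch` are hypothetical names, and the immediate release relies on CPython's reference counting.

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for a large tensor; it only needs to be weak-referenceable."""


def run_epoch():
    # `activations` is a local name, so its last reference disappears
    # when the function returns.
    activations = FakeTensor()
    return weakref.ref(activations)


# Inside a function: the object is released once the scope is left.
ref_from_function = run_epoch()
print(ref_from_function() is None)

# At module level: the name keeps the object alive until it is deleted or rebound.
leftover = FakeTensor()
ref_from_module = weakref.ref(leftover)
print(ref_from_module() is None)
del leftover
print(ref_from_module() is None)
```

The same reasoning applies to losses and outputs that are still bound to names in a script-level training loop: they stay alive until the names are rebound or deleted.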

Also, I assume you are wrapping the validation loop in a with torch.no_grad() block?

If that’s the case, then you could

  • reduce the batch size
  • decrease the number of parameters or activations in your model
  • use native mixed-precision with the nightly binaries or master build
  • trade compute for memory using torch.utils.checkpoint
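As a rough sketch of the last option, the toy module below wraps part of its forward pass in torch.utils.checkpoint. `CheckpointedBlock` is a hypothetical example and not part of torchvision's Faster R-CNN, and the `use_reentrant=False` argument assumes a recent PyTorch release (older versions do not accept it).

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(nn.Module):
    """Hypothetical toy block used only to illustrate checkpointing."""

    def __init__(self, channels=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        # Intermediate activations inside `self.body` are not kept; they are
        # recomputed during backward, trading extra compute for lower peak memory.
        return checkpoint(self.body, x, use_reentrant=False)


block = CheckpointedBlock()
x = torch.randn(2, 8, 16, 16, requires_grad=True)
out = block(x)
out.sum().backward()
print(out.shape, x.grad is not None)
```

How much memory this saves depends on which parts of the model are wrapped; wrapping the segments with the largest activations is usually the natural starting point.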

Hi @ptrblck. Thank you for your response. I am replying to your questions inline below.

Apparently not. I have ~8 GB of memory left in the first iteration on both GPUs.

Yes, I don’t validate the model until the end of training, as I’m fine-tuning with pre-trained weights.

Yes, it’s the same as torchvision.references.detection.engine.evaluate, which is decorated with @torch.no_grad().

My batch size is 2. It works well in the single-GPU case, just not in the multi-GPU case.

Could you please elaborate on this?

I am curious why training on 2 GPUs fails while it progresses well on a single GPU.

Thanks again for your time and help @ptrblck

It might be “bad luck” due to a change in memory fragmentation, as I still think you might be close to the limit:

Tried to allocate 8.51 GiB (GPU 0; 10.73 GiB total capacity; 1.19 GiB already allocated; 8.51 GiB free; 109.16 MiB cached)

so it seems that very little memory might be missing.

The docs explain the general use case and this notebook gives you an example (the notebook is quite old by now, but should still give you an idea of the usage).

Hi @ptrblck, thank you again for your reply. Is there a way to check whether GPU memory is not being freed between subsequent iterations (especially on the second GPU)? I suspect this could be happening, as single-GPU training works perfectly fine for me, even at batchsize=current_batch*8. Only the multi-GPU case crashes, despite the two cards being identical.

Sure, I will have a look, thanks!

Regards,

If you are running out of memory on a device, PyTorch will clear the cache and try to reallocate the memory. There is unfortunately not much you can check besides torch.cuda.memory_summary().

You could try to del unnecessary tensors early, so that more memory might be available at the point where you currently hit the OOM issue.
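One possible way to watch memory across iterations is a small reporting helper built on the torch.cuda statistics functions. This is only a sketch: `cuda_memory_report` is a hypothetical helper, and the loop shown in the comments uses placeholder names (`rank`, `i`, `outputs`) that are not taken from the training script above.

```python
import torch


def cuda_memory_report(tag, device=None):
    """Return a one-line summary of allocated, reserved and peak memory in MiB."""
    if not torch.cuda.is_available():
        return f"{tag}: CUDA not available"
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated(device) / mib
    reserved = torch.cuda.memory_reserved(device) / mib
    peak = torch.cuda.max_memory_allocated(device) / mib
    return (f"{tag}: allocated={allocated:.0f} MiB "
            f"reserved={reserved:.0f} MiB peak={peak:.0f} MiB")


# Hypothetical use inside a training loop, once per iteration on each rank:
#   losses.backward()
#   optimizer.step()
#   del losses, outputs                    # drop references that are no longer needed
#   print(cuda_memory_report(f"rank {rank} iter {i}"))
#   torch.cuda.reset_peak_memory_stats()   # make the next peak a per-iteration value
print(cuda_memory_report("startup"))
```

If the allocated value keeps growing from one iteration to the next on one rank, something is still holding references to tensors; if only the peak jumps, a single large input is the more likely cause.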

Okay, I think I have narrowed down the problem and realise your answer is pertinent, @ptrblck. The network takes in images of varying sizes rather than a fixed input size, causing the memory usage to fluctuate.
I have managed to train the network on a V100 GPU, and here are the first two log lines (printed every 200 iterations):

Epoch: [1] [ 0/7925] eta: 6:09:17 lr: 0.000500 loss: 2.0259 (2.0259) loss_classifier: 0.6635 (0.6635) loss_box_reg: 0.0236 (0.0236) loss_objectness: 0.6907 (0.6907) loss_rpn_box_reg: 0.6483 (0.6483) time: 2.7958 data: 1.1561 max mem: 7464
Epoch: [1] [ 200/7925] eta: 1:25:06 lr: 0.000500 loss: 1.0027 (1.3102) loss_classifier: 0.1189 (0.2118) loss_box_reg: 0.0081 (0.0198) loss_objectness: 0.5959 (0.6495) loss_rpn_box_reg: 0.3003 (0.4291) time: 0.6534 data: 0.0127 max mem: 10348

I opened an issue this morning proposing to remove GeneralizedRCNNTransform as a compulsory transformation for all FasterRCNN models, as I believe fixed-size inputs would keep the memory from fluctuating.

Thanks again!

That’s interesting. Would it be possible to somehow sort the images so that you start with a large one and use smaller inputs in the following iterations?
The allocator could then probably reuse the already allocated memory instead of allocating new blocks.
Maybe you’ll be able to lower the memory footprint this way.
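A minimal sketch of such an ordering, assuming the (height, width) of every image is known up front (for example from the annotation metadata); `indices_largest_first` is a hypothetical helper, not an existing torchvision function:

```python
def indices_largest_first(sizes):
    """Return dataset indices ordered by image area, largest first.

    `sizes` is a list of (height, width) tuples, one per image.
    """
    return sorted(
        range(len(sizes)),
        key=lambda i: sizes[i][0] * sizes[i][1],
        reverse=True,
    )


sizes = [(480, 640), (1080, 1920), (600, 800), (720, 1280)]
order = indices_largest_first(sizes)
print(order)  # [1, 3, 2, 0]
```

The resulting index list could be fed to a custom sampler. Sorting the whole epoch removes shuffling, so it may be enough to place only a few of the largest images first and shuffle the rest.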