Out of memory when running model.cuda()

I ran into a strange problem when training models:

Traceback (most recent call last):
  File "tools/train.py", line 243, in <module>
    main()
  File "tools/train.py", line 227, in main
    meta=meta)
  File "mmdet3d/apis/train.py", line 34, in train_model
    meta=meta)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/mmdet/apis/train.py", line 79, in train_detector
    model.cuda(),
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 637, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 552, in _apply
    param_applied = fn(param)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 637, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory

The OOM error is raised when moving the model to the GPU (model.cuda()). There is no competition for GPU resources from other processes.
Since I use the standard training config, the model cannot possibly be large enough to consume ~11 GB of memory.
The issue seems to be related to the running environment, because it disappears after switching to another server (with the same image).
Although it went away, I still have no idea what the real cause was.
Does anyone know the reason?
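
(As a rough sanity check on the memory claim, something like the sketch below can be run on a healthy setup; torch.cuda.mem_get_info assumes PyTorch >= 1.10, and the small nn.Sequential is only a placeholder for the detector actually built from the config.)

import torch
import torch.nn as nn

# Report how much memory is actually free on GPU 0 before calling model.cuda().
free, total = torch.cuda.mem_get_info(0)
print(f"GPU 0: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")

# Rough upper bound on parameter memory (element_size() is 4 bytes for fp32).
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Parameter memory: {param_bytes / 1024**2:.1f} MiB")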

Based on your description the error message sounds like a red herring, and you might indeed be hitting a setup issue. I don't know what the differences between the environments are, so I can't comment on a potential root cause.
In the past I've seen "random" CUDA errors when a sticky CUDA error had already been hit earlier: the CUDA context becomes invalid, and any further CUDA operations then fail with misleading messages.
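
For illustration, here is a minimal sketch of that failure mode (hypothetical, assuming a CUDA-capable setup): an out-of-bounds index triggers a device-side assert, which is sticky, and the same process then gets misleading errors from completely unrelated CUDA calls.

import torch

# Trigger a device-side assert (a sticky error) via out-of-bounds indexing.
x = torch.zeros(4, device="cuda:0")
bad_idx = torch.tensor([100], device="cuda:0")
try:
    y = x[bad_idx]
    torch.cuda.synchronize()  # kernels run asynchronously; the assert surfaces here
except RuntimeError as e:
    print("Original error:", e)

# The CUDA context of this process is now corrupted, so even a trivial,
# unrelated allocation can fail with a confusing message.
try:
    torch.rand(1).cuda()
except RuntimeError as e:
    print("Follow-up (misleading) error:", e)

# Running with CUDA_LAUNCH_BLOCKING=1 makes the first stack trace point at the
# operation that actually failed.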

Hi,
I think you are right, and it is indeed a red herring. The root cause might be related to the hardware. The following code snippet reproduces the bug:

>>> import torch
>>> torch.rand(1).cuda(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
>>> torch.rand(1).cuda(1)
tensor([0.7275], device='cuda:1')

It seems that there is some problem with GPU 0. Do you have any idea what could cause this?
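
A quick probe over all visible devices, along these lines, might make the affected GPUs easier to spot (a sketch using only standard torch.cuda calls):

import torch

# Try a trivial allocation on every visible GPU to see which devices respond.
for i in range(torch.cuda.device_count()):
    try:
        t = torch.rand(1, device=f"cuda:{i}")
        print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): OK -> {t.item():.4f}")
    except RuntimeError as e:
        print(f"cuda:{i}: FAILED -> {e}")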

Try rebooting the machine first (or resetting the device) and see if that helps.
Hardware issues are certainly possible, but the majority of such issues are usually caused on the software side.
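
If you want to compare the suspect GPU against the healthy ones before and after a reset, querying nvidia-smi from Python is one option (a sketch; the query fields below are standard nvidia-smi ones, adjust as needed):

import subprocess

# Dump per-GPU memory figures via nvidia-smi to compare the suspect device
# against the healthy ones (before and after a reboot/reset).
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,memory.total,memory.used,memory.free",
     "--format=csv"],
    stdout=subprocess.PIPE, universal_newlines=True, check=True,
)
print(result.stdout)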