I ran into a strange problem while training models:
```
Traceback (most recent call last):
File "tools/train.py", line 243, in <module>
main()
File "tools/train.py", line 227, in main
meta=meta)
File "mmdet3d/apis/train.py", line 34, in train_model
meta=meta)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/mmdet/apis/train.py", line 79, in train_detector
model.cuda(),
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 637, in cuda
return self._apply(lambda t: t.cuda(device))
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
module._apply(fn)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
module._apply(fn)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 552, in _apply
param_applied = fn(param)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 637, in <lambda>
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
```
The OOM error is raised when moving the model to the GPU (model.cuda()), before training even starts. There are no other processes competing for GPU resources.
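For what it's worth, this is roughly the kind of check I mean when I say the card is idle; it is only an illustrative sketch (it assumes pynvml is installed and is not part of the mmdet3d training code):

```python
# Rough sketch: confirm GPU 0 really is free right before model.cuda().
# Assumes the pynvml package is installed (pip install nvidia-ml-py3).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"total: {mem.total / 1024**3:.2f} GiB, "
      f"used: {mem.used / 1024**3:.2f} GiB, "
      f"free: {mem.free / 1024**3:.2f} GiB")
pynvml.nvmlShutdown()
```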
Since I use the standard training config, the model should be nowhere near large enough to consume the GPU's ~11 GB of memory just by having its weights moved to the device.
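As a sanity check on that claim, here is a rough way one could estimate how much memory the weights alone need when model.cuda() copies them over (plain PyTorch, nothing mmdet3d-specific; it only counts parameters and buffers, not activations, so it is just an upper bound for the copy itself):

```python
# Rough sketch: estimate the GPU memory needed for the raw weights.
# Any detector built by mmdet3d is an nn.Module, so the same function applies to it.
import torch

def param_memory_gib(model: torch.nn.Module) -> float:
    """Sum the bytes of all parameters and buffers, in GiB."""
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return n_bytes / 1024**3

# Toy example; for the real run, pass the detector built from the config instead.
toy = torch.nn.Linear(1024, 1024)
print(f"{param_memory_gib(toy):.4f} GiB")  # orders of magnitude below 11 GiB
```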
The issue seems related to the runtime environment, because it disappears after switching to another server (running the same image).
Although the problem went away, I still have no idea what the real cause was; I can't make sense of it at all.
Does anyone know what might be going on?