I trained my model on two GPUs with DDP(), and it worked well.
But today it always shows “CUDA out of memory” during the training step, and it always stops at the same training iteration (247).
I checked the memory with watch -n 1 nvidia-smi and it was normal up to iteration 247. At iteration 247, the memory suddenly increases until it exceeds 32 GB.
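Polling nvidia-smi once per second can miss short-lived peaks, so a per-iteration reading taken inside the training loop may pin down the jump more precisely. The following is only a minimal sketch: find_memory_spike and cuda_peak_mib are hypothetical helper names, the 1.5x threshold is an arbitrary assumption, and the readings are synthetic placeholders, not measurements from this run.

```python
def find_memory_spike(readings_mib, factor=1.5):
    """Return the index of the first reading larger than `factor` times
    the previous one, or None if there is no such jump."""
    for i in range(1, len(readings_mib)):
        prev, cur = readings_mib[i - 1], readings_mib[i]
        if prev > 0 and cur > factor * prev:
            return i
    return None


def cuda_peak_mib(device=0):
    """Peak memory allocated by PyTorch on `device` since the last reset, in MiB.

    Intended to be called once per training iteration (hypothetical helper).
    """
    import torch  # imported lazily so find_memory_spike also works without a GPU

    peak = torch.cuda.max_memory_allocated(device) / 2**20
    torch.cuda.reset_peak_memory_stats(device)
    return peak


# Synthetic values standing in for per-iteration results of cuda_peak_mib().
readings = [5200.0, 5310.0, 5250.0, 29000.0]
print(find_memory_spike(readings))  # index of the first sudden jump
```

Logging the input shape (for example lidar_pillar.shape) next to each reading would show whether the batch at that iteration is unusually large.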
2022-08-16 14:15:03.840 | INFO | utils.fit:loss_log:89 - loss_all: 0.000, heatmap_loss: 0.000, xy_offset_loss: 0.000, z_offset_loss: 0.000, wlh_loss: 0.000, angle_bin: 0.000, angle_offset: 0.000, cls_loss: 0.000
2022-08-16 14:15:03.841 | INFO | utils.fit:training_step:185 - Parameter containing:
tensor([[-0.2428, 0.0316, -0.2120, 0.2174, 0.2807, 0.2077, 0.0292, 0.1832,
0.1201],
[-0.2144, 0.0829, -0.0840, -0.0549, -0.0242, 0.1138, -0.2576, -0.0103,
0.3143],
[ 0.0353, 0.0268, -0.2982, 0.0269, -0.0893, 0.0274, 0.1253, 0.1671,
-0.3090]], device='cuda:0', requires_grad=True)
2022-08-16 14:15:03.843 | INFO | utils.fit:training_step:186 - epoch: 72/200, iter: 247/906 optim_lr: 0.00010000000000000002, sche_lr: 0.00010000000000000002
time cost : 3.46462 sec
Traceback (most recent call last):
File "3D_train.py", line 242, in <module>
trainer.train()
File "3D_train.py", line 235, in train
self.fit_func.fit(epoch)
File "/data/lianghao/lidar_and_4D_imaging_radar_fusion_demo/3D-MAN-reproduction/utils/fit.py", line 354, in fit
all_loss = self.training_step(epoch)
File "/data/lianghao/lidar_and_4D_imaging_radar_fusion_demo/3D-MAN-reproduction/utils/fit.py", line 150, in training_step
output = self.model(lidar_pillar, self.opts) # time cost : 0.03377 sec
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/lianghao/lidar_and_4D_imaging_radar_fusion_demo/3D-MAN-reproduction/model/FSD_module.py", line 123, in forward
pesu_img = self.lidar_branch(pillars)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/lianghao/lidar_and_4D_imaging_radar_fusion_demo/3D-MAN-reproduction/model/FSD_module.py", line 81, in forward
x = self.PFNLayer(x)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/lianghao/lidar_and_4D_imaging_radar_fusion_demo/3D-MAN-reproduction/model/FSD_module.py", line 64, in forward
x = self.norm(x)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 2058, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 1.83 GiB (GPU 0; 31.75 GiB total capacity; 28.34 GiB already allocated; 106.50 MiB free; 29.10 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '3D_train.py', '--local_rank=1']' returned non-zero exit status 1.
I resumed training from my checkpoint file, so the start epoch is 72.
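To put the numbers in the RuntimeError above side by side, here is a small sketch that parses the message and converts the sizes to bytes. parse_oom is a hypothetical helper, not part of PyTorch, and the regular expressions assume the message format shown in this traceback. The float32 element count is also an assumption about the dtype of the tensor being allocated.

```python
import re

UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}

MESSAGE = (
    "CUDA out of memory. Tried to allocate 1.83 GiB (GPU 0; 31.75 GiB total "
    "capacity; 28.34 GiB already allocated; 106.50 MiB free; 29.10 GiB "
    "reserved in total by PyTorch)"
)


def parse_oom(message):
    """Return the sizes named in a CUDA OOM message as a dict of bytes."""
    sizes = {}
    requested = re.search(r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)", message)
    if requested:
        sizes["requested"] = float(requested.group(1)) * UNITS[requested.group(2)]
    pattern = r"([\d.]+) (KiB|MiB|GiB) (total capacity|already allocated|free|reserved)"
    for value, unit, label in re.findall(pattern, message):
        sizes[label] = float(value) * UNITS[unit]
    return sizes


sizes = parse_oom(MESSAGE)
# Memory held by PyTorch's caching allocator but not used by live tensors.
cached_unused = sizes["reserved"] - sizes["already allocated"]
# Assumption: the requested tensor is float32 (4 bytes per element).
elements = sizes["requested"] / 4

print(f"requested:     {sizes['requested'] / 2**30:.2f} GiB")
print(f"free on GPU:   {sizes['free'] / 2**30:.2f} GiB")
print(f"cached unused: {cached_unused / 2**30:.2f} GiB")
print(f"float32 elements requested: {elements:,.0f}")
```

Under the float32 assumption, the single request in batch_norm corresponds to roughly 491 million elements, which is larger than the free memory and the unused cached memory combined.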