I am training a model that uses about 10 GB of memory. My GPU has 11 GB of RAM. Training progresses fine for about 2 epochs (30,000 iterations), then I suddenly get the error below. Why does it say that I do not have enough memory? Could something in the code be slowly increasing RAM usage?
I just read about pin_memory and found that I have it set to True in my DataLoader. Could this be the most likely reason why I randomly get OOM errors?
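On the "slowly increasing RAM" hypothesis: a common cause of gradual memory growth in PyTorch training loops is accumulating loss tensors (e.g. `running_loss += loss` or appending `loss` to a list), because each tensor keeps its whole autograd graph alive; storing `loss.item()` instead releases it. Below is a minimal stdlib-only sketch of that pattern; `FakeLoss` is a hypothetical stand-in for a loss tensor (it is not part of openpifpaf or PyTorch), with a byte buffer playing the role of the computation graph:

```python
import tracemalloc

class FakeLoss:
    """Hypothetical stand-in for a loss tensor: holds a large buffer,
    the way a real loss tensor references its autograd graph."""
    def __init__(self):
        self.graph = bytearray(1_000_000)  # ~1 MB "computation graph"

    def item(self):
        # Returns a plain Python float with no reference to the buffer,
        # analogous to torch.Tensor.item().
        return 0.5

def run_epoch(detach):
    """Simulate one epoch; return peak traced memory in bytes."""
    tracemalloc.start()
    history = []
    for _ in range(50):
        loss = FakeLoss()
        # Leaky: keeping `loss` itself keeps every "graph" alive.
        # Safe: keeping only the float lets each one be freed.
        history.append(loss.item() if detach else loss)
    peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()
    return peak

leaky_peak = run_epoch(detach=False)
safe_peak = run_epoch(detach=True)
print(f"leaky peak: {leaky_peak}, safe peak: {safe_peak}")
```

The leaky variant's peak grows with the number of iterations, while the safe variant's stays flat; the real failure mode is the same but on GPU memory, which is why it can surface only after tens of thousands of iterations. (Note that `pin_memory=True` affects host RAM for the DataLoader's staging buffers, which is a separate pool from the GPU memory this simulates.)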
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/train.py", line 189, in <module>
    main()
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/train.py", line 185, in main
    trainer.loop(train_loader, val_loader, args.epochs, start_epoch=start_epoch)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/trainer.py", line 91, in loop
    self.train(train_scenes, epoch)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/trainer.py", line 190, in train
    loss, head_losses = self.train_batch(data1, target1, meta1, data2, target2, meta2, apply_gradients)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/trainer.py", line 108, in train_batch
    loss1, head_losses1 = self.loss(outputs1, targets1, head="pifpaf")
  File "/home/haziq/env_openpifpaf/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/losses.py", line 101, in forward
    for l, f, t in zip(self.losses_pifpaf, head_fields, head_targets)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/losses.py", line 102, in <listcomp>
    for ll in l(f, t)]
  File "/home/haziq/env_openpifpaf/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/losses.py", line 279, in forward
    for x_scale, scale_to_kp in zip(x_scales, self.scales_to_kp)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/losses.py", line 279, in <listcomp>
    for x_scale, scale_to_kp in zip(x_scales, self.scales_to_kp)
  File "/home/haziq/env_openpifpaf/lib/python3.6/site-packages/torch/nn/functional.py", line 2231, in l1_loss
    ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: reduce failed to get memory buffer: out of memory