Here is the error message:
Traceback (most recent call last):
  File "/home/shiqian/zhangzh/FAFRCNN/train.py", line 28, in <module>
    train()
  File "/home/shiqian/zhangzh/FAFRCNN/train.py", line 24, in train
    transfer_train_fast('/home/shiqian/zhangzh/FAFRCNN/checkpoints/fs_40tr_27.5con_d1_0.75l-0.20a-1.00b-1.00/03090209_0.310')
  File "/home/shiqian/zhangzh/FAFRCNN/transfer_train_fast.py", line 82, in transfer_train_fast
    adv_meters, meters, ins_adv_optimizer, img_adv_optimizer, train_num=cfg.dis_train_num)
  File "/home/shiqian/zhangzh/FAFRCNN/dis_train.py", line 61, in dis_train
    faster_rcnn_mp.rpn_features(t_images, t_features, t_targets)
  File "/home/shiqian/zhangzh/FAFRCNN/model/faster_rcnn/faster_rcnn_model_parallel.py", line 269, in rpn_features
    proposals, proposal_losses, iou, anchor_len = self.rpn(images, features, targets)
  File "/home/shiqian/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shiqian/zhangzh/FAFRCNN/model/faster_rcnn/faster_rcnn_model_parallel.py", line 344, in forward
    objectness.to(cfg.gpu5), pred_bbox_deltas.to(cfg.gpu5), labels, regression_targets)
  File "/home/shiqian/.local/lib/python3.5/site-packages/torchvision/models/detection/rpn.py", line 371, in compute_loss
    reduction="sum",
  File "/home/shiqian/.local/lib/python3.5/site-packages/torch/nn/functional.py", line 2179, in l1_loss
    ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: std::bad_alloc: temporary_buffer::allocate: get_temporary_buffer failed
I have encountered this error many times, and it always happens during the RPN's loss computation. Sometimes I can train my model for many epochs without hitting it, but other times it occurs as early as the second epoch. I am training on a server that should have enough memory. This is the output of free -h in bash:
              total        used        free      shared  buff/cache   available
Mem:           376G        103G         79G         10G        193G        259G
Swap:          7.6G        5.4G        2.2G
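Because the failure is intermittent, a single free -h reading may not reflect the state at the moment of the crash. One way to narrow it down is to log memory figures from inside the training process just before the RPN loss is computed. The following is only a minimal sketch under some assumptions: it uses the Python standard library only, it expects a Linux host with /proc/meminfo, and the names read_meminfo, memory_snapshot and format_snapshot are hypothetical helpers, not part of PyTorch or torchvision. It avoids f-strings so that it runs on the Python 3.5 shown in the traceback. It does not establish whether host memory is actually the cause.

```python
import os
import resource


def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into {field: value in kB}; empty dict if unavailable."""
    info = {}
    if not os.path.exists(path):
        return info
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts and parts[0].isdigit():
                info[key.strip()] = int(parts[0])
    return info


def memory_snapshot():
    """Return this process's peak RSS and system-wide availability, in MiB."""
    # On Linux, ru_maxrss is reported in kilobytes (assumption: Linux host).
    peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    info = read_meminfo()
    return {
        "peak_rss_mib": peak_rss_kb / 1024.0,
        "mem_available_mib": info.get("MemAvailable", 0) / 1024.0,
        "swap_free_mib": info.get("SwapFree", 0) / 1024.0,
    }


def format_snapshot(tag, snap):
    """Format a snapshot as a single log line."""
    return "[{}] peak RSS {:.0f} MiB, available {:.0f} MiB, swap free {:.0f} MiB".format(
        tag, snap["peak_rss_mib"], snap["mem_available_mib"], snap["swap_free_mib"]
    )


if __name__ == "__main__":
    print(format_snapshot("before rpn loss", memory_snapshot()))
```

Calling print(format_snapshot("before rpn loss", memory_snapshot())) right before the loss computation in the modified forward would leave a record of the last values seen before a crash, which can then be compared against runs that complete normally.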
I modified the FasterRCNN in PyTorch to enable model parallelism manually; the other components are basically the same.