Hi,
Here is the error message:
Traceback (most recent call last):
  File "/home/shiqian/zhangzh/FAFRCNN/train.py", line 28, in <module>
    train()
  File "/home/shiqian/zhangzh/FAFRCNN/train.py", line 24, in train
    transfer_train_fast('/home/shiqian/zhangzh/FAFRCNN/checkpoints/fs_40tr_27.5con_d1_0.75l-0.20a-1.00b-1.00/03090209_0.310')
  File "/home/shiqian/zhangzh/FAFRCNN/transfer_train_fast.py", line 82, in transfer_train_fast
    adv_meters, meters, ins_adv_optimizer, img_adv_optimizer, train_num=cfg.dis_train_num)
  File "/home/shiqian/zhangzh/FAFRCNN/dis_train.py", line 61, in dis_train
    faster_rcnn_mp.rpn_features(t_images, t_features, t_targets)
  File "/home/shiqian/zhangzh/FAFRCNN/model/faster_rcnn/faster_rcnn_model_parallel.py", line 269, in rpn_features
    proposals, proposal_losses, iou, anchor_len = self.rpn(images, features, targets)
  File "/home/shiqian/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shiqian/zhangzh/FAFRCNN/model/faster_rcnn/faster_rcnn_model_parallel.py", line 344, in forward
    objectness.to(cfg.gpu5), pred_bbox_deltas.to(cfg.gpu5), labels, regression_targets)
  File "/home/shiqian/.local/lib/python3.5/site-packages/torchvision/models/detection/rpn.py", line 371, in compute_loss
    reduction="sum",
  File "/home/shiqian/.local/lib/python3.5/site-packages/torch/nn/functional.py", line 2179, in l1_loss
    ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: std::bad_alloc: temporary_buffer::allocate: get_temporary_buffer failed
I have hit this error many times, and it always occurs during the RPN's loss computation. Sometimes I can train my model for many epochs without it, but other times it appears as early as the second epoch. I am training on a server that should have enough memory. This is the output of free -h in bash:
              total        used        free      shared  buff/cache   available
Mem:           376G        103G         79G         10G        193G        259G
Swap:          7.6G        5.4G        2.2G
I modified the FasterRCNN in PyTorch to enable model parallelism manually; the other components are basically unchanged.
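Since I am not sure whether the failed allocation is in host memory or in GPU memory, one thing I could do is log both right before the RPN loss is computed. Below is a minimal sketch of such a logger. The function names (host_memory_snapshot, gpu_memory_snapshot, log_memory) are hypothetical helpers made up for illustration and are not part of my code or of PyTorch. The sketch assumes a Unix system and skips the GPU part when torch or CUDA is unavailable.

```python
import resource
import sys


def host_memory_snapshot():
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return {"host_peak_rss_mib": peak / divisor}


def gpu_memory_snapshot():
    """Return per-device CUDA memory stats in MiB, or {} if unavailable."""
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}
    stats = {}
    for idx in range(torch.cuda.device_count()):
        # Memory currently held by tensors, and the peak so far, per device.
        allocated = torch.cuda.memory_allocated(idx) / 2.0 ** 20
        peak = torch.cuda.max_memory_allocated(idx) / 2.0 ** 20
        stats["cuda{}_allocated_mib".format(idx)] = allocated
        stats["cuda{}_peak_mib".format(idx)] = peak
    return stats


def log_memory(tag):
    """Print one line of host and GPU memory stats and return them as a dict."""
    snapshot = host_memory_snapshot()
    snapshot.update(gpu_memory_snapshot())
    fields = ", ".join(
        "{}={:.1f}".format(key, value) for key, value in sorted(snapshot.items())
    )
    print("[{}] {}".format(tag, fields))
    return snapshot
```

Calling log_memory("before rpn loss") just before the self.rpn(images, features, targets) call in rpn_features would show whether host or per-GPU usage climbs across iterations before the crash. The sketch uses str.format rather than f-strings because the traceback shows Python 3.5.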