RuntimeError: std::bad_alloc: temporary_buffer::allocate: get_temporary_buffer failed in RPN

Here is the full traceback:

Traceback (most recent call last):
  File "/home/shiqian/zhangzh/FAFRCNN/", line 28, in <module>
  File "/home/shiqian/zhangzh/FAFRCNN/", line 24, in train
  File "/home/shiqian/zhangzh/FAFRCNN/", line 82, in transfer_train_fast
    adv_meters, meters, ins_adv_optimizer, img_adv_optimizer, train_num=cfg.dis_train_num)
  File "/home/shiqian/zhangzh/FAFRCNN/", line 61, in dis_train
    faster_rcnn_mp.rpn_features(t_images, t_features, t_targets)
  File "/home/shiqian/zhangzh/FAFRCNN/model/faster_rcnn/", line 269, in rpn_features
    proposals, proposal_losses, iou, anchor_len = self.rpn(images, features, targets)
  File "/home/shiqian/.local/lib/python3.5/site-packages/torch/nn/modules/", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shiqian/zhangzh/FAFRCNN/model/faster_rcnn/", line 344, in forward
    ,, labels, regression_targets)
  File "/home/shiqian/.local/lib/python3.5/site-packages/torchvision/models/detection/", line 371, in compute_loss
  File "/home/shiqian/.local/lib/python3.5/site-packages/torch/nn/", line 2179, in l1_loss
    ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: std::bad_alloc: temporary_buffer::allocate: get_temporary_buffer failed

I have encountered this error many times, and it always happens at the RPN's loss computation. Sometimes I can train my model for many epochs without hitting it, but sometimes it occurs as early as the second epoch. I am training on a server that should have enough memory. This is the output of free -h in bash:

              total        used        free      shared  buff/cache   available
Mem:           376G        103G         79G         10G        193G        259G
Swap:          7.6G        5.4G        2.2G
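To check whether host memory is actually exhausted at the moment of the crash (rather than relying on a snapshot of free -h taken at some other time), one option is to log the process's peak resident set size every few iterations. This is a minimal, stdlib-only sketch; log_memory and the commented call site are my own names, not part of the training code:

```python
import resource


def log_memory(tag: str) -> int:
    """Print and return this process's peak resident set size in MiB.

    On Linux, ru_maxrss is reported in KiB (on macOS it is bytes,
    so the divisor would need adjusting there).
    """
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_mib = peak_kib // 1024
    print(f"[{tag}] peak RSS: {peak_mib} MiB")
    return peak_mib


# Hypothetical call site inside the training loop:
# log_memory(f"epoch {epoch} iter {it} before rpn loss")
```

If the logged value climbs steadily across epochs, that would point to a leak rather than a one-off oversized allocation.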

I manually modified torchvision's FasterRCNN to enable model parallelism; the other components are basically unchanged.
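Since the failure is always raised inside the RPN loss, one way to localize it is to wrap the loss call and dump the input shapes when it fails, since the size of the temporary allocation should track the number of sampled anchors and targets. This is a framework-agnostic sketch (safe_loss is a name I made up; it only assumes the inputs expose a .shape attribute, as torch tensors do):

```python
def safe_loss(loss_fn, *tensors, names=None):
    """Call loss_fn(*tensors); if it raises, report each input's shape
    so an abnormally large batch of anchors/targets shows up in the log."""
    try:
        return loss_fn(*tensors)
    except RuntimeError:
        names = names or [f"arg{i}" for i in range(len(tensors))]
        for name, t in zip(names, tensors):
            print(f"{name}: shape={getattr(t, 'shape', '<no shape>')}")
        raise


# Hypothetical use at the failing site:
# loss = safe_loss(F.l1_loss, expanded_input, expanded_target,
#                  names=["input", "target"])
```

If the shapes printed right before the crash are much larger than on healthy iterations, the problem is in the inputs to the loss rather than in the loss itself.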