Segmentation fault when running inference with larger batch sizes on CPU

morou · August 18, 2021, 9:22am

Hi,
I get a segmentation fault when running inference on images with larger batch sizes on CPU.
Environment:

CPU name: AMD EPYC 7763 64-Core Processor x 2
Docker image: nvcr.io/nvidia/pytorch:20.10-py3
torch: 1.7.0a0+7036e91
torchvision: 0.8.0a0
Setting:
Number of workers: 64
Batch size: [32, 64, 128, 256, 512]
Model: torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
Image: https://cocodataset.org/#explore?id=342322
To reproduce:

import os

import cv2
import torch
import torchvision

device = torch.device('cpu')
batchsize = 258

# Load the pretrained Mask R-CNN and switch it to inference mode
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()
model.to(device)

# Read one image, convert BGR (OpenCV) to RGB, and turn it into a float tensor
img_path = os.path.join('test_one_img', 'test.jpg')
img = cv2.imread(img_path, -1)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = torchvision.transforms.functional.to_tensor(img)

# Build the batch by repeating the same image batchsize times
img = list(img.to(device) for _ in range(batchsize))

output = model(img)

Result:
Batch size <128: works fine.
Batch size >256: segmentation fault
gdb debug info:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
ROIAlignForward<float> (nthreads=nthreads@entry=90617856, input=0x7f5d13b50040, spatial_scale=@0x7ffcd15a953c: 0.25,
    channels=channels@entry=256, height=height@entry=200, width=width@entry=272, pooled_height=14, pooled_width=14, sampling_ratio=2,
    aligned=false, rois=0x562a6c532300, output=0x7f6e507aa040) at /tmp/pip-req-build-6hs294b4/torchvision/csrc/cpu/ROIAlign_cpu.cpp:199
199     /tmp/pip-req-build-6hs294b4/torchvision/csrc/cpu/ROIAlign_cpu.cpp: No such file or directory.
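
As a rough plausibility check, the arguments in the trace can be run through some arithmetic. The sketch below rests on assumptions that are not confirmed anywhere in this thread: that `nthreads` counts output elements (ROIs x channels x pooled_height x pooled_width), that the input feature map holds one channels x height x width block per image, that the trace comes from the batchsize = 258 run, and that the kernel indexes with 32-bit signed integers.

```python
# Values copied from the gdb trace above
nthreads = 90617856
channels = 256
height = 200
width = 272
pooled_height = 14
pooled_width = 14

# Assumption: the trace was captured with the batch size from the repro script
batchsize = 258

INT32_MAX = 2**31 - 1

# Assumption: nthreads = n_rois * channels * pooled_height * pooled_width
n_rois, leftover = divmod(nthreads, channels * pooled_height * pooled_width)
rois_per_image, roi_remainder = divmod(n_rois, batchsize)

# Assumption: the input holds batchsize * channels * height * width floats
elements_per_image = channels * height * width
input_elements = batchsize * elements_per_image

# Largest batch whose input element count still fits in a 32-bit signed int
max_batch_within_int32 = INT32_MAX // elements_per_image

print("ROIs in this call:", n_rois, "(leftover:", leftover, ")")
print("ROIs per image:", rois_per_image, "(remainder:", roi_remainder, ")")
print("input elements:", input_elements)
print("exceeds INT32_MAX:", input_elements > INT32_MAX)
print("largest batch within INT32_MAX:", max_batch_within_int32)
```

Under these assumptions the call covers 1806 ROIs, which is 7 per image for a batch of 258, and the input would hold about 3.59 billion elements, above the 32-bit signed limit of 2,147,483,647. The largest batch that stays under that limit would be 154, which is at least consistent with small batches working and large ones crashing. This is a hypothesis only, not a confirmed diagnosis.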

Any help is appreciated.
Thanks!

ptrblck · August 18, 2021, 7:11pm

Could you update PyTorch and torchvision to the latest stable release (or nightly) and check whether you are still seeing the issue? The 1.7.0 pre-release is a bit older by now.
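
After updating, it may help to confirm which versions are actually installed before re-running the reproduction. A minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_version` is made up for this example:

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages relevant to this thread
for name in ("torch", "torchvision"):
    print(name, installed_version(name))
```

If either line prints None, the package is not installed in the environment the script runs in.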