CUDA error: an illegal memory access was encountered - torch.ops.torchvision.roi_align

Sueone · September 8, 2021, 5:17pm

Hey there.
I’m trying to compute RoIs with torch.ops.torchvision.roi_align on the GPU, but I get this CUDA error: an illegal memory access was encountered.
I’ve set my device to cuda:1 with torch.cuda.set_device(1).
Some posts say I should set torch.backends.cudnn.benchmark = False, but mine is already set to False.

This is my code snippet.

aligned = []
for coor, feat_map in zip(coordinates, feature_maps):
    aligned.append(roi_align(feat_map, coor, output_size=7))

The error is raised when the loop reaches coordinates[7].

My feature maps are a list of 10 tensors whose shapes vary (e.g. [1, 512, 12, 16], [1, 512, 11, 16], [1, 512, 16, 14]…).
This is the original roi_align code; the error occurs at the return line.

def roi_align(
    input: Tensor,
    boxes: Tensor,
    output_size: BroadcastingList2[int],
    spatial_scale: float = 1.0,
    sampling_ratio: int = -1,
    aligned: bool = False,
) -> Tensor:
    _assert_has_ops()
    check_roi_boxes_shape(boxes)
    rois = boxes
    output_size = _pair(output_size)
    if not isinstance(rois, torch.Tensor):
        rois = convert_boxes_to_roi_format(rois)
    return torch.ops.torchvision.roi_align(input, rois, spatial_scale,
                                           output_size[0], output_size[1],
                                           sampling_ratio, aligned)

I’ve also run the code with CUDA_LAUNCH_BLOCKING=1 and got this:

❯ CUDA_LAUNCH_BLOCKING=1 python usage.py
Use load_from_local loader
The model and loaded state dict do not match exactly

unexpected key in source state_dict: roi_head.bbox_head.fc_cls.weight, roi_head.bbox_head.fc_reg.weight, roi_head.bbox_head.fc_reg.bias, roi_head.bbox_head.shared_fcs.0.weight, roi_head.bbox_head.shared_fcs.0.bias, roi_head.bbox_head.shared_fcs.1.weight, roi_head.bbox_head.shared_fcs.1.bias

/home/sueyeon/Projects/mmdetection/mmdet/datasets/utils.py:68: UserWarning: "ImageToTensor" pipeline is replaced by "DefaultFormatBundle" for batch inference. It is recommended to manually replace it in the test data pipeline in your config file.
  'data pipeline in your config file.', UserWarning)
Traceback (most recent call last):
  File "usage.py", line 284, in <module>
    cropped_gt_support_rois = crop_rois(revised_support_gt_coordinates, unchanged_img_supports)
  File "usage.py", line 174, in crop_rois
    aligned.append(roi_align(feat_map, coor, output_size=7))
  File "/home/sueyeon/anaconda3/envs/ctx/lib/python3.7/site-packages/torchvision/ops/roi_align.py", line 55, in roi_align
    sampling_ratio, aligned)
RuntimeError: CUDA error: an illegal memory access was encountered

To check whether the problem is GPU-specific, I moved all the tensors off the GPU and ran the code on the CPU, and it worked! However, it returned nan and inf values, which is not ideal… I need to train the model on GPUs anyway, so I need to solve this problem. I hope somebody can help me out…
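
Because the CPU run returns nan and inf values, it may be worth inspecting the boxes themselves before calling roi_align. When boxes is a single Tensor[K, 5], roi_align reads the first column as the batch index and the remaining four as (x1, y1, x2, y2), so a batch index outside the batch (here every feature map has batch size 1) or a non-finite coordinate could plausibly lead to an out-of-bounds read. Whether that is the cause here is only an assumption. The sketch below is a plain-Python check; check_rois is a hypothetical helper, not part of torchvision.

```python
import math


def check_rois(rois, batch_size):
    """Return (row_index, reason) pairs for rows that look invalid.

    `rois` is a sequence of [batch_index, x1, y1, x2, y2] rows, for
    example the result of calling .tolist() on a Tensor[K, 5].
    """
    problems = []
    for i, row in enumerate(rois):
        if len(row) != 5:
            problems.append((i, "expected 5 values"))
            continue
        if not all(math.isfinite(v) for v in row):
            problems.append((i, "nan or inf value"))
            continue
        batch_index, x1, y1, x2, y2 = row
        # The batch index must be a whole number inside [0, batch_size).
        if batch_index != int(batch_index) or not 0 <= batch_index < batch_size:
            problems.append((i, "batch index out of range"))
        # Boxes are expected as (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
        if x2 < x1 or y2 < y1:
            problems.append((i, "x2 < x1 or y2 < y1"))
    return problems
```

Inside the loop this could be called as check_rois(coor.tolist(), feat_map.shape[0]), assuming coor is a Tensor[K, 5]; an empty list means none of these particular problems were found.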

ptrblck · September 8, 2021, 5:20pm

Are you seeing the same issue if you are using the default device (GPU0)?
If not, it would sound like a missing device guard, which seems to be used here however.
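
One way to try the default-device question on the same physical GPU is to expose only GPU 1 to the process, so that it becomes cuda:0. A minimal sketch, assuming it runs before torch initializes CUDA and that the torch.cuda.set_device(1) call is removed (only one device is visible afterwards); pin_physical_gpu is a hypothetical helper name.

```python
import os


def pin_physical_gpu(index):
    """Expose a single physical GPU to this process via CUDA_VISIBLE_DEVICES.

    Must be called before CUDA is initialized; the chosen GPU is then
    addressed as cuda:0 inside the process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call this at the very top of the script, before importing torch.
pin_physical_gpu(1)
```

If the error disappears in this setup, that would be consistent with a device-guard problem; if it persists, the inputs are the more likely suspect.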