PyTorch 1.8.0 fasterrcnn_resnet50_fpn error

My environment is:

  • OS: Ubuntu 18.04
  • GPU: RTX 3090
  • CUDA: 11.2
  • PyTorch: 1.8.0 stable, built with CUDA 11.1

and the code is:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

box_model = fasterrcnn_resnet50_fpn(pretrained=True, progress=False).cuda()
xs = torch.rand(2, 3, 1080, 1920, dtype=torch.float32).cuda()
ys = [
  {
    "labels": torch.tensor([1], dtype=torch.int64).cuda(),
    "boxes": torch.tensor([[956.0000, 316.3117, 1134.0000, 838.8275]], 
                          dtype=torch.float32).cuda(),
  },
  {
    "labels": torch.tensor([1], dtype=torch.int64).cuda(),
    "boxes": torch.tensor([[956.0000, 316.3117, 1134.0000, 838.8275]], 
                          dtype=torch.float32).cuda(),
  },
]

box_model(xs, ys)

It raises an error like this:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-7f582a050256> in <module>
----> 1 box_model(xs, ys)

~/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/anaconda3/envs/torch/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
     95         if isinstance(features, torch.Tensor):
     96             features = OrderedDict([('0', features)])
---> 97         proposals, proposal_losses = self.rpn(images, features, targets)
     98         detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
     99         detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

~/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/anaconda3/envs/torch/lib/python3.7/site-packages/torchvision/models/detection/rpn.py in forward(self, images, features, targets)
    363             regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
    364             loss_objectness, loss_rpn_box_reg = self.compute_loss(
--> 365                 objectness, pred_bbox_deltas, labels, regression_targets)
    366             losses = {
    367                 "loss_objectness": loss_objectness,

~/anaconda3/envs/torch/lib/python3.7/site-packages/torchvision/models/detection/rpn.py in compute_loss(self, objectness, pred_bbox_deltas, labels, regression_targets)
    294         """
    295 
--> 296         sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
    297         sampled_pos_inds = torch.where(torch.cat(sampled_pos_inds, dim=0))[0]
    298         sampled_neg_inds = torch.where(torch.cat(sampled_neg_inds, dim=0))[0]

~/anaconda3/envs/torch/lib/python3.7/site-packages/torchvision/models/detection/_utils.py in __call__(self, matched_idxs)
     55             # randomly select positive and negative examples
     56             perm1 = torch.randperm(positive.numel(), device=positive.device)[:num_pos]
---> 57             perm2 = torch.randperm(negative.numel(), device=negative.device)[:num_neg]
     58 
     59             pos_idx_per_image = positive[perm1]

RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal

When I run this code on the CPU, it works fine.
After that, I reinstalled PyTorch 1.7.1 stable with CUDA 11.0, and it works fine too.


cc @ptrblck do you know where this could be coming from?

Haven’t seen this error yet, but let me try to reproduce it on a 3090.

EDIT: I was able to reproduce it with the 1.8.0+CUDA11.1 conda binaries and will debug it further.
It’s not failing in a source build, so my first guess is to look into CUB/Thrust.
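
In the meantime, a quick environment printout like the following (just a sketch, nothing torchvision-specific) makes it easier to compare the failing and working setups:

import torch
import torchvision

# Print the exact binary/toolkit combination in use.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA used to build the binaries:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))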


Did you manage to solve the problem? I’m facing the same issue now.

~/torch_env/lib/python3.8/site-packages/torchvision/models/detection/rpn.py in forward(self, images, features, targets)
    362             labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
    363             regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
--> 364             loss_objectness, loss_rpn_box_reg = self.compute_loss(
    365                 objectness, pred_bbox_deltas, labels, regression_targets)
    366             losses = {

~/torch_env/lib/python3.8/site-packages/torchvision/models/detection/rpn.py in compute_loss(self, objectness, pred_bbox_deltas, labels, regression_targets)
    294         """
    295 
--> 296         sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
    297         sampled_pos_inds = torch.where(torch.cat(sampled_pos_inds, dim=0))[0]
    298         sampled_neg_inds = torch.where(torch.cat(sampled_neg_inds, dim=0))[0]

~/torch_env/lib/python3.8/site-packages/torchvision/models/detection/_utils.py in __call__(self, matched_idxs)
     55             # randomly select positive and negative examples
     56             perm1 = torch.randperm(positive.numel(), device=positive.device)[:num_pos]
---> 57             perm2 = torch.randperm(negative.numel(), device=negative.device)[:num_neg]
     58 
     59             pos_idx_per_image = positive[perm1]

RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal

OS: Ubuntu 20.04
GPU: RTX 3080
package: pytorch-1.8.0 with CUDA 11.1

I’m just using PyTorch 1.7.1 for now.

Thank you

I got the same issue as @Kitsunetic. What should I do?

It should be fixed already in the nightly conda binary and pip wheel.
Could you update and check it, please?

CC @Kitsunetic @namirinz

I have the same issue with

print(torch.__version__) 
1.8.0+cu111

I have an RTX 3090, and torch 1.8 with CUDA 11.1 is the only combination compatible with Detectron2. Any idea when it will be fixed, @ptrblck?
Thank you

I’m having the same issue with pytorch 1.8.1 and cuda 11.1.
The error is different though:

~/anaconda3/envs/pytorch-1.8.1/lib/python3.8/site-packages/torchvision/models/detection/_utils.py in __call__(self, matched_idxs)
     43         neg_idx = []
     44         for matched_idxs_per_image in matched_idxs:
---> 45             positive = torch.where(matched_idxs_per_image >= 1)[0]
     46             negative = torch.where(matched_idxs_per_image == 0)[0]
     47 

RuntimeError: CUDA error: device-side assert triggered

It works with:

  • pytorch 1.8.1 + cuda 10.2
  • pytorch 1.7.1 + cuda 11.0

Seems like cuda 11.1 is the problem here.
Sorry I couldn’t help more.


The radix_sort issue is already fixed in the nightly release and in PyTorch 1.8.1, so you would have to update to one of these versions.

@MCvin your error seems to be different.
Could you rerun the code via CUDA_LAUNCH_BLOCKING=1 python setup.py args and post the complete stack trace (or create a new topic with your error and this information)?
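
If it’s easier, e.g. from a notebook, the env variable can also be set inside the script (a minimal sketch; it has to happen before the first CUDA call creates a context, otherwise it may be ignored):

import os

# Set this before importing torch / before any CUDA work, otherwise it may be ignored.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch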

After updating to 1.8.1 I’m not getting the radix_sort error anymore, but I’m getting the same error as @MCvin. I’m using the stable version with CUDA 11.1 on Ubuntu 18.04. Everything works when run on CPU.

  File "~/anaconda3/envs/neo/lib/python3.9/site-packages/pl_bolts/models/detection/faster_rcnn/faster_rcnn_module.py", line 112, in training_step
    loss_dict = self.model(images, targets)
  File "~/anaconda3/envs/neo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "~/anaconda3/envs/neo/lib/python3.9/site-packages/torchvision/models/detection/generalized_rcnn.py", line 97, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "~/anaconda3/envs/neo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "~/anaconda3/envs/neo/lib/python3.9/site-packages/torchvision/models/detection/rpn.py", line 364, in forward
    loss_objectness, loss_rpn_box_reg = self.compute_loss(
  File "~/anaconda3/envs/neo/lib/python3.9/site-packages/torchvision/models/detection/rpn.py", line 296, in compute_loss
    sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
  File "~/anaconda3/envs/neo/lib/python3.9/site-packages/torchvision/models/detection/_utils.py", line 46, in __call__
    positive = torch.where(matched_idxs_per_image >= 1)[0]
RuntimeError: CUDA error: device-side assert triggered

The error follows a long series of messages like:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

I guess the problem is with the torch.where operation in torchvision/models/detection/_utils.py. I’ve been able to run my program by moving the variable matched_idxs_per_image to the CPU in the __call__ function of the BalancedPositiveNegativeSampler class, like this:

for matched_idxs_per_image in matched_idxs:
    # workaround: do the torch.where on the CPU to avoid the device-side assert
    matched_idxs_per_image = matched_idxs_per_image.cpu()
    positive = torch.where(matched_idxs_per_image >= 1)[0]
    negative = torch.where(matched_idxs_per_image == 0)[0]

from line 44.

Hope that helps.
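
If you’d rather not edit the installed torchvision sources, the same idea can probably be applied as a monkey-patch before building the model (an untested sketch of the equivalent change):

from torchvision.models.detection import _utils as det_utils

_orig_call = det_utils.BalancedPositiveNegativeSampler.__call__

def _sample_on_cpu(self, matched_idxs):
    # Route the sampler's inputs through the CPU, mirroring the edit above.
    return _orig_call(self, [m.cpu() for m in matched_idxs])

det_utils.BalancedPositiveNegativeSampler.__call__ = _sample_on_cpu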

detectron2 is also seeing an increasing number of CUDA error reports on CUDA>=11.1 + pytorch 1.8.x + RTX30xx: RuntimeError: CUDA error: device-side assert triggered · Issue #2837 · facebookresearch/detectron2 · GitHub

Root cause seems to be still randperm: CUDA error: device-side assert triggered(torch1.8.1+cuda11.1) · Issue #55027 · pytorch/pytorch · GitHub

I’m facing the same issue in MinkowskiEngine with PyTorch 1.8.X + CUDA 11.X: Cuda 11.1 - Coordinate manager · Issue #330 · NVIDIA/MinkowskiEngine (github.com)

Same here, seems to be related to randperm()

Reproducible code:

>>> import torch
>>> device = torch.device("cuda:0")
>>> torch.randperm(29999, device=device)
tensor([13324, 19251, 23333,  ..., 18540, 14502, 26766], device='cuda:0')
>>> torch.randperm(30000, device=device)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal

pytorch version: 1.8.0+cu111
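
For anyone checking their own install, a small self-contained test along these lines (just a sketch) should either raise the same error or confirm the fix, since the failure only shows up above a certain size:

import torch

# 29999 worked but 30000 failed above, so test a size at/above that threshold.
n = 30000
perm = torch.randperm(n, device="cuda:0")
# A valid permutation of 0..n-1 sorts back to arange(n).
assert torch.equal(perm.sort().values, torch.arange(n, device=perm.device))
print(torch.__version__, torch.version.cuda, "randperm OK")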

Could you update PyTorch to 1.8.1 or the nightly as described in my previous post, please?

I tested 1.8.1 and the nightly with the same environment.
On 1.8.1 the error still happened, but the nightly works fine.
Thank you

The same problem happened when installing the nightly version via the pip wheel (1.9.0.dev20210428+cu111, Python 3.8.8):

  File "/opt/conda/envs/fpn/lib/python3.8/site-packages/torchvision/models/detection/rpn.py", line 363, in forward
    loss_objectness, loss_rpn_box_reg = self.compute_loss(
  File "/opt/conda/envs/fpn/lib/python3.8/site-packages/torchvision/models/detection/rpn.py", line 295, in compute_loss
    sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
  File "/opt/conda/envs/fpn/lib/python3.8/site-packages/torchvision/models/detection/_utils.py", line 45, in __call__
    positive = torch.where(matched_idxs_per_image >= 1)[0]
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:97: operator(): block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

With the same code, the problem was solved by PyTorch 1.7.1 + CUDA 11.0, so I think this may be a CUDA 11.1 problem.

Did you check the indices which create the error?
If so, what are the min and max values of the indices, and what is the shape of the indexed tensor?
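
Something like this (a rough sketch, assuming the variable names from _utils.py; it’s easiest to collect the values in a CPU run, since after a device-side assert the CUDA tensors usually can’t be read anymore) would print that information:

# Hypothetical helper: call it on matched_idxs_per_image from _utils.py
# (e.g. in a CPU run) to get the numbers asked for above.
def describe_indices(idx):
    print("shape:", tuple(idx.shape))
    print("dtype/device:", idx.dtype, idx.device)
    print("min/max:", idx.min().item(), idx.max().item())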

I had the same error message as @MCvin with PyTorch 1.8.1 while running Faster R-CNN code. I tried a few things to reproduce it, and it seems to work fine for small values of n. I tried this while running the FRCNN code through VS Code debugging, yet I was not able to reproduce it standalone the way @oo_o did.

[screenshot: pytorch_issue]

It was fixed by updating to the nightly build ‘1.9.0.dev20210502’.

OS: Ubuntu 20.04
GPU: RTX 3090
Pytorch 1.8.1 / Cuda 11.1