Encountering error: Unrecognized tensor type ID: AutogradCUDA

krishna_kishore · November 11, 2020, 9:24am

Traceback (most recent call last):
  File "refcoco/train_end2end.py", line 60, in <module>
    main()
  File "refcoco/train_end2end.py", line 54, in main
    rank, model = train_net(args, config)
  File "/content/gdrive/My Drive/DDP/VL-BERT/refcoco/…/refcoco/function/train.py", line 323, in train_net
    gradient_accumulate_steps=config.TRAIN.GRAD_ACCUMULATE_STEPS)
  File "/content/gdrive/My Drive/DDP/VL-BERT/refcoco/…/common/trainer.py", line 115, in train
    outputs, loss = net(*batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/gdrive/My Drive/DDP/VL-BERT/refcoco/…/common/module.py", line 22, in forward
    return self.train_forward(*inputs, **kwargs)
  File "/content/gdrive/My Drive/DDP/VL-BERT/refcoco/…/refcoco/modules/resnet_vlbert_for_refcoco.py", line 96, in train_forward
    segms=None)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/gdrive/My Drive/DDP/VL-BERT/refcoco/…/common/fast_rcnn.py", line 149, in forward
    roi_align_res = self.roi_align(img_feats['body4'], rois).type(images.dtype)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/gdrive/My Drive/DDP/VL-BERT/refcoco/…/common/lib/roi_pooling/roi_align.py", line 69, in forward
    input.float(), rois.float(), self.output_size, self.spatial_scale, self.sampling_ratio
  File "/content/gdrive/My Drive/DDP/VL-BERT/refcoco/…/common/lib/roi_pooling/roi_align.py", line 20, in forward
    input, rois, spatial_scale, output_size[0], output_size[1], sampling_ratio
RuntimeError: Unrecognized tensor type ID: AutogradCUDA

I am running on Google Colab with PyTorch 1.7.0, torchvision 0.8.1, and CUDA 10.1. The same error also occurs with CUDA 9.2.

ptrblck · November 11, 2020, 10:33am

Based on the stack trace it seems you might be using a custom roi_align method, which is raising this issue. If so, could you check which line of code exactly is raising the error?
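
One way to pin down the exact line is to catch the exception and read the last entry of its traceback. The snippet below is only a minimal sketch using the standard `traceback` module; `innermost_frame` and `failing_op` are hypothetical names, and `failing_op` merely stands in for the real call into the custom op. For a compiled extension, the innermost Python frame is typically the call that enters the extension, since the C++ frames are not part of the Python traceback.

```python
import traceback


def innermost_frame(exc):
    """Return (filename, line number, function name, source line) for the
    deepest Python frame recorded in exc's traceback, or None if there is none."""
    frames = traceback.extract_tb(exc.__traceback__)
    if not frames:
        return None
    last = frames[-1]  # the last entry is the frame where the exception was raised
    return last.filename, last.lineno, last.name, last.line


def failing_op():
    # Hypothetical stand-in for the call into the custom roi_align extension.
    raise RuntimeError("Unrecognized tensor type ID: AutogradCUDA")


try:
    failing_op()
except RuntimeError as exc:
    print(innermost_frame(exc))
```

In the real training script, the `try` block would wrap the `net(*batch)` call instead of `failing_op()`.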

HeyangQin · November 14, 2020, 2:34am

I got the same error after upgrading PyTorch. The problem seems to be caused by apex having been compiled against the old PyTorch version. After reinstalling apex from source, the problem went away.

chandrachud · November 19, 2020, 3:36pm

Hi, I have the same error. I'm using PyTorch 1.7 on CUDA 11.0, installed with:

pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

The error comes from line 138 of this script: https://github.com/MhLiao/DB/blob/master/assets/ops/dcn/functions/deform_conv.py
which calls the CUDA function in the compiled deform_conv_cuda.

I had to make some changes to the .cpp files back when upgrading to PyTorch 1.6, but had no problems after that. This is totally new.

Not sure if apex is being used?

kroq-gar78 · December 1, 2020, 6:09pm

I also had this error after upgrading from PyTorch 1.6 to 1.7 (except with AutogradCPU). The codebase I’m working with has a C++ extension, and recompiling it solved the problem.
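
Several replies here point to the same cause: an extension compiled against one PyTorch release and loaded under another. A small guard can make that mismatch visible before the op is called. This is only a sketch under stated assumptions: `release_prefix` and `needs_rebuild` are hypothetical helpers, the "built against" version is assumed to have been saved by your own build script (PyTorch does not record it for you as far as this sketch assumes), and a change in major.minor is treated as the signal to rebuild.

```python
import re


def release_prefix(version):
    """Return (major, minor) from a version string such as '1.7.0+cu110'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_rebuild(built_against, running):
    """True when the extension was built against a different major.minor release."""
    return release_prefix(built_against) != release_prefix(running)


# Example with the versions mentioned in this thread.
print(needs_rebuild("1.6.0", "1.7.0+cu110"))
```

In a real project, `running` would be `torch.__version__`, and a `True` result would mean rebuilding the extension before importing it.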

Bill_Li · January 11, 2021, 9:35pm

Can you share what changes you made (did you make them in the /site-package/ path)? Thanks.

chandrachud · January 12, 2021, 9:46am

Hey, we tried using PyTorch 1.8 (nightly build), and that solved the issue.

massyzs · March 12, 2022, 10:53am

You saved my life. What you said really solved a problem that had been confusing me for a whole week!