THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp

Hello, I am trying to run old models on a new RTX 2080 on Ubuntu 16.04 with NVIDIA driver 410.57.

I’m running a legacy deep learning model on PyTorch 0.4.1. The model depends on RoI Align and NMS ops that were compiled for PyTorch 0.4.1 using ffi instead of cpp extensions, which raises an error on PyTorch >= 1.0. Since PyTorch 0.4.1 only supports CUDA < 10.0, I installed CUDA 9.0 (incl. the 4 patches) with cuDNN 7.5.1.
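For context, these ops were built with the ffi machinery that PyTorch 0.4.x shipped and that was removed in 1.0 (importing torch.utils.ffi on newer versions raises an ImportError telling you to use cpp extensions instead). A minimal sketch of such a build script, with illustrative file names, looks like this:

import torch
from torch.utils.ffi import create_extension  # removed in PyTorch >= 1.0

# Illustrative names; real repos ship their own headers/sources and a
# roi_align_kernel.cu pre-compiled with nvcc into a .o object.
ffi = create_extension(
    '_ext.roi_align',
    headers=['src/roi_align_cuda.h'],
    sources=['src/roi_align_cuda.c'],
    extra_objects=['src/roi_align_kernel.cu.o'],
    with_cuda=torch.cuda.is_available(),
)

if __name__ == '__main__':
    ffi.build()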

I get a THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp error, but it seems to be silent: the model keeps running. However, after a run I have to restart the machine before I can run another model.
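A quick way to check whether the CUDA context is actually usable after that message (a minimal sketch, assuming a single GPU; CUDA_LAUNCH_BLOCKING has to be set before torch initializes CUDA) is:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # report async CUDA errors at the failing call

import torch
print(torch.__version__, torch.version.cuda)

x = torch.randn(64, 64, device='cuda')  # first CUDA op; THCudaCheck fires here if at all
y = x @ x
torch.cuda.synchronize()                 # wait for pending kernels so errors surface
print(y.sum().item())                    # a finite number means the context survived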


THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
/home/ivanwilliam/.virtualenvs/virtual-py3/lib/python3.5/site-packages/torch/nn/functional.py:1961: UserWarning: Default upsampling behavior when mode=trilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode))
tr. batch 1/230 (ep. 1) fw 27.761s / bw 2.157s / total 29.917s || loss: 1.04, class: 0.88, bbox: 0.16
tr. batch 2/230 (ep. 1) fw 0.864s / bw 0.985s / total 1.849s || loss: 0.95, class: 0.72, bbox: 0.23
tr. batch 3/230 (ep. 1) fw 0.860s / bw 0.985s / total 1.845s || loss: 1.23, class: 0.87, bbox: 0.36
tr. batch 4/230 (ep. 1) fw 0.906s / bw 0.990s / total 1.896s || loss: 1.19, class: 0.87, bbox: 0.33
tr. batch 5/230 (ep. 1) fw 0.908s / bw 0.981s / total 1.889s || loss: 0.79, class: 0.63, bbox: 0.16
tr. batch 6/230 (ep. 1) fw 0.920s / bw 0.652s / total 1.573s || loss: 0.87, class: 0.87, bbox: 0.00
tr. batch 7/230 (ep. 1) fw 0.915s / bw 0.983s / total 1.899s || loss: 1.00, class: 0.79, bbox: 0.22
tr. batch 8/230 (ep. 1) fw 0.883s / bw 0.981s / total 1.864s || loss: 0.84, class: 0.79, bbox: 0.06
tr. batch 9/230 (ep. 1) fw 0.852s / bw 0.987s / total 1.839s || loss: 1.33, class: 0.95, bbox: 0.39
tr. batch 10/230 (ep. 1) fw 0.933s / bw 0.990s / total 1.923s || loss: 1.37, class: 0.95, bbox: 0.43
tr. batch 11/230 (ep. 1) fw 0.929s / bw 0.986s / total 1.915s || loss: 1.11, class: 0.87, bbox: 0.24
tr. batch 12/230 (ep. 1) fw 0.927s / bw 0.987s / total 1.915s || loss: 1.00, class: 0.79, bbox: 0.21

Note: I tried to find the /pytorch/aten/src/THC/THCGeneral.cpp file, but it doesn’t exist on my machine; the path comes from the machine the PyTorch binaries were built on.

Does this silent error force the model onto the CPU or interfere with other processes?
Is there any trick to get RTX cards to run PyTorch 0.4.1 models?

Hi, my GPU is a GTX 1650, with PyTorch 0.4.1 and CUDA 9.0. I have exactly the same problem… has anybody solved it?

Turing GPUs are supported with CUDA >= 10, so you would need to either update PyTorch to the latest stable release (1.3.1) with CUDA 10.1, or build PyTorch 0.4.1 from source with CUDA >= 10.0 if you really need the old version.
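You can quickly confirm what your install was built against and what the card reports (assuming device 0):

import torch

print(torch.__version__)                    # PyTorch release
print(torch.version.cuda)                   # CUDA version the binaries were built with
print(torch.cuda.get_device_name(0))        # e.g. a Turing card such as the RTX 2080
print(torch.cuda.get_device_capability(0))  # (7, 5) for Turing; needs CUDA >= 10 binaries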

Yeah, I met the same problem with an RTX 2080 Super.
Ubuntu 18.04, CUDA 10.1, torch 1.3.1, cuDNN 7.6.5.

Could you update to the latest stable PyTorch version and rerun your code, please?
If you still see this error, could you post a code snippet to reproduce this error?
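For reference, something as small as this is usually enough as a repro (a sketch; swap in the op that actually fails for you):

import torch

print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))

# Tiny forward pass; replace with the op that triggers the error for you.
x = torch.randn(4, 3, 16, 16, device='cuda')
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to('cuda')
out = conv(x)
torch.cuda.synchronize()
print(out.shape)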