THCudaCheck FAIL invalid device function

Hi,

I am trying to use the fregu856/ebms_regression repo (the official implementation of "Energy-Based Models for Deep Probabilistic Regression" (ECCV 2020) and "How to Train Your Energy-Based Model for Regression" (BMVC 2020)) with its Dockerfile. I think the setup itself is fine: the Ubuntu 16.04 Docker image runs successfully. I know this code is very old, but is there any chance to solve this?

It uses maskrcnn_benchmark. I installed a 2020 version of apex and the other dependencies. It loads the data and the model, but training doesn't start.

The GPU has compute capability 8.6.

Regards.

2023-01-31 14:55:41,056 maskrcnn_benchmark INFO: Saving config into: /root/ebms_regression/detection/checkpoints/nce+/config.yml
2023-01-31 15:13:17,753 maskrcnn_benchmark INFO: Using 1 GPUs
2023-01-31 15:13:17,753 maskrcnn_benchmark INFO: Namespace(config_file='configs/nce+_train.yaml', distributed=False, local_rank=0, opts=[], skip_test=False)
2023-01-31 15:13:17,753 maskrcnn_benchmark INFO: Collecting env info (might take some time)
2023-01-31 15:13:18,383 maskrcnn_benchmark INFO:
PyTorch version: 1.0.0.dev20190401
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.14.20190401-g3e12

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 512.78
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch-nightly==1.0.0.dev20190401
[pip3] torchvision-nightly==0.2.1

Selected optimization level O0: Pure FP32 training.

Defaults for this optimization level are:
enabled : True
opt_level : O0
cast_model_type : torch.float32
patch_torch_functions : False
keep_batchnorm_fp32 : None
master_weights : False
loss_scale : 1.0
Processing user overrides (additional kwargs that are not None)…
After processing overrides, optimization options are:
enabled : True
opt_level : O0
cast_model_type : torch.float32
patch_torch_functions : False
keep_batchnorm_fp32 : None
master_weights : False
loss_scale : 1.0
2023-02-05 11:30:55,867 maskrcnn_benchmark.utils.checkpoint INFO: Loading checkpoint from /root/ebms_regression/detection/pretrained_models/e2e_faster_R-50-FPN_1x.pkl
2023-02-05 11:30:56,796 maskrcnn_benchmark.utils.c2_model_loading INFO: Remapping C2 weights
2023-02-05 11:30:56,796 maskrcnn_benchmark.utils.c2_model_loading INFO: C2 name: bbox_pred_b mapped name: bbox_pred.bias
2023-02-05 11:30:56,797 maskrcnn_benchmark.utils.c2_model_loading INFO: C2 name: bbox_pred_w mapped name: bbox_pred.weight

creating index…
index created!
2023-02-05 10:49:27,779 maskrcnn_benchmark.utils.miscellaneous INFO: Saving labels mapping into /root/ebms_regression/detection/checkpoints/nce+/labels.json
2023-02-05 10:49:27,781 maskrcnn_benchmark.trainer INFO: Start training
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=383 error=8 : invalid device function

I don’t know which PyTorch version you are using and thus where exactly the code is failing, since THCGeneral.cpp was removed a while ago.
Could you execute the code with the latest PyTorch release and see if you would still be running into this issue?
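For reference, the log above reports a PyTorch build against CUDA 9.0.176 and a GPU with compute capability 8.6. CUDA 9.0 predates that architecture, so the build most likely ships no kernels the GPU can run, which would be consistent with "invalid device function". Here is a rough sketch for checking this on any build. `build_supports` and `describe_local_build` are hypothetical helper names, not PyTorch APIs, and `torch.cuda.get_arch_list()` is assumed to exist only in newer releases (a 2019 nightly may not have it), so the sketch falls back gracefully.

```python
def build_supports(capability, arch_list):
    """Return True if a build with the given arch list can run on `capability`.

    `capability` is a (major, minor) tuple, e.g. (8, 6).
    `arch_list` holds strings such as "sm_70" (compiled binary) or
    "compute_70" (PTX that the driver can JIT-compile).
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if len(digits) < 2 or not digits.isdigit():
            continue  # skip entries this sketch does not understand
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        # Binaries run on the same major architecture with an equal or higher minor.
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        # PTX can be JIT-compiled for any equal or newer architecture.
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


def describe_local_build():
    """Summarize the installed PyTorch build and GPU 0 (requires PyTorch)."""
    import torch  # imported lazily so build_supports works without PyTorch

    lines = [f"PyTorch {torch.__version__}, built with CUDA {torch.version.cuda}"]
    if not torch.cuda.is_available():
        lines.append("CUDA is not available")
        return "\n".join(lines)
    capability = torch.cuda.get_device_capability(0)
    lines.append(f"GPU 0 compute capability: {capability}")
    # Assumption: get_arch_list is missing on old builds, hence the getattr.
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    if get_arch_list is None:
        lines.append("This build cannot report its compiled architectures")
    else:
        archs = get_arch_list()
        lines.append(f"Compiled for: {archs}")
        lines.append(f"Supports this GPU: {build_supports(capability, archs)}")
    return "\n".join(lines)
```

Running `print(describe_local_build())` inside the container should show whether the installed binary targets the GPU. If it does not, the usual fix is a PyTorch release built with a CUDA version that supports the device, which is why trying the latest release is the first thing to check.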