CUDA error, happens in the 9th iteration

Abdelrahman_Akram · May 26, 2021, 8:42am

This error pops up after the 9th iteration of training:


Traceback (most recent call last):
  File "train.py", line 75, in <module>
    train(args)
  File "train.py", line 64, in train
    output =  CIA_interface.cia_forward(batch ,epoch ,i)
  File "/notebooks/E2E/cia_interface.py", line 90, in cia_forward
    losses =  self.model(example, return_loss = True)
  File "/opt/conda/envs/btngan1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/notebooks/cia/det3d/models/detectors/voxelnet.py", line 38, in forward
    return self.bbox_head.loss(example, preds)
  File "/notebooks/cia/det3d/models/bbox_heads/mg_head_v4_release.py", line 612, in loss
    iou_pred_loss = iou_pred_loss.sum() / batch_size
RuntimeError: CUDA error: invalid configuration argument

ptrblck · May 26, 2021, 8:56am

Could you post the output of python -m torch.utils.collect_env as well as an executable code snippet to reproduce this issue?

Abdelrahman_Akram · May 26, 2021, 9:02am

(btngan1) root@n1byw4j8oz:/notebooks# python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.6.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 20.04.1 LTS
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CMake version: version 3.19.4

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Quadro P5000
Nvidia driver version: 450.36.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.6.0
[pip3] torch-scatter==2.0.6
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.7.0
[pip3] vit-pytorch==0.15.2
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.3.0            py38h54f3939_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.19.2           py38h54aff64_0  
[conda] numpy-base                1.19.2           py38hfa32c7d_0  
[conda] pytorch                   1.6.0           py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
[conda] torch-scatter             2.0.6                    pypi_0    pypi
[conda] torchsummary              1.5.1                    pypi_0    pypi
[conda] torchvision               0.7.0                py38_cu102    pytorch
[conda] vit-pytorch               0.15.2                   pypi_0    pypi

Abdelrahman_Akram · May 26, 2021, 9:03am

The code is very large. I will try to put it in a GitHub repo and send it.

ptrblck · May 26, 2021, 9:04am

Thanks! Could you update PyTorch in the meantime to the latest nightly release (in a new virtual env, if necessary) and rerun your code?

Abdelrahman_Akram · May 26, 2021, 9:08am

I will try. The problem is that there are too many dependencies on the PyTorch and CUDA versions, because a lot of libraries have to be set up, such as IOU3D_Cuda inherited from the PointRCNN model, among other things. Can you tell me what this error means? It is not intuitive, and the line that raises it executed normally 8 times before.

ptrblck · May 26, 2021, 9:22am

The error is raised by an invalid kernel launch config.
In case you are using a custom CUDA extension, you could try to rerun the code via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the kernel launch configs in the failing operation given by the stack trace.
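
If changing the command line is inconvenient, the same variable can be set from a small launcher. This is only a minimal sketch: `run_blocking` is a hypothetical helper name, and `train.py` is taken from the traceback above. The variable is placed in the child process's environment so that it is already present before CUDA is initialized there.

```python
import os
import subprocess
import sys


def run_blocking(script_args):
    """Run a Python script with synchronous CUDA kernel launches.

    `script_args` is the list that would follow `python` on the command
    line, e.g. ["train.py", "--some-flag"] (the flag is a placeholder).
    """
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run([sys.executable, *script_args], env=env)


# Example call (commented out so the sketch has no side effects):
# run_blocking(["train.py"])
```

With blocking launches, the Python line shown in the stack trace should correspond to the operation that actually failed, rather than to a later operation that merely reported the error.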

Abdelrahman_Akram · May 26, 2021, 9:25am

Traceback (most recent call last):
  File "train.py", line 75, in <module>
    train(args)
  File "train.py", line 64, in train
    output =  CIA_interface.cia_forward(batch ,epoch ,i)
  File "/notebooks/E2E/cia_interface.py", line 90, in cia_forward
    losses =  self.model(example, return_loss = True)
  File "/opt/conda/envs/btngan1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/notebooks/cia/det3d/models/detectors/voxelnet.py", line 38, in forward
    return self.bbox_head.loss(example, preds)
  File "/notebooks/cia/det3d/models/bbox_heads/mg_head_v4_release.py", line 612, in loss
    iou_pred_loss = iou_pred_loss.sum() / batch_size
RuntimeError: CUDA error: invalid configuration argument

It gives me the same error.
The error is in the sum() function, and yes, I use IOU3D_CUDA from this repo: GitHub - sshaoshuai/PointRCNN: PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud, CVPR 2019.

ptrblck · May 26, 2021, 9:52am

In that case you could e.g. add debug prints to the CUDA code and check the launch configs, as apparently one kernel call (in the 9th iteration) is using invalid values (I guess the grid or block dimension might be too large).
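
To make "invalid values" concrete, here is a rough Python sketch of the checks a launch config has to pass. The limits are the documented ones for GPUs of compute capability 3.0 and later (the Quadro P5000 falls in that range). The function names and the ceiling-division helper are illustrative assumptions, not code from the extension, and the thread does not establish which kernel is actually failing. Note that a dimension of 0 is also invalid, which can happen when the grid size is derived from an empty input.

```python
# Documented limits for compute capability >= 3.0.
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)


def divup(n, d):
    """Ceiling division, a common way to derive a grid size from an input size."""
    return (n + d - 1) // d


def launch_config_problems(grid, block):
    """Return a list of reasons why (grid, block) would be rejected."""
    problems = []
    for name, dims, limits in (("grid", grid, MAX_GRID_DIM),
                               ("block", block, MAX_BLOCK_DIM)):
        for axis, value, limit in zip("xyz", dims, limits):
            if value < 1:
                problems.append(f"{name}.{axis} = {value} is below 1")
            elif value > limit:
                problems.append(f"{name}.{axis} = {value} exceeds {limit}")
    threads = block[0] * block[1] * block[2]
    if threads > MAX_THREADS_PER_BLOCK:
        problems.append(f"{threads} threads per block exceeds {MAX_THREADS_PER_BLOCK}")
    return problems


# Hypothetical case: zero input elements lead to a grid of size 0.
print(launch_config_problems((divup(0, 512), 1, 1), (512, 1, 1)))
```

Printing the actual grid and block values right before each kernel launch in the .cu files and comparing them against these limits would show whether one of them goes out of range in the failing iteration.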

Abdelrahman_Akram · May 26, 2021, 9:57am

Well, I'm new to CUDA extensions. Could you show me where to put the debug prints, for example in which file?

Abdelrahman_Akram · May 26, 2021, 1:39pm

Okay, I changed the random seed, and now the error occurs in the 11th iteration.
@ptrblck
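
Since the failing iteration moves with the seed, the error may depend on the contents of a particular batch rather than on the iteration count. This is an assumption, not something confirmed in the thread. A minimal sketch for recording what the failing batch looks like follows; `describe_batch` and `run_step` are hypothetical helpers, and the batch is assumed to be a dict whose values are tensors, arrays, or lists.

```python
def describe_batch(batch):
    """Summarize a dict-like batch as {key: shape, length, or type name}."""
    summary = {}
    for key, value in batch.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            summary[key] = tuple(shape)          # tensors and arrays
        elif hasattr(value, "__len__"):
            summary[key] = len(value)            # lists, strings, ...
        else:
            summary[key] = type(value).__name__  # scalars and other objects
    return summary


def run_step(step_fn, batch, iteration):
    """Call step_fn(batch); on RuntimeError, print the batch summary and re-raise."""
    try:
        return step_fn(batch)
    except RuntimeError:
        print(f"iteration {iteration} failed; batch summary: {describe_batch(batch)}")
        raise
```

In the training loop from the traceback, the call to `CIA_interface.cia_forward` could be wrapped this way. Combined with CUDA_LAUNCH_BLOCKING=1, the printed shapes belong to the batch that triggered the error, so an empty or unusually large entry would stand out.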

ptrblck · May 26, 2021, 6:36pm

You could check the launch config e.g. here and make sure the values are valid.
I’m not familiar with the code base and this was the first .cu file I’ve found, so the error could of course be raised by another kernel.