This error pops up after the 9th iteration of training:
Traceback (most recent call last):
File "train.py", line 75, in <module>
train(args)
File "train.py", line 64, in train
output = CIA_interface.cia_forward(batch ,epoch ,i)
File "/notebooks/E2E/cia_interface.py", line 90, in cia_forward
losses = self.model(example, return_loss = True)
File "/opt/conda/envs/btngan1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/notebooks/cia/det3d/models/detectors/voxelnet.py", line 38, in forward
return self.bbox_head.loss(example, preds)
File "/notebooks/cia/det3d/models/bbox_heads/mg_head_v4_release.py", line 612, in loss
iou_pred_loss = iou_pred_loss.sum() / batch_size
RuntimeError: CUDA error: invalid configuration argument
Could you post the output of python -m torch.utils.collect_env as well as an executable code snippet to reproduce this issue?
(btngan1) root@n1byw4j8oz:/notebooks# python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.6.0
Is debug build: No
CUDA used to build PyTorch: 10.2
OS: Ubuntu 20.04.1 LTS
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CMake version: version 3.19.4
Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Quadro P5000
Nvidia driver version: 450.36.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.6.0
[pip3] torch-scatter==2.0.6
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.7.0
[pip3] vit-pytorch==0.15.2
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.3.0 py38h54f3939_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] pytorch 1.6.0 py3.8_cuda10.2.89_cudnn7.6.5_0 pytorch
[conda] torch-scatter 2.0.6 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.7.0 py38_cu102 pytorch
[conda] vit-pytorch 0.15.2 pypi_0 pypi
The code is very big. I will try to put it in a GitHub repo and send it.
Thanks! Could you update PyTorch in the meantime to the latest nightly release (in a new virtual env, if necessary) and rerun your code?
I will try. The problem is that there are too many dependencies on the PyTorch and CUDA versions, because a lot of libraries have to be set up, such as the IOU3D_Cuda inherited from the PointRCNN model, among other things. Can you tell me what this error means? It is not an intuitive error, and the line that produces it executed normally 8 times before.
The error is raised by an invalid kernel launch config.
In case you are using a custom CUDA extension, you could try to rerun the code via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the kernel launch configs in the failing operation given by the stack trace.
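For context on why this helps: CUDA kernels are launched asynchronously, so the Python stack trace can point at a later operation (such as the sum() call) rather than at the kernel that actually failed. With CUDA_LAUNCH_BLOCKING=1 each launch is synchronous and the error should surface at the failing call. Below is a minimal sketch of setting the variable from a small launcher script instead of the shell. The helper name run_blocking is hypothetical, and train.py stands in for whatever your entry point is.

```python
import os
import subprocess
import sys


def run_blocking(cmd):
    """Run `cmd` in a child process with synchronous CUDA kernel launches.

    Hypothetical helper: it only sets CUDA_LAUNCH_BLOCKING=1 in the child's
    environment, which is equivalent to prefixing the command in the shell.
    """
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Example usage (assumes train.py is the training entry point):
# result = run_blocking([sys.executable, "train.py"])
# print(result.stderr)  # the traceback should now point at the failing launch
```

The variable has to be set before the process initializes CUDA, which is why the sketch passes it to a fresh child process rather than changing os.environ in the middle of a running script.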
Traceback (most recent call last):
File "train.py", line 75, in <module>
train(args)
File "train.py", line 64, in train
output = CIA_interface.cia_forward(batch ,epoch ,i)
File "/notebooks/E2E/cia_interface.py", line 90, in cia_forward
losses = self.model(example, return_loss = True)
File "/opt/conda/envs/btngan1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/notebooks/cia/det3d/models/detectors/voxelnet.py", line 38, in forward
return self.bbox_head.loss(example, preds)
File "/notebooks/cia/det3d/models/bbox_heads/mg_head_v4_release.py", line 612, in loss
iou_pred_loss = iou_pred_loss.sum() / batch_size
RuntimeError: CUDA error: invalid configuration argument
It gives me the same error.
The error is in the sum() function, and yes, I use IOU3D_CUDA from this repo: GitHub - sshaoshuai/PointRCNN: PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud, CVPR 2019.
In that case you could e.g. add debug prints to the CUDA code and check the launch configs, as apparently one kernel call (in the 9th iteration) is using invalid values (I guess the grid or block dimension might be too large).
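To make "invalid values" concrete, here is a rough sketch in plain Python of the check you could apply to whatever grid and block dimensions the debug prints show. The limits are the ones documented for compute capability 3.0 and newer (which should cover a Quadro P5000), but treat them as an assumption and compare against your device properties. The names check_launch_config, div_up and THREADS_PER_BLOCK are hypothetical and not taken from the PointRCNN code.

```python
# Assumed limits for compute capability >= 3.0; verify for your GPU.
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)

THREADS_PER_BLOCK = 512  # hypothetical value, for illustration only


def div_up(n, d):
    """Ceiling division, the usual way a grid size is derived from a count."""
    return (n + d - 1) // d


def check_launch_config(grid, block):
    """Return a list of reasons why (grid, block) would be rejected."""
    problems = []
    for axis, g, g_max in zip("xyz", grid, MAX_GRID_DIM):
        if not 1 <= g <= g_max:
            problems.append(f"grid.{axis}={g} outside [1, {g_max}]")
    for axis, b, b_max in zip("xyz", block, MAX_BLOCK_DIM):
        if not 1 <= b <= b_max:
            problems.append(f"block.{axis}={b} outside [1, {b_max}]")
    threads = block[0] * block[1] * block[2]
    if threads > MAX_THREADS_PER_BLOCK:
        problems.append(f"{threads} threads per block exceeds {MAX_THREADS_PER_BLOCK}")
    return problems


# An empty input gives a grid dimension of 0, which is as invalid as one
# that is too large.
num_boxes = 0
grid = (div_up(num_boxes, THREADS_PER_BLOCK), 1, 1)
print(check_launch_config(grid, (THREADS_PER_BLOCK, 1, 1)))
```

A dimension of zero is worth checking as well as an oversized one: if the grid size is computed from the number of boxes in a batch, a batch with no boxes yields a grid of 0. That would be consistent with a failure that depends on the data and moves when the seed changes, though this is only a guess without seeing the kernel.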
Well, I'm new to using CUDA extensions. Could you show me where to put the debug prints, such as in which file?
Okay, I have now changed the random seed, and the error occurs in the 11th iteration instead.
@ptrblck
You could check the launch config e.g. here and make sure the values are valid.
I'm not familiar with the code base and this was the first .cu file I've found, so the error could of course be raised by another kernel.