Help understanding CUDA error: device-side assert triggered

Hello, all,
A CUDA error occurs when I try to seq_range_expand = seq_range_expand.cuda(), the error message says:
/opt/conda/conda-bld/pytorch_1573049306803/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::&lt;unnamed&gt;::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [1,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
(the same assertion is repeated for threads [1,0,0] through [6,0,0])

Sorry if the question is too naive

Try running it with CUDA_LAUNCH_BLOCKING=1 python

The error message probably comes from a different line than the one you linked, but it is showing up there because of the asynchronous nature of CUDA. In particular, it looks like you called multinomial with a negative probability somewhere.

1 Like

Still, the assertion indicates that sampleMultinomialOnce fails the check val >= zero.
So I guess you are trying to sample from a multinomial distribution with negative values, which is not supported.

Thank you! I am looking at other people’s code, and couldn’t find the part with multinomial, do you have any clues?

Thank you! Richard, yeah it is from another function.

You can run this on CPU or on GPU with the env variable CUDA_LAUNCH_BLOCKING=1 to get the exact python stack trace of where this happens.
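For reference, the environment variable can be set for a single run like this (the script name is a placeholder for your own entry point):

```shell
# Forces synchronous CUDA kernel launches so the Python traceback
# points at the op that actually failed. Slower; use for debugging only.
CUDA_LAUNCH_BLOCKING=1 python your_script.py
```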

I get a similar error message when the loss of my PyTorch model gets very big, i.e. goes to infinity. For example, if I decrease the learning rate then I won't get this error.

Here is an example of the errors I am getting:

/opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::&lt;unnamed&gt;::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [0,0,0], thread: [2,0,0] Assertion `val >= zero` failed.

As the error mentions, you try to sample from a multinomial distribution with a negative weight.
You should check your code to see how this can happen.
This most likely happens with a larger learning rate because you sample values based on trained parameters, and a high learning rate leads to these negative values (either due to instability or because you did not run the low-learning-rate version long enough).
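One common safeguard is to produce the sampling weights with a softmax, which guarantees they are non-negative, and to check for NaNs before sampling. A minimal sketch (the sample_action helper and the logits values are hypothetical, not from the code discussed above):

```python
import torch

def sample_action(logits, num_samples=1):
    # softmax produces non-negative weights summing to 1, so the
    # "val >= zero" multinomial assert cannot fire on finite logits
    probs = torch.softmax(logits, dim=-1)
    if torch.isnan(probs).any():
        # NaN logits (e.g. after the loss diverges) would otherwise
        # surface later as an opaque device-side assert
        raise ValueError("NaN in sampling probabilities; "
                         "check for a diverging loss / too-high learning rate")
    return torch.multinomial(probs, num_samples)

logits = torch.tensor([2.0, 0.5, -1.0])
action = sample_action(logits)
```

Catching the bad values at the point where they are produced gives a much clearer error than the CUDA assert.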

2 Likes

Thanks! I use a simple actor-critic to give simple commands, so the trained parameters affect the data I sample.