CUDA threadIdx issue

Hi,

I’m implementing a CUDA extension to be used inside Python code. However, I’m getting a strange error and I’m having trouble debugging it.
The caller of the CUDA kernel uses this code:

dim3 grid_dim(94, 94, 6);
dim3 block_dim(64);
kernel<<<grid_dim, block_dim>>>(params...);

For some reason, I’m getting this error:

/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [45,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered

With that in mind, I have two questions:

  1. If the block dimensions (i.e. how many threads are in a block) are (64, 1, 1), how can I get this “thread: [95,0,0]”? Isn’t 95 out of the range of the block dimensions?
  2. How can I debug a CUDA kernel that is called from Python code? I usually use pdb for debugging Python code, gdb for C++, and cuda-gdb for CUDA code, but I didn’t find a way to debug CUDA code wrapped by a Python layer. I thought about writing a test case just in CUDA/C++ (roughly the sketch after this list), but then I would need to compile it by hand and link ATen and the other libraries myself.
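
(Roughly what I had in mind for that standalone test case; my_op_forward is just a placeholder for my extension’s C++ entry point, and I would still have to link the binary against ATen/libtorch by hand.)

#include <ATen/ATen.h>
#include <iostream>

// Placeholder: the C++ entry point of my extension, declared in its own .cu/.cpp file.
at::Tensor my_op_forward(const at::Tensor& input);

int main() {
  // Build a small CUDA input and call the wrapper directly, so the whole
  // binary can be run under cuda-gdb without going through Python.
  at::Tensor input = at::rand({94, 94, 6}, at::kFloat).cuda();
  at::Tensor out = my_op_forward(input);
  std::cout << out.sum() << std::endl;
  return 0;
}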

I welcome any suggestions that you may have.

Thank you,

Hi,

I think the error you see does not come from your kernel but from an indexing operation that gets an invalid index value:

Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
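
For example, something like this (a hypothetical snippet, not your actual code) would trigger exactly that assert:

at::Tensor src = at::arange(10, at::kFloat).cuda();
at::Tensor idx = at::full({1}, 12, at::kLong).cuda(); // 12 is out of range for src.size(0) == 10
at::Tensor out = src.index({idx});                    // device-side "index out of bounds" assert fires here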

You can run your script with CUDA_LAUNCH_BLOCKING=1 (for example, CUDA_LAUNCH_BLOCKING=1 python your_script.py) to get a proper Python stack trace.

Hi @albanD,

Thanks for the suggestion. However, I already tried that (the stack trace stays unchanged). I believe the error comes from the kernel, since it is triggered on the device side.

You can find the assert that is triggered here, so you can check where this function is called.

Hi @albanD,

I found the bug. I didn’t realize that the ATen functions I was calling launch their own kernels with their own thread ranges inside those functions. Thanks for your help.

By the way, I see a pattern like this in a lot of code:
at::Tensor t = at::empty({5, 10}, at::kInt); // create an int tensor with an at:: factory function
int* t2 = t.data<int>();                     // get the raw data pointer
t2[row_idx*10 + col_idx] = value;            // assign a value through the pointer

Do you know why people write it like this instead of just using the ATen tensor directly, like this:

at::Tensor t = at::empty({5, 10}, at::kInt); // create an int tensor with an at:: factory function
t[row_idx][col_idx] = value;

I think it is something obvious, but maybe I’m just too tired to see it. :slight_smile:

Thank you again!

ATen Tensors are fairly recent compared to the older parts of the codebase, so some of that code was written with raw pointers before ATen Tensors existed and was only changed to unpack the Tensor.
I also assume performance is a reason: the first version simply computes an offset in memory, while the second actually indexes the tensor twice (once per operator[]).
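
To make the difference concrete, here is a minimal sketch of the access styles (my own illustration, assuming a CPU int tensor; the function names are just placeholders):

#include <ATen/ATen.h>

// Raw pointer: one offset computation, one memory write.
void write_via_pointer(at::Tensor t, int value) {
  int* data = t.data<int>();        // data_ptr<int>() in newer versions
  data[2 * t.size(1) + 3] = value;  // element (2, 3)
}

// operator[]: each [] dispatches a select() and builds an intermediate
// Tensor view before the final write, so it is much slower in tight loops.
void write_via_indexing(at::Tensor t, int value) {
  t[2][3] = value;
}

// accessor<>() gives pointer-like speed with multi-dimensional indexing
// and a dtype/dimension check up front (CPU tensors only).
void write_via_accessor(at::Tensor t, int value) {
  auto acc = t.accessor<int, 2>();
  acc[2][3] = value;
}

For tensors passed into CUDA kernels, the analogous tool is packed_accessor.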
