I have a CUDA kernel taking the argument:
torch::PackedTensorAccessor32<scalar_t, 4, torch::RestrictPtrTraits> input
How do I perform an atomic add on the elements of input?
input[n][c][y][x] += (scalar_t) 1 works, but
atomicAdd(&input[n][c][y][x], 1) does not: it adds 256 instead of 1 at the specified position, so I assume there is a problem with type conversions.
How can I do atomicAdd properly?
cc @ptrblck, who is more familiar with CUDA primitives.
Could you pass the value as scalar_t, or post a small code snippet to reproduce this issue, please?
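For reference, a minimal sketch of what the cast-to-scalar_t version could look like; the kernel name and the indexing arguments here are illustrative, not from the original post:

```cpp
template <typename scalar_t>
__global__ void increment_kernel(
    torch::PackedTensorAccessor32<scalar_t, 4, torch::RestrictPtrTraits> input,
    int n, int c, int y, int x) {
  // Pass the addend as scalar_t so the intended atomicAdd overload is chosen;
  // a bare int literal can otherwise be matched against a different overload.
  atomicAdd(&input[n][c][y][x], static_cast<scalar_t>(1));
}
```

PyTorch also ships a gpuAtomicAdd helper (in ATen/cuda/Atomic.cuh) that covers dtypes such as at::Half that plain atomicAdd does not support on all architectures.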
It works now. There was another mistake on my side.
I ran into the same problem. The bug is very likely that the input tensor is on the CPU rather than the GPU.
In my case, I defined my output tensor using torch::zeros() rather than torch::zeros_like(), so my output tensor was on the CPU and was not modified by the GPU kernel.
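To illustrate the difference (the shape below is made up): torch::zeros() defaults to a CPU tensor unless you pass device options, while torch::zeros_like() inherits the reference tensor's device and dtype:

```cpp
// input is assumed to be a CUDA tensor; {2, 3, 4, 5} is just an example shape.
auto out_cpu  = torch::zeros({2, 3, 4, 5});                  // CPU by default -> GPU kernel writes are invalid
auto out_same = torch::zeros({2, 3, 4, 5}, input.options()); // same device and dtype as input
auto out_like = torch::zeros_like(input);                    // also inherits device and dtype
```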
In fact, an illegal memory access exception is thrown, but we can't see it unless we explicitly synchronize after the kernel launch. Adding synchronization code after the kernel launch was very useful for my debugging:
```cpp
cudaError_t cudaerr = cudaDeviceSynchronize();
if (cudaerr != cudaSuccess)
    printf("kernel launch failed with error \"%s\".\n", cudaGetErrorString(cudaerr));
```
It makes the error message much more meaningful.
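As a variation on the snippet above (a sketch, not the only way to do it; the launch line is illustrative): checking cudaGetLastError() right after the launch additionally catches launch-configuration errors, while the synchronize surfaces asynchronous ones such as the illegal memory access described here:

```cpp
my_kernel<<<blocks, threads>>>(/* ... */);  // illustrative launch
cudaError_t err = cudaGetLastError();       // launch-configuration errors
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();          // asynchronous execution errors
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));
```

In PyTorch C++ extensions, the C10_CUDA_KERNEL_LAUNCH_CHECK() macro (from c10/cuda/CUDAException.h) serves a similar purpose for the launch check.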