How to do atomicAdd() on element returned by PackedTensorAccessor32

I have a CUDA kernel that takes the argument torch::PackedTensorAccessor32<scalar_t, 4, torch::RestrictPtrTraits> input.

How do I perform atomic add on the elements of input?

I.e., input[n][c][y][x] += (scalar_t) 1 works, but atomicAdd(&input[n][c][y][x], 1) does not. The problem is that it adds 256 instead of 1 to the tensor at the specified position, so I assume there is a problem with type conversions.

How can I do atomicAdd properly?

cc @ptrblck, who is more familiar with CUDA primitives.

Could you pass the value as scalar_t or post a small code snippet to reproduce this issue, please?
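For reference, a minimal sketch of what passing the value as scalar_t could look like inside such a kernel (the kernel name, the index computation, and the lack of bounds checks are made up for illustration, not taken from the thread):

#include <torch/extension.h>

template <typename scalar_t>
__global__ void increment_kernel(
    torch::PackedTensorAccessor32<scalar_t, 4, torch::RestrictPtrTraits> input) {
    const int n = blockIdx.x;
    const int c = blockIdx.y;
    const int y = threadIdx.y;
    const int x = threadIdx.x;
    // Cast the increment to scalar_t so the matching atomicAdd overload is selected.
    // Plain atomicAdd only supports a fixed set of types (e.g. int, unsigned int,
    // float, and double on newer architectures), so scalar_t must be one of those.
    atomicAdd(&input[n][c][y][x], static_cast<scalar_t>(1));
}

The dispatch and launch code stays the same; only the type of the second argument to atomicAdd changes.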

It works now. There was another mistake on my side.

I ran into the same problem. The bug is very likely that the input tensor is on the CPU rather than the GPU.

In my case, I defined my output tensor with torch::zeros() rather than torch::zeros_like(), so it was allocated on the CPU and was not modified by the GPU kernel.
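A sketch of the difference, assuming input is a CUDA tensor (the sizes and variable names are just illustrative):

// torch::zeros() with only the sizes allocates on the CPU by default:
auto out_cpu = torch::zeros({8, 3, 32, 32});

// Pass the input's options to allocate on the same device with the same dtype ...
auto out_gpu = torch::zeros({8, 3, 32, 32}, input.options());

// ... or let zeros_like() copy shape, dtype, and device in one call:
auto out_like = torch::zeros_like(input);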

In fact, an illegal memory access exception is thrown, but we can’t see it unless we explicitly synchronize after the kernel launch.

Adding synchronization code after the kernel launch was very useful for my debugging:

cudaError_t cudaerr = cudaDeviceSynchronize();
if (cudaerr != cudaSuccess)
    printf("kernel launch failed with error \"%s\".\n", cudaGetErrorString(cudaerr));

It makes the error message much more meaningful.
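One possible refinement (not from this thread, just a common CUDA debugging pattern): cudaGetLastError() right after the launch reports launch-configuration problems immediately, while cudaDeviceSynchronize() surfaces asynchronous errors such as the illegal memory access mentioned above.

// Right after the kernel launch:
cudaError_t launcherr = cudaGetLastError();    // launch errors (e.g. invalid grid/block config)
if (launcherr != cudaSuccess)
    printf("kernel launch failed with error \"%s\".\n", cudaGetErrorString(launcherr));

cudaError_t cudaerr = cudaDeviceSynchronize(); // asynchronous errors (e.g. illegal memory access)
if (cudaerr != cudaSuccess)
    printf("kernel execution failed with error \"%s\".\n", cudaGetErrorString(cudaerr));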