I have a CUDA kernel taking the argument:
torch::PackedTensorAccessor32<scalar_t, 4, torch::RestrictPtrTraits> input
How do I perform an atomic add on the elements of input?
input[n][c][y][x] += (scalar_t) 1 works, but
atomicAdd(&input[n][c][y][x], 1) does not: it adds 256 instead of 1 at the specified position, so I assume there is a problem with type conversions.
How can I do atomicAdd properly?
cc @ptrblck, who is more familiar with CUDA primitives.
Could you pass the value as scalar_t, or post a small code snippet to reproduce this issue, please?
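For reference, a minimal sketch of what the cast-to-scalar_t version could look like; the kernel name and the indexing arguments here are illustrative, not from the original post:

```cpp
template <typename scalar_t>
__global__ void increment_kernel(
    torch::PackedTensorAccessor32<scalar_t, 4, torch::RestrictPtrTraits> input,
    int n, int c, int y, int x) {
  // Pass the addend as scalar_t so the intended atomicAdd overload is chosen;
  // a bare int literal can otherwise be matched against a different overload.
  atomicAdd(&input[n][c][y][x], static_cast<scalar_t>(1));
}
```

PyTorch also ships a gpuAtomicAdd helper (in ATen/cuda/Atomic.cuh) that covers dtypes such as at::Half that plain atomicAdd does not support on all architectures.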
It works now. There was another mistake on my side.
I ran into the same problem. The bug is very likely that the input tensor is on the CPU rather than the GPU.
In my case, I defined my output tensor using torch::zeros() rather than torch::zeros_like(), so my output tensor was on the CPU and was not modified by the GPU kernel.
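To illustrate the difference (the shape below is made up): torch::zeros() defaults to a CPU tensor unless you pass device options, while torch::zeros_like() inherits the reference tensor's device and dtype:

```cpp
// input is assumed to be a CUDA tensor; {2, 3, 4, 5} is just an example shape.
auto out_cpu  = torch::zeros({2, 3, 4, 5});                  // CPU by default -> GPU kernel writes are invalid
auto out_same = torch::zeros({2, 3, 4, 5}, input.options()); // same device and dtype as input
auto out_like = torch::zeros_like(input);                    // also inherits device and dtype
```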
In fact, an illegal memory access exception is thrown, but we can't see it unless we explicitly synchronize after the kernel launch. Adding synchronization code after the kernel launch was very useful for my debugging:
```cpp
cudaError_t cudaerr = cudaDeviceSynchronize();
if (cudaerr != cudaSuccess)
    printf("kernel launch failed with error \"%s\".\n", cudaGetErrorString(cudaerr));
```
It makes the error message much more meaningful.
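As a variation on the snippet above (a sketch, not the only way to do it; the launch line is illustrative): checking cudaGetLastError() right after the launch additionally catches launch-configuration errors, while the synchronize surfaces asynchronous ones such as the illegal memory access described here:

```cpp
my_kernel<<<blocks, threads>>>(/* ... */);  // illustrative launch
cudaError_t err = cudaGetLastError();       // launch-configuration errors
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();          // asynchronous execution errors
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));
```

In PyTorch C++ extensions, the C10_CUDA_KERNEL_LAUNCH_CHECK() macro (from c10/cuda/CUDAException.h) serves a similar purpose for the launch check.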