CUDA kernel that accumulates results

Say I have a CUDA kernel that acts on two tensors and one of the tensors accumulates results.

template <typename scalar_t>
__global__ void CUDA_kernel(
    torch::PackedTensorAccessor32<scalar_t,2,torch::RestrictPtrTraits> A,
    torch::PackedTensorAccessor32<scalar_t,1> accu) {

  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  bool result = blabla(A[index]);

  if (result) accu[0] += 1;
  else accu[1] += 1;
}

In this case, the following should be true: sum(accu) == len(A)
However, in reality accu[0] == 1 and accu[1] == 1.

The cause of the error seems obvious: every thread reads accu before the other threads' increments have landed, so each one writes back as if accu[x] were still 0. The `+=` is a separate load, add, and store, so concurrent increments get lost. Is there a way to make this work?

Does this have anything to do with torch::RestrictPtrTraits? I feel stupid, because I did not give accu RestrictPtrTraits.

For anyone who is having the same problem: the solution is described here:
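In short, the usual fix for this kind of lost-update race is to replace the plain `+=` with `atomicAdd`, which performs the read-modify-write as a single indivisible operation. A minimal sketch of the corrected kernel, keeping the hypothetical `blabla` predicate from the question and adding a bounds check (the exact launch configuration and dtype support are assumptions):

```cuda
template <typename scalar_t>
__global__ void CUDA_kernel(
    torch::PackedTensorAccessor32<scalar_t,2,torch::RestrictPtrTraits> A,
    torch::PackedTensorAccessor32<scalar_t,1,torch::RestrictPtrTraits> accu) {

  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= A.size(0)) return;  // guard threads past the end of A

  bool result = blabla(A[index]);  // hypothetical predicate from the question

  // atomicAdd makes the load-add-store indivisible, so no increment is lost
  if (result) atomicAdd(&accu[0], scalar_t(1));
  else        atomicAdd(&accu[1], scalar_t(1));
}
```

Note that hardware `atomicAdd` is only defined for a limited set of types (e.g. `int`, `unsigned`, `float`, and `double` on compute capability 6.0+), so with a templated `scalar_t` you may need to restrict the instantiated types, or use PyTorch's `gpuAtomicAdd` helper, which dispatches to an appropriate atomic for the ATen scalar types. With atomics in place, `sum(accu) == len(A)` should hold, at the cost of the threads serializing on the two counters; if that becomes a bottleneck, per-block reduction in shared memory followed by one atomic per block is the common refinement.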