CUDA kernel that accumulates results

Say I have a CUDA kernel that acts on two tensors and one of the tensors accumulates results.

template <typename scalar_t>
__global__ void CUDA_kernel(
    torch::PackedTensorAccessor32<scalar_t,2,torch::RestrictPtrTraits> A,
    torch::PackedTensorAccessor32<scalar_t,1> accu) {

  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  bool result = blabla(A[index]);

  if (result) accu[0] += 1;
  else accu[1] += 1;
}

In this case, the following should be true: sum(accu) == len(A)
However, in reality accu[0] == 1 and accu[1] == 1.

The cause of the error seems obvious: every thread reads accu before the other threads' increments have landed, so each one writes back as if accu[x] were still 0. The `+=` is a separate load, add, and store, so concurrent increments get lost. Is there a way to make this work?

Does this have anything to do with torch::RestrictPtrTraits? I feel stupid, because I did not give accu RestrictPtrTraits.

For anyone who is having the same problem: the solution is described here:
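In short, the usual fix for this kind of lost-update race is to replace the plain `+=` with `atomicAdd`, which performs the read-modify-write as a single indivisible operation. A minimal sketch of the corrected kernel, keeping the hypothetical `blabla` predicate from the question and adding a bounds check (the exact launch configuration and dtype support are assumptions):

```cuda
template <typename scalar_t>
__global__ void CUDA_kernel(
    torch::PackedTensorAccessor32<scalar_t,2,torch::RestrictPtrTraits> A,
    torch::PackedTensorAccessor32<scalar_t,1,torch::RestrictPtrTraits> accu) {

  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= A.size(0)) return;  // guard threads past the end of A

  bool result = blabla(A[index]);  // hypothetical predicate from the question

  // atomicAdd makes the load-add-store indivisible, so no increment is lost
  if (result) atomicAdd(&accu[0], scalar_t(1));
  else        atomicAdd(&accu[1], scalar_t(1));
}
```

Note that hardware `atomicAdd` is only defined for a limited set of types (e.g. `int`, `unsigned`, `float`, and `double` on compute capability 6.0+), so with a templated `scalar_t` you may need to restrict the instantiated types, or use PyTorch's `gpuAtomicAdd` helper, which dispatches to an appropriate atomic for the ATen scalar types. With atomics in place, `sum(accu) == len(A)` should hold, at the cost of the threads serializing on the two counters; if that becomes a bottleneck, per-block reduction in shared memory followed by one atomic per block is the common refinement.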