Hi, what I am trying to do is the following:

I have a data array A of shape (n, m), an index array I of the same shape (n, m), and a result array R of shape (x, n).

I want to scatter the elements of A into R, summing all values that scatter to the same index.

In NumPy this can be done for 1D arrays, for example with np.histogram and its weights option.
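As a minimal sketch of that 1D case: the toy values and the slot count `x = 3` below are made up for illustration, and `np.bincount` is shown only as an alternative that I believe gives the same sums.

```python
import numpy as np

# Hypothetical toy data: values to scatter and the slot each one goes to.
A = np.array([1.0, 2.0, 3.0, 4.0])
I = np.array([0, 2, 0, 1])
x = 3  # number of output slots (assumed for this example)

# np.histogram with weights: one unit-width bin per target index,
# so each bin collects the sum of the weights that land in it.
R_hist, _ = np.histogram(I, bins=np.arange(x + 1), weights=A)

# np.bincount with weights computes the same per-index sums directly.
R_bin = np.bincount(I, weights=A, minlength=x)

print(R_hist)
print(R_bin)
```

Both results hold one sum per target index; values sharing an index (here the two entries with index 0) are added together.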

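Here is an example in numba.cuda, in case it helps explain what I want to do. Note that in this kernel the arrays have an extra leading frame dimension, and -1 in I marks entries to skip: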

```
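from numba import cuda


@cuda.jit
def rewireValues(R, A, I, totalthreads):
    # One thread per element of I; flat global thread index.
    threadidx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    if threadidx >= totalthreads:
        return
    nj = I.shape[1]
    nk = I.shape[2]
    # Unravel the flat index into (frame, source, idx).
    idx = threadidx % nk
    source = (threadidx // nk) % nj
    frame = threadidx // (nj * nk)
    target = I[frame, source, idx]
    # -1 marks entries that should not be scattered.
    if target == -1:
        return
    # Atomic add so that values scattered to the same target are summed.
    cuda.atomic.add(R, (frame, target, 0), A[frame, source, idx, 0])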
```
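For comparison, this is a rough CPU sketch of what I expect the kernel to compute, written with `np.add.at`. The function name `rewire_values_reference` and the tiny test arrays are made up for illustration, and the shapes are assumed to match the kernel: I is (frames, sources, k), A is (frames, sources, k, 1), and R is (frames, targets, 1).

```python
import numpy as np

def rewire_values_reference(R, A, I):
    """CPU reference: R[f, I[f, s, k], 0] += A[f, s, k, 0], skipping I == -1."""
    # Positions of all valid (non -1) entries in the index array.
    frame, source, idx = np.nonzero(I != -1)
    target = I[frame, source, idx]
    last = np.zeros_like(frame)  # only channel 0 is used, as in the kernel
    # np.add.at is unbuffered, so repeated (frame, target) pairs accumulate
    # instead of overwriting each other.
    np.add.at(R, (frame, target, last), A[frame, source, idx, 0])
    return R

# Tiny made-up example: 1 frame, 2 sources, 2 entries each, 3 targets.
I = np.array([[[0, 2], [0, -1]]])
A = np.array([[[[1.0], [2.0]], [[3.0], [4.0]]]])
R = np.zeros((1, 3, 1))
rewire_values_reference(R, A, I)
print(R[0, :, 0])
```

In this example the two values with target 0 are summed, and the entry marked -1 is ignored.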