I am implementing a PyTorch C++ extension that uses parallelism. The desired result is an N x D matrix, and each thread computes one row of this result. What is the recommended way to accumulate these rows?

To put this into context, my code looks as follows (simplified):

```cpp
at::Tensor nearestKKeys(at::Tensor queries, at::Tensor keys, int k, int maxLeafSize) {
    /*
    queries: (Nq, D)
    keys: (Nk, D)
    k: int
    return: (Nq, k)
    */
    int Nq = queries.size(0);
    int Nk = keys.size(0);
    at::Tensor result = at::empty({Nq, k}, queries.options().dtype(at::kInt));
    at::Tensor indices = at::arange(Nk, keys.options().dtype(at::kLong));
    BallTreePtr pBallTree = buildBallTree(
        keys,
        indices,
        maxLeafSize,
        0
    );
    #pragma omp parallel for
    for (int n = 0; n < Nq; n++) {
        at::Tensor query = queries.index({n});
        // query_norm was missing from my simplification; assume the L2 norm
        at::Tensor query_norm = query.norm();
        BestMatchesPtr pBestMatches = std::make_shared<BestMatches>(k);
        // places the desired result for this query in *pBestMatches
        pBallTree->query(query, query_norm, pBestMatches, k);
        // place the desired row in result
        at::Tensor matches = pBestMatches->getMatches();
        result.index_put_({n}, matches);
    }
    return result;
}
```
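
To make the write pattern I am aiming for concrete, here is a standalone sketch of it with plain `std::thread` and a flat buffer instead of OpenMP and ATen (the names and the stand-in per-row computation are mine): each thread owns a disjoint set of rows of the preallocated N x D buffer, so the row writes themselves need no synchronization.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Fill an (N x D) row-major buffer in parallel. Each worker writes only
// the rows it owns (strided assignment), so no locking is needed for
// the writes themselves.
std::vector<float> computeRows(int N, int D, int numThreads) {
    // preallocate the whole result once, before spawning threads
    std::vector<float> result(static_cast<size_t>(N) * D);
    auto worker = [&](int t) {
        for (int n = t; n < N; n += numThreads) {  // rows owned by thread t
            float* row = result.data() + static_cast<size_t>(n) * D;
            for (int d = 0; d < D; ++d)
                row[d] = static_cast<float>(n * D + d);  // stand-in for the real per-row work
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back(worker, t);
    for (auto& th : threads)
        th.join();
    return result;
}
```

This is roughly what I would like the OpenMP loop above to reduce to, with the per-row ball-tree query in place of the stand-in computation.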

This is currently not very efficient, and my profiler tells me that the threads spend a lot of time waiting on each other. I believe this is because the `result` variable is shared between threads. I don't think that sharing `pBallTree` really matters (because it's a pointer and the tree is only read), but do tell me if I'm mistaken. Hence my question: what is the recommended way to accumulate these rows?