Parallelizing for loop iterations on GPU

I want to parallelize a simple for loop that iterates over a list of pairs (stored as a PyTorch tensor) so that it runs on the GPU. The computation inside the loop doesn’t seem to be the bottleneck; the time is consumed by the huge input size.

What does it do?
It applies a function to each pair and updates its value, then increments a matrix cell using the resulting pair as the index.
All iterations of the loop are independent of each other, so they can be parallelized, but I can’t find a way to do that in PyTorch so that efficiency improves with CUDA. Any kind of help would be appreciated (I’m open to vectorizing, multiprocessing, or a change in input format).
Thank you :smile:

for i in range(batch_size):
    r_Pairs[i] = torch.floor(random.uniform(0.0, 1.0) * r_Pairs[i] * F) % F
    matrix[r_Pairs[i][0], r_Pairs[i][1]] += 1

Line 2 is trivially convertible to vectorized form, just use torch’s RNG (e.g. torch.rand) instead of random.uniform.
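For reference, a vectorized line 2 might look like the sketch below; `F`, the batch size, and the initial `r_Pairs` values are hypothetical stand-ins:

```python
import torch

F = 7                                              # hypothetical modulus
r_Pairs = torch.randint(0, F, (1000, 2)).float()   # stand-in input pairs

# one uniform sample per pair, broadcast over both elements,
# replacing the per-iteration random.uniform(0.0, 1.0) call
u = torch.rand(r_Pairs.shape[0], 1, device=r_Pairs.device)
r_Pairs = torch.floor(u * r_Pairs * F) % F
```

Because `u` has shape (N, 1), broadcasting applies the same random scalar to both elements of each pair, matching the per-iteration behavior of the original loop.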

I don’t remember if line 3 is easily vectorizable - there is scatter_add_, but you may need to convert to a 1-d view/indices (i1 * num_cols + i2). Or do that scattering loop on the CPU; that will be faster.
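A sketch of the scatter_add_ route on a flattened view, with hypothetical sizes, assuming the pairs are already valid integer indices into an F x F matrix:

```python
import torch

F, N = 7, 1000                        # hypothetical matrix size / batch size
idx = torch.randint(0, F, (N, 2))     # stand-in for the updated r_Pairs
matrix = torch.zeros(F, F)

# flatten (row, col) -> row * F + col, then accumulate all counts in one call
flat = idx[:, 0] * F + idx[:, 1]
matrix.view(-1).scatter_add_(0, flat, torch.ones(N))
```

`matrix.index_put_((idx[:, 0], idx[:, 1]), torch.ones(N), accumulate=True)` is an equivalent alternative that keeps the 2-d indexing.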
