I want to parallelize a simple for loop that iterates over a list of pairs (stored as a PyTorch tensor) so that it runs on a GPU. The computation inside the loop doesn't seem to be the bottleneck; the time is consumed by the huge input size.
What does it do? It applies a function to each pair, updating its value in place, and then increments a matrix using the resulting pair as an index.
All iterations of the loop are independent of each other, so in principle they can be parallelized; however, I can't find a way to express that in PyTorch so that CUDA actually improves efficiency. Any kind of help would be appreciated (I'm open to vectorizing, multiprocessing, or a change in input format).
```python
for i in range(0, batch_size):
    r_Pairs[i] = torch.floor(random.uniform(0.0, 1.0) * r_Pairs[i] * F) % F
    matrix[r_Pairs[i]][r_Pairs[i]] += 1
```
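For reference, here is one way the loop could be vectorized so it runs as a handful of GPU kernel launches instead of `batch_size` Python iterations. This is a sketch under two assumptions: `torch.rand` replaces `random.uniform` (one random scalar per pair instead of one per loop iteration, which changes the random stream but not the distribution), and the double indexing `matrix[r_Pairs[i]][r_Pairs[i]]` is meant to increment `matrix[pair[0]][pair[1]]`; the names `F`, `batch_size`, `r_Pairs`, and `matrix` are taken from the question.

```python
import torch

# toy sizes; in the real setting batch_size is huge
F = 7
batch_size = 1000
device = "cuda" if torch.cuda.is_available() else "cpu"

r_Pairs = torch.randint(0, F, (batch_size, 2), device=device).float()
matrix = torch.zeros(F, F, device=device)

# one uniform random number per pair, broadcast over both pair elements
rand = torch.rand(batch_size, 1, device=device)

# elementwise update of every pair at once (same formula as the loop body)
r_Pairs = torch.floor(rand * r_Pairs * F) % F

# accumulate the counts with a scatter instead of an indexed += in a loop;
# assuming each pair is a (row, col) index into matrix
rows = r_Pairs[:, 0].long()
cols = r_Pairs[:, 1].long()
matrix.index_put_((rows, cols), torch.ones(batch_size, device=device),
                  accumulate=True)
```

Note that `accumulate=True` is what makes repeated indices safe: a plain `matrix[rows, cols] += 1` silently drops duplicate hits, while `index_put_` (or equivalently `index_add_` on a flattened view) sums them, matching the loop's behavior.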