Multi-GPU operation for general tensor operations

Hi there,

I have a question about how to leverage PyTorch for general tensor operations (e.g., matmul, cdist) outside of deep learning. So I do not have a training process, just a plain calculation.

For instance, I would like to calculate the pairwise distance of two large matrices (100,000 samples, 128 dimensions) across four GPUs (cuda:0,1,2,3), since a single GPU does not have enough memory to do it.

A = torch.randn(100000, 128).cuda()
B = torch.randn(100000, 128).cuda()
pdist = torch.nn.PairwiseDistance(p=2)
pairwise_distance = pdist(A, B)

My questions are:

  • How can I easily split the task across multiple GPUs (similar to what joblib does with native Python)?
  • How can I do the calculation in parallel? (I can retrieve the GPU ids and split the matrix into multiple chunks with a for loop.)
  • Does PyTorch multiprocessing also handle splitting the data across multiple GPUs? I am afraid it does not.

Thanks for the help and happy new year!

Hey @yzhao062

How can I easily split the task across multiple GPUs (similar to what joblib does with native Python)?

Not sure if this is helpful. DataParallel parallelizes work across multiple GPUs using multi-threading, and the sketch below follows the same pattern. Note that matmul is more complicated than data parallelism: you will need to implement your own parallel version if you want to parallelize a single matmul operation across multiple GPUs.
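Since PairwiseDistance works row by row, your two matrices can be chunked along dim 0 and each chunk processed on its own GPU from a separate thread. A minimal sketch of that strategy (the helper name pairwise_distance_multi_gpu is just illustrative, not a PyTorch API):

import threading
import torch

def pairwise_distance_multi_gpu(a_cpu, b_cpu, device_ids=(0, 1, 2, 3)):
    # Keep the full matrices on the CPU; each GPU only ever holds one chunk.
    pdist = torch.nn.PairwiseDistance(p=2)
    chunks_a = a_cpu.chunk(len(device_ids), dim=0)
    chunks_b = b_cpu.chunk(len(device_ids), dim=0)
    results = [None] * len(device_ids)

    def worker(i, dev):
        a = chunks_a[i].to(f"cuda:{dev}")
        b = chunks_b[i].to(f"cuda:{dev}")
        results[i] = pdist(a, b).cpu()

    threads = [threading.Thread(target=worker, args=(i, dev))
               for i, dev in enumerate(device_ids)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return torch.cat(results)

A = torch.randn(100000, 128)
B = torch.randn(100000, 128)
pairwise_distance = pairwise_distance_multi_gpu(A, B)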

You can also use torch.multiprocessing and shared-memory tensors to share data between processes (see the last sketch in this post).

How can I do the calculation in parallel? (I can retrieve the GPU ids and split the matrix into multiple chunks with a for loop.)

It depends on which operator you need. Some operators, like add and abs, can be easily parallelized with the same chunking strategy DataParallel uses (see the sketch below). Other operators will be harder, especially if they require cross-device communication or data sharing during the computation.
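For example, an elementwise op like abs can be chunked across devices even with a plain loop, because CUDA kernels are launched asynchronously (a rough sketch, assuming four visible GPUs):

import torch

x = torch.randn(100000, 128)
chunks = x.chunk(4, dim=0)

outputs = []
for i, chunk in enumerate(chunks):
    # Each abs kernel is queued asynchronously, so the GPUs can compute concurrently.
    outputs.append(chunk.to(f"cuda:{i}").abs())

# .cpu() synchronizes with each device before copying the result back.
result = torch.cat([o.cpu() for o in outputs])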

Does PyTorch multiprocessing also handle splitting the data across multiple GPUs? I am afraid it does not.

PyTorch supports splitting a tensor in one process and then sharing each split with a different process via torch.multiprocessing and shared-memory tensors, but the difficult part is implementing the multi-device computation itself.
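A rough process-based sketch of that splitting, assuming four GPUs; the worker function and its argument layout are illustrative, not a fixed API:

import torch
import torch.multiprocessing as mp

def worker(rank, a, b, out, n_procs):
    # a, b, and out live in shared memory, so no per-process copy is made.
    rows = a.size(0) // n_procs
    start = rank * rows
    end = a.size(0) if rank == n_procs - 1 else start + rows
    pdist = torch.nn.PairwiseDistance(p=2)
    a_gpu = a[start:end].to(f"cuda:{rank}")
    b_gpu = b[start:end].to(f"cuda:{rank}")
    out[start:end] = pdist(a_gpu, b_gpu).cpu()

if __name__ == "__main__":
    A = torch.randn(100000, 128).share_memory_()
    B = torch.randn(100000, 128).share_memory_()
    out = torch.empty(100000).share_memory_()
    mp.spawn(worker, args=(A, B, out, 4), nprocs=4)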
