# Multi-GPU operation for general tensor operation

Hi there,

I have a question regarding how to leverage torch for general tensor operations (e.g., matmul, cdist) outside of deep learning. So I do not have a training process, just a simple calculation.

For instance, I would like to calculate the pairwise distance between two large matrices (100,000 samples, 128 dimensions each) using four GPUs (cuda:0,1,2,3). A single GPU does not have enough memory to do so.

```python
A = torch.randn(100000, 128).cuda()
B = torch.randn(100000, 128).cuda()

pdist = torch.nn.PairwiseDistance(p=2)
pairwise_distance = pdist(A, B)
```

My questions are:

• How can I easily split the task across multiple GPUs (just like joblib with native Python)?
• How can I do the calculation in parallel? (I could retrieve the GPU id and split the matrix into multiple folds with a for loop.)
• Does PyTorch multiprocessing also handle data splitting across multiple GPUs? I am afraid that is not the case.

Thanks for the help and happy new year!

Hey @yzhao062

> how to easily split the task to multiple GPUs (just like joblib with native Python)?

Not sure if this is helpful: `DataParallel` parallelizes work across multiple GPUs using multi-threading (see `torch.nn.parallel.parallel_apply`). But `matmul` is more complicated than simple data parallelism. You will need to implement your own parallel version if you would like to parallelize one `matmul` operation across multiple GPUs.
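As a rough illustration of that thread-per-device pattern, here is a minimal sketch (my own code, not part of PyTorch) that shards a row-wise pairwise-distance computation across devices. It falls back to two CPU "devices" when no GPUs are visible, so the assumption is only that torch is installed:

```python
import threading
import torch

def chunked_pdist(A, B, devices):
    # Split rows evenly across devices; one thread per device, which is
    # the same multi-threading pattern DataParallel uses internally.
    A_chunks = A.chunk(len(devices))
    B_chunks = B.chunk(len(devices))
    results = [None] * len(devices)

    def worker(i, dev):
        a = A_chunks[i].to(dev)
        b = B_chunks[i].to(dev)
        # row-wise L2 distance on this device's shard
        results[i] = torch.pairwise_distance(a, b, p=2).cpu()

    threads = [threading.Thread(target=worker, args=(i, d))
               for i, d in enumerate(devices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return torch.cat(results)

# Fall back to CPU "devices" when no GPUs are visible, so the sketch
# stays runnable anywhere.
if torch.cuda.is_available():
    devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
else:
    devices = ["cpu", "cpu"]

A = torch.randn(1000, 128)
B = torch.randn(1000, 128)
out = chunked_pdist(A, B, devices)
```

This works because row-wise distance has no cross-shard dependency; an operation like a full `matmul` would need the shards to exchange data, which is where it gets hard.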

You can also use `torch.multiprocessing` with shared-memory tensors to share data between processes.
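A small sketch of that shared-memory mechanism (my own example, CPU tensors only): `share_memory_()` moves a tensor's storage into shared memory, so writes made in a child process are visible to the parent. It assumes a Unix `fork` start method for brevity; CUDA tensors require `spawn` instead.

```python
import torch
import torch.multiprocessing

# "fork" keeps this sketch simple on Linux/macOS; code touching CUDA
# tensors must use the "spawn" start method instead.
mp = torch.multiprocessing.get_context("fork")

def worker(data, out, i):
    # Both tensors live in shared memory, so this write is visible
    # to the parent process after join().
    out[i] = data[i].abs().sum()

data = torch.randn(4, 128)
data.share_memory_()   # move the storage into shared memory
out = torch.zeros(4)
out.share_memory_()

procs = [mp.Process(target=worker, args=(data, out, i)) for i in range(4)]
for p in procs:
    p.start()
for p in procs:
    p.join()
```

Each process here handles one row; in the multi-GPU setting each process would instead move its shard to its own device before computing.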

> how to do the calculation in parallel (I could retrieve the gpu id and split the matrix into multiple fold with a for loop).

It depends on which operator you need. Some operators like `add` and `abs` can be easily parallelized, and you can use the same strategy as used by `DataParallel` to achieve that. Other operators will be harder, especially if they require cross-device communication/data-sharing during the computation.
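For the original use case, the full (N, N) distance matrix is actually one of the easier ones to shard, since each output row-block depends only on a slice of `A` plus a full copy of `B`. A minimal sketch (my own code; the loop launches work device by device, and since CUDA kernels run asynchronously the devices still overlap somewhat, though a thread-per-device version would parallelize more cleanly):

```python
import torch

def sharded_cdist(A, B, devices, out_device="cpu"):
    # Each device computes one row-block of the full (N, M) distance
    # matrix: its shard of A's rows against a full copy of B. No single
    # device ever materializes the whole result.
    blocks = []
    for chunk, dev in zip(A.chunk(len(devices)), devices):
        d = torch.cdist(chunk.to(dev), B.to(dev), p=2)
        blocks.append(d.to(out_device))
    return torch.cat(blocks)

devices = ([f"cuda:{i}" for i in range(torch.cuda.device_count())]
           if torch.cuda.is_available() else ["cpu", "cpu"])

A = torch.randn(200, 16)
B = torch.randn(300, 16)
D = sharded_cdist(A, B, devices)
```

Note this only avoids materializing the result on one device; `B` is still replicated everywhere, which is fine at 100,000 x 128 but would not scale to a `B` that itself exceeds single-GPU memory.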

> does pytorch multiprocessing also handle data split with multiple GPU? I am afraid that is not the case.

PyTorch supports splitting a tensor in one process and then sharing each split with a different process via `torch.multiprocessing` and shared-memory tensors, but the difficult part is implementing the multi-device computation itself.
