Multi-GPU operation for general tensor operations

Hi there,

I have a question about how to leverage PyTorch for general tensor operations (e.g., matmul, cdist) outside of deep learning. So I do not have a training process, just a plain calculation.

For instance, I would like to calculate the pairwise distance of two large matrices (100,000 samples, 128 dimensions) across four GPUs (cuda:0,1,2,3), since a single GPU does not have enough memory to do it.

A = torch.randn(100000, 128).cuda()
B = torch.randn(100000, 128).cuda()
pdist = torch.nn.PairwiseDistance(p=2)
pairwise_distance = pdist(A, B)

My questions are:

  • How can I easily split the task across multiple GPUs (similar to what joblib does with native Python)?
  • How can I do the calculation in parallel? (I can retrieve the GPU ids and split the matrix into multiple chunks with a for loop.)
  • Does PyTorch multiprocessing also handle splitting the data across multiple GPUs? I am afraid it does not.

Thanks for the help and happy new year!

Hey @yzhao062

How can I easily split the task across multiple GPUs (similar to what joblib does with native Python)?

Not sure if this is helpful. DataParallel parallelizes work across multiple GPUs using multi-threading, and the sketch below follows the same pattern. Note that matmul is more complicated than data parallelism: you will need to implement your own parallel version if you want to parallelize a single matmul operation across multiple GPUs.
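Since PairwiseDistance works row by row, your two matrices can be chunked along dim 0 and each chunk processed on its own GPU from a separate thread. A minimal sketch of that strategy (the helper name pairwise_distance_multi_gpu is just illustrative, not a PyTorch API):

import threading
import torch

def pairwise_distance_multi_gpu(a_cpu, b_cpu, device_ids=(0, 1, 2, 3)):
    # Keep the full matrices on the CPU; each GPU only ever holds one chunk.
    pdist = torch.nn.PairwiseDistance(p=2)
    chunks_a = a_cpu.chunk(len(device_ids), dim=0)
    chunks_b = b_cpu.chunk(len(device_ids), dim=0)
    results = [None] * len(device_ids)

    def worker(i, dev):
        a = chunks_a[i].to(f"cuda:{dev}")
        b = chunks_b[i].to(f"cuda:{dev}")
        results[i] = pdist(a, b).cpu()

    threads = [threading.Thread(target=worker, args=(i, dev))
               for i, dev in enumerate(device_ids)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return torch.cat(results)

A = torch.randn(100000, 128)
B = torch.randn(100000, 128)
pairwise_distance = pairwise_distance_multi_gpu(A, B)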

You can also use torch.multiprocessing and shared-memory tensors to share data between processes (see the last sketch in this post).

How can I do the calculation in parallel? (I can retrieve the GPU ids and split the matrix into multiple chunks with a for loop.)

It depends on which operator you need. Some operators, like add and abs, can be easily parallelized with the same chunking strategy DataParallel uses (see the sketch below). Other operators will be harder, especially if they require cross-device communication or data sharing during the computation.
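For example, an elementwise op like abs can be chunked across devices even with a plain loop, because CUDA kernels are launched asynchronously (a rough sketch, assuming four visible GPUs):

import torch

x = torch.randn(100000, 128)
chunks = x.chunk(4, dim=0)

outputs = []
for i, chunk in enumerate(chunks):
    # Each abs kernel is queued asynchronously, so the GPUs can compute concurrently.
    outputs.append(chunk.to(f"cuda:{i}").abs())

# .cpu() synchronizes with each device before copying the result back.
result = torch.cat([o.cpu() for o in outputs])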

Does PyTorch multiprocessing also handle splitting the data across multiple GPUs? I am afraid it does not.

PyTorch supports splitting a tensor in one process and then sharing each split with a different process via torch.multiprocessing and shared-memory tensors, but the difficult part is implementing the multi-device computation itself.
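A rough process-based sketch of that splitting, assuming four GPUs; the worker function and its argument layout are illustrative, not a fixed API:

import torch
import torch.multiprocessing as mp

def worker(rank, a, b, out, n_procs):
    # a, b, and out live in shared memory, so no per-process copy is made.
    rows = a.size(0) // n_procs
    start = rank * rows
    end = a.size(0) if rank == n_procs - 1 else start + rows
    pdist = torch.nn.PairwiseDistance(p=2)
    a_gpu = a[start:end].to(f"cuda:{rank}")
    b_gpu = b[start:end].to(f"cuda:{rank}")
    out[start:end] = pdist(a_gpu, b_gpu).cpu()

if __name__ == "__main__":
    A = torch.randn(100000, 128).share_memory_()
    B = torch.randn(100000, 128).share_memory_()
    out = torch.empty(100000).share_memory_()
    mp.spawn(worker, args=(A, B, out, 4), nprocs=4)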
