Tensor Algebra, Data Parallelism


(Yubei Chen) #1

This question is purely tensor related and doesn’t involve nn or anything like it. I have a pretty trivial question:

I have 4 GPUs in each node. I wonder if there is a way to use all the GPUs in one node to do data-parallel tensor algebra. Let’s say I have 4 independent, equal-size linear systems to solve. I can put each of them on a different GPU, but how can I make the GPUs start together and then synchronize once they have finished? I could use threading, but before trying that I suspect PyTorch already has something to handle this. Any suggestions?


(colesbury) #2
import torch

As = [...]  # list of tensors
Bs = [...]  # list of tensors

# copy each tensor to its own GPU
for i in range(4):
    As[i] = As[i].cuda(device=i)
    Bs[i] = Bs[i].cuda(device=i)

# solve each linear system on the GPU that holds it
# (torch.gesv was later replaced by torch.linalg.solve(A, B))
outputs = []
for i in range(4):
    outputs.append(torch.gesv(Bs[i], As[i]))

# wait for every GPU to finish its queued work
for i in range(4):
    with torch.cuda.device(i):
        torch.cuda.synchronize()

I don’t know why you’d want to explicitly synchronize, though. Copying to the CPU will also synchronize.


(Yubei Chen) #3

Thanks a lot!

One thing I don’t quite understand here is:

The documentation doesn’t say that torch.gesv(Bs[i], As[i]) is non-blocking, so my understanding is that the next linear system will be dispatched to the next GPU only after the previous one has finished. If that is the case, the tasks are still executed sequentially, which is suboptimal. Is there somewhere I can read about the blocking and non-blocking mechanisms in PyTorch?

Btw, synchronization is not necessary for the given example; I was trying to figure out an optimal way to do non-blocking execution and synchronization with PyTorch.
Thanks again!


(colesbury) #4

CUDA calls are generally asynchronous with respect to the host. They execute in order on each GPU (within a stream). However, the copies As[i].cuda(device=i) and Bs[i].cuda(device=i) will block the host.

There’s a limit to the number of outstanding calls, though (about 1022), after which further calls will block.
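
The loop-based pattern described here can be sketched roughly as follows. This is only an illustration under a few assumptions: it uses torch.linalg.solve in place of torch.gesv (which is no longer available in recent PyTorch releases), it builds small random systems as stand-in data, and it falls back to the CPU when no GPU is visible. Note also that the PyTorch documentation for torch.linalg.solve states that it synchronizes the CUDA device with the CPU and points to torch.linalg.solve_ex as a variant that does not, so how much the solves overlap depends on the function and the PyTorch version.

```python
import torch

# Assumption: use every visible GPU, or fall back to the CPU so the sketch still runs.
if torch.cuda.is_available():
    devices = [torch.device("cuda", i) for i in range(torch.cuda.device_count())]
else:
    devices = [torch.device("cpu")]

# Stand-in data: one small, well-conditioned system per device.
torch.manual_seed(0)
n = 32
As = [torch.randn(n, n, dtype=torch.float64) + n * torch.eye(n, dtype=torch.float64)
      for _ in devices]
Bs = [torch.randn(n, 2, dtype=torch.float64) for _ in devices]

# Issue one solve per device; the results stay on the device for now.
outputs = []
for dev, A, B in zip(devices, As, Bs):
    outputs.append(torch.linalg.solve(A.to(dev), B.to(dev)))

# Copying back to the CPU waits for the work on that device to finish,
# so no explicit torch.cuda.synchronize() is needed here.
results = [X.cpu() for X in outputs]
```

Each entry of results is then an ordinary CPU tensor that can be checked against the original system, for example with torch.allclose(A @ X, B).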

You can also stick the calls to gesv in separate threads.
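
A minimal sketch of the threaded variant, assuming one worker thread per device. The helper names (available_devices, solve_on, solve_all) are hypothetical, torch.linalg.solve stands in for torch.gesv, the random systems are stand-in data, and the code falls back to the CPU when no GPU is visible.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def available_devices():
    """Return one device per visible GPU, or just the CPU if there are none."""
    if torch.cuda.is_available():
        return [torch.device("cuda", i) for i in range(torch.cuda.device_count())]
    return [torch.device("cpu")]


def solve_on(device, A, B):
    """Copy one system to `device`, solve A X = B there, and return X on the CPU."""
    X = torch.linalg.solve(A.to(device), B.to(device))
    return X.cpu()  # the copy back waits for the solve to finish


def solve_all(As, Bs):
    """Solve every (A, B) pair, assigning systems to devices round-robin."""
    devices = available_devices()
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [
            pool.submit(solve_on, devices[i % len(devices)], A, B)
            for i, (A, B) in enumerate(zip(As, Bs))
        ]
        # result() blocks until that system is solved, so this acts as the join point.
        return [f.result() for f in futures]


# Stand-in data: 4 independent, equal-size, well-conditioned systems.
torch.manual_seed(0)
As = [torch.randn(64, 64, dtype=torch.float64) + 64 * torch.eye(64, dtype=torch.float64)
      for _ in range(4)]
Bs = [torch.randn(64, 3, dtype=torch.float64) for _ in range(4)]
Xs = solve_all(As, Bs)
```

With threads, a call that blocks the host (such as the copy to the GPU) only stalls its own worker, so the other devices can keep receiving work in the meantime.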