This question is purely about tensors and doesn't involve `nn` at all. I have a pretty trivial question:

I have 4 GPUs in each node. I wonder if there is a way to use all the GPUs in one node to do data-parallel tensor algebra. Let's say I have 4 independent, equal-size linear systems to solve. I can put each of them on a different GPU, but then how can I make the GPUs start together and synchronize once they have all finished? I could use threading, but before trying that I suspect PyTorch already has something to handle this. Any suggestions?
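To make the question concrete, here is a minimal sketch of what I have in mind. It is only an illustration under some assumptions: `solve_on_devices` is a hypothetical helper name I made up, the matrices are random and diagonally shifted so they are well conditioned, and the code falls back to the CPU when no GPU is visible. I am also not sure the solves really overlap this way, since the documentation for `torch.linalg.solve` says it synchronizes the CUDA device with the CPU.

```python
import torch

def solve_on_devices(systems, devices):
    """Solve each (A, b) pair on its assigned device; return solutions on the CPU.

    Hypothetical helper for illustration, not a PyTorch API.
    """
    pending = []
    for (A, b), dev in zip(systems, devices):
        # Move the system to its device and queue the solve there.
        A_d = A.to(dev, non_blocking=True)
        b_d = b.to(dev, non_blocking=True)
        pending.append(torch.linalg.solve(A_d, b_d))

    # Wait until every CUDA device used above has finished its queued work.
    for dev in set(devices):
        if dev.type == "cuda":
            torch.cuda.synchronize(dev)

    return [x.cpu() for x in pending]

# Use every visible GPU, or the CPU if there is none (assumption for portability).
n_gpu = torch.cuda.device_count()
available = [torch.device(f"cuda:{i}") for i in range(n_gpu)] or [torch.device("cpu")]

# Four independent, equal-size systems A x = b.
n = 64
torch.manual_seed(0)
systems = []
for _ in range(4):
    A = torch.randn(n, n, dtype=torch.float64) + n * torch.eye(n, dtype=torch.float64)
    b = torch.randn(n, 1, dtype=torch.float64)
    systems.append((A, b))

# Assign system i to device i, wrapping around if there are fewer than 4 devices.
devices = [available[i % len(available)] for i in range(len(systems))]
solutions = solve_on_devices(systems, devices)
```

With 4 GPUs, each system lands on its own device. What I would like to know is whether a loop like this already runs the four solves concurrently, or whether I need threads (or something else PyTorch provides) to get them to start together.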