Tensor inverse in parallel over multiple GPUs

I want to run torch.inverse() in parallel over multiple GPUs. I saw the post "Matmul on multiple GPUs", which shows that if you allocate a tensor on each GPU, the matmul calls run in parallel. I was able to replicate this behavior for matmul, but when I try the same thing with torch.inverse(), it appears to run sequentially when I watch nvidia-smi. Any ideas?
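
For context, the matmul pattern described in that post can be sketched roughly as follows. This is a minimal sketch, not the exact code from the other thread: the function name matmul_per_device and the matrix size are placeholders chosen for illustration, and it falls back to the CPU when no GPU is visible so that it still runs.

```python
import torch

def matmul_per_device(n=256):
    # Hypothetical helper: use every visible GPU, or the CPU if there is none.
    ngpu = torch.cuda.device_count()
    devices = [torch.device(f"cuda:{i}") for i in range(ngpu)] or [torch.device("cpu")]

    # One pair of operands allocated on each device.
    a = [torch.rand(n, n, device=d) for d in devices]
    b = [torch.rand(n, n, device=d) for d in devices]

    # CUDA kernel launches are asynchronous, so this loop can queue work
    # on each GPU without waiting for the previous GPU to finish.
    out = [torch.matmul(x, y) for x, y in zip(a, b)]

    # Wait for every GPU before reading or timing the results.
    for d in devices:
        if d.type == "cuda":
            torch.cuda.synchronize(d)
    return out

results = matmul_per_device()
```

The explicit synchronize at the end matters when timing: without it, the loop can return before the GPUs have finished.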

Could you post your code snippet so that we could have a look?

Thank you for the quick reply. As you can see, it closely mirrors the code in the other post.

import torch

ngpu = torch.cuda.device_count()
# Allocate one 5000x5000 tensor on each GPU.
lis = []
for i in range(ngpu):
    lis.append(torch.rand(5000, 5000, device=f'cuda:{i}'))

# Per the "Matmul on multiple GPUs" post, my understanding is that this should
# already run in parallel, but it doesn't seem to, based on watching nvidia-smi.
C_ = []
for i in range(ngpu):