Multiple GPU parallel computation

I have four batch datasets and I would like to know if I can process them in parallel on four GPUs. I have looked at the tutorial for DataParallel. However, my case is a bit different because I'm not actually running a model or network. I simply run an objective function and compute its gradient. So I just construct my code to follow the computation flow and then package it as a function. Is there any way I can make it parallel across several GPUs?


As a first step, you might want to see if explicitly assigning tensors to different devices, e.g. cuda:0 and cuda:1, and running the computation yields any speedup, since CUDA operations are asynchronous and can run in parallel on different GPUs. However, if your batch dimension is 4, there may be bottlenecks due to underutilization, depending on how much computation the objective function contains. Additionally, you might want to investigate packaging your objective function into an nn.Module even if it doesn't follow a typical architecture, as that would let you use DistributedDataParallel, which reduces the overhead of a single-process setup.
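A minimal sketch of the first suggestion, assuming the objective is a plain tensor function (the `objective` body and the batch shapes here are made up for illustration; it falls back to CPU when no GPU is present so the snippet stays runnable):

```python
import torch

# Hypothetical stand-in objective: sum of squares. Replace with your own.
def objective(x):
    return (x ** 2).sum()

# One batch per dataset; shapes are made up for illustration.
batches = [torch.randn(1024, 16) for _ in range(4)]

# Use one GPU per dataset if available; otherwise fall back to CPU.
n_gpu = torch.cuda.device_count()
devices = [torch.device(f"cuda:{i}") for i in range(n_gpu)] or [torch.device("cpu")]

params, losses = [], []
for i, batch in enumerate(batches):
    device = devices[i % len(devices)]
    x = batch.to(device).requires_grad_(True)
    loss = objective(x)   # kernel launches are asynchronous per device
    loss.backward()       # gradient accumulates into x.grad on the same device
    params.append(x)
    losses.append(loss)

# Synchronize only when the results are actually needed on the host.
if torch.cuda.is_available():
    torch.cuda.synchronize()
```

Because the kernels for each device are queued asynchronously, the four backward passes can overlap on four GPUs without any explicit threading.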

Thanks for your suggestions. I may not have been very clear about my setting. Basically I have four datasets, each consisting of training data of the same size, so you can view them as four independent batches, each with its own N dimension. That is why I think I should have four GPUs running the four datasets. Currently I'm processing the four datasets in a for loop, so I expect parallel GPU computation to speed up the whole process.

By the way, is there any useful documentation I can read on packaging my function into a module? Or what keywords should I search for?


You might want to read some canonical training scripts, e.g. the ImageNet example here:

One place to start might be to see if you can package your objective function in the forward method of an nn.Module:
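As a rough sketch of what that wrapping could look like (the `Objective` class, its parameter `w`, and the mean-squared objective are all hypothetical placeholders for your own function):

```python
import torch
import torch.nn as nn

class Objective(nn.Module):
    """Wraps a hand-written objective so it behaves like a model."""

    def __init__(self, dim):
        super().__init__()
        # Whatever quantities you differentiate become Parameters.
        self.w = nn.Parameter(torch.zeros(dim))

    def forward(self, batch):
        # Hypothetical objective: mean squared deviation of the batch from w.
        return ((batch - self.w) ** 2).mean()

obj = Objective(dim=16)
loss = obj(torch.randn(8, 16))
loss.backward()
print(obj.w.grad.shape)  # torch.Size([16])
```

Once the objective lives in `forward`, the module can be moved between devices with `.to(device)` and handed to wrappers like DistributedDataParallel just like any network.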

My original suggestion was that if you have four datasets, you could explicitly pass different devices to the .cuda() calls for your different datasets (torch.Tensor.cuda — PyTorch 1.10 documentation), which should automatically parallelize things for simple workflows that don't call torch.cuda.synchronize somewhere.
