DataParallel scoring multi-GPU utilization

Hello, I followed the online DataParallel tutorial and I can’t get the model to split compute evenly among different GPUs at score-time (forward pass of trained model). On 3 GPUs, I get something like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2…    On   | 00001961:00:00.0 Off |                    0 |
| N/A   53C    P0   224W / 300W |  15248MiB / 16130MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2…    On   | 00003673:00:00.0 Off |                    0 |
| N/A   49C    P0    86W / 300W |   7004MiB / 16130MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2…    On   | 00005A1F:00:00.0 Off |                    0 |
| N/A   54C    P0    76W / 300W |   6996MiB / 16130MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+

So usually GPUs 0 and 2 are loaded while GPU 1 is underutilized. I also get a very large lag between batches: almost 1-2 seconds of idle time when all three GPUs sit at 0%, then they do some compute, then drop back to 0% again.

My guess is that the syncing on GPU 0 is the culprit. Is there a way to run scoring across multiple GPUs in PyTorch so that memory usage and compute are spread evenly? Note that this is different from training, since I'm not computing a loss or aggregating gradients.

The code is here: https://github.com/waldeland/CNN-for-ASI/blob/master/test_parallel.py. I already tried calling .to(device) before wrapping with DataParallel and specifying "device_ids" - nothing seems to help. Another option would be DistributedDataParallel, I suppose, but I want to understand why this isn't working first.
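
For reference, the relevant part of the setup looks roughly like this (simplified sketch; TrainedModel and loader are placeholders for the network and DataLoader in the script):

import torch
import torch.nn as nn

device = torch.device("cuda:0")

network = TrainedModel()                      # placeholder for the trained CNN
network.to(device)                            # tried .to(device) before wrapping
network = nn.DataParallel(network, device_ids=[0, 1, 2])
network.eval()

with torch.no_grad():
    for batch in loader:                      # placeholder DataLoader
        batch = batch.to(device)              # the whole batch lands on GPU 0 first
        output = network(batch)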

What is the batch size of the input you pass to the nn.DataParallel wrapped model?

The input is split as evenly as possible over the devices you want to use. But if the split is uneven, for example when splitting over 3 GPUs, a subset of the GPUs can end up with suboptimal performance: a batch size of 11 is going to be much worse than a batch size of 8. If you don't have enough data for an even split, where every GPU gets a power-of-two-sized batch, you can always pad the batch with garbage tensors, since you're only doing inference.
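
To make that concrete, here is a small sketch (the chunking below mirrors roughly how the scatter splits a batch; the padding trick is just one way to do it):

import torch

batch = torch.randn(11, 3, 64, 64)

# The batch is split chunk-wise across the devices, so 11 samples
# over 3 GPUs become chunks of 4, 4 and 3.
print([c.shape[0] for c in batch.chunk(3)])        # [4, 4, 3]

# Padding up to a multiple of the GPU count gives every replica the
# same amount of work; drop the padded rows from the output afterwards.
n_gpus = 3
pad = (-batch.shape[0]) % n_gpus
if pad:
    batch = torch.cat([batch, torch.zeros(pad, *batch.shape[1:])])
print([c.shape[0] for c in batch.chunk(n_gpus)])   # [4, 4, 4]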

It’s in the code: 2^12=4096. The model we’re using has a fairly small memory footprint and we want to use large batches to maximize GPU memory utilization for bulk scoring.

I get this behavior on 2-8 GPUs, not just 3, so an odd number of GPUs shouldn't be the factor. Do you think I should make the batch size a multiple of the number of GPUs?

Have you tried running a profiler (like nvprof) to see if there is anything preventing the GPUs from making forward progress? This would show you if there is any imbalance between the work the GPUs perform.
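
For example, something along these lines with the built-in autograd profiler (nvprof gives a timeline view from the command line as well):

import torch

with torch.no_grad():
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        output = network(batch)

# Per-op CPU/CUDA totals; if the GPUs spend most of their time idle,
# the table tends to be dominated by copies and synchronization.
print(prof.key_averages().table(sort_by="cuda_time_total"))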

The problem is that although one can distribute the forward pass and avoid collecting the outputs on one GPU, there is no way to distribute the data evenly across GPUs with DataParallel: the full batch goes on GPU 0 (or one GPU of your choice) and is then split into smaller minibatches for the other GPUs, so GPU 0 becomes the memory bottleneck. This article explains it well: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

This behavior of DataParallel isn’t an issue for large models because size(model)>size(batch), but in our case size(model)<<size(batch).
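
A minimal way to see the imbalance (nn.Linear stands in for our small model here):

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(1024, 1024).cuda(), device_ids=[0, 1, 2])

x = torch.randn(4096, 1024, device="cuda:0")  # the full batch is staged on GPU 0

with torch.no_grad():
    _ = model(x)

for i in range(3):
    print(f"cuda:{i}", torch.cuda.memory_allocated(i) // 2**20, "MiB")
# GPU 0 holds the full batch plus its own shard; GPUs 1 and 2 hold only their shards.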

I see - perhaps the better approach, then, is to create your own version of nn.DataParallel that scatters straight from CPU to the right destination device. That way you don't pay the cost of first copying to GPU 0 and then scattering from there to the other GPUs.
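
A rough sketch of what I mean, built from the same primitives DataParallel uses internally (the function name and chunking policy here are illustrative, and the model is assumed to already sit on the first device):

import torch
from torch.nn.parallel import replicate, parallel_apply

def forward_scatter_from_cpu(model, batch, device_ids):
    # model is assumed to live on cuda:device_ids[0].
    # Split the CPU batch and copy each chunk straight to its own GPU,
    # instead of staging the whole batch on GPU 0 first.
    chunks = batch.chunk(len(device_ids))
    inputs = [(c.to(f"cuda:{d}", non_blocking=True),)
              for d, c in zip(device_ids, chunks)]
    replicas = replicate(model, device_ids[:len(inputs)])
    with torch.no_grad():
        outputs = parallel_apply(replicas, inputs)
    # Concatenate on CPU so no single GPU has to hold all the outputs.
    return torch.cat([o.cpu() for o in outputs])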

edit: It looks like nn.DataParallel already supports this if you just keep your input on CPU.
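
i.e. roughly this (a sketch with placeholder names; note the outputs are still gathered on the output_device, GPU 0 by default):

import torch
import torch.nn as nn

network = nn.DataParallel(TrainedModel().cuda(), device_ids=[0, 1, 2])
network.eval()

with torch.no_grad():
    for batch in loader:
        # Keep the batch on CPU: DataParallel then copies each chunk
        # directly to its destination GPU instead of staging everything
        # on GPU 0 first.
        output = network(batch)
        predictions = output.argmax(dim=1).cpu()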