Execution time does not decrease as batch size increases with GPUs

Hi, all.
I’m measuring the wall-clock time of a CIFAR-10 sample program from PyTorch. I run the program on a virtual machine instance on GCP (Google Cloud Platform) with eight K80 GPUs, and I’m testing the Adam optimizer with various batch sizes.

My assumption is that I can increase the batch size up to whatever fits in the GPUs’ memory, and that the GPUs perform the multiplications in parallel in roughly the same amount of time no matter how large the batch size is. This would mean that one optimization step with a batch size of 2048 takes almost the same time as one with 1024, so the execution time per epoch should be cut in half. For example, CIFAR-10 has 50,000 training images, so an epoch is about 49 steps at batch size 1024 but only about 25 steps at 2048; with a constant step time, the epoch should take roughly half as long.

However, in my experiment one step at batch size 2048 takes longer than one at 1024, and the execution time per epoch is almost the same. I don’t understand this result.

I believe that reducing execution time by increasing the batch size is the only benefit of having many GPUs, but I have not seen that benefit in my experiment. The results were almost the same with batch sizes of 4096 and 8192, meaning that the execution time per epoch does not improve at all.

I understand that reducing the time to reach a target accuracy is important, but that is not what I am asking about here. I believe that GPUs are supposed to do many multiplications at once, so the execution time should not increase with the number of multiplications. Would anybody please explain why this is not true in my case? Thank you in advance.

I would recommend starting by profiling the code and narrowing down the bottleneck.
E.g. if the data loading is the bottleneck, the GPUs might be starving, so increasing the batch size wouldn’t help much.
You could run a quick test by using synthetic data.
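Something like this rough sketch could work as the synthetic-data test (the resnet18 is just a placeholder for whatever model the sample program uses, and the random inputs are created directly on the GPU, so the DataLoader and CPU preprocessing are taken completely out of the picture; note the torch.cuda.synchronize() calls, since CUDA kernels run asynchronously):

```python
import time
import torch
import torch.nn as nn
import torchvision

device = "cuda"
model = torchvision.models.resnet18(num_classes=10).to(device)  # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

batch_size = 1024
# Synthetic CIFAR-10-shaped data created directly on the GPU,
# so the input pipeline cannot be the bottleneck here.
inputs = torch.randn(batch_size, 3, 32, 32, device=device)
targets = torch.randint(0, 10, (batch_size,), device=device)

nb_steps = 20
torch.cuda.synchronize()  # finish pending GPU work before starting the clock
start = time.perf_counter()
for _ in range(nb_steps):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()  # CUDA ops are asynchronous, so sync before stopping the clock
elapsed = time.perf_counter() - start
print(f"{elapsed / nb_steps:.4f}s per step at batch size {batch_size}")
```

If the time per step stays roughly flat when you double batch_size here but not with the real DataLoader, the input pipeline is the likely culprit.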

Hello ptrblck. Thank you for your reply.

I understand that your advice is quite reasonable, but I have to say that I’m using the optimizer as provided by PyTorch and the GCP instance as it is.

While I could search for the bottlenecks as you suggested, I wonder whether the PyTorch platform and the sample program are designed to fully take advantage of multi-GPU settings. So I should restate my question as follows.

Do the sample program and PyTorch code scale linearly with batch size when run on multiple GPUs?
If the code scales without problems on some other multi-GPU hardware, then I suppose the multi-GPU setup on GCP might have a problem with PyTorch.

Maybe I misunderstood something, but I just wondered whether it is common practice to care about GPU bottlenecks even when using PyTorch. Thank you.

That’s not easy to answer, as scaling depends not only on the code you are using, which might have potential bottlenecks, but also on the system, especially when it comes to multiple GPUs.

As already mentioned, the data loading might be a bottleneck, which might prevent linear scaling, e.g. if you are loading the data from a network drive or if the CPU is too weak to keep up with the preprocessing pipeline. This post gives you a good overview of potential data loading bottlenecks and best practices.
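As a rough example, the usual knobs on the DataLoader side look something like this (the dataset path and the num_workers value are placeholders and should be tuned to your machine):

```python
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
# "./data" is a placeholder path; local (SSD) storage avoids network-drive latency.
dataset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                       download=True, transform=transform)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2048,
    shuffle=True,
    num_workers=8,    # tune to the number of available CPU cores
    pin_memory=True,  # page-locked host memory allows faster, async host-to-device copies
)

device = "cuda"
for inputs, targets in loader:
    # non_blocking=True lets the copy overlap with compute when pin_memory is set
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward / backward / optimizer step ...
```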

For multiple GPUs, it also depends on how the GPUs communicate with each other. We generally recommend using DistributedDataParallel with one process per GPU, as it should be the fastest multi-GPU setup. The peer-to-peer connection could use NVLink, if your server supports it, or a slower variant, which would also play a part in the scaling performance. You could check the connectivity via nvidia-smi topo -m.
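A minimal DDP sketch could look like this (one process per GPU, launched here via torch.multiprocessing.spawn; the resnet18 is again just a placeholder, and the dataset is assumed to be downloaded already so the processes don’t race on the download):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # One process per GPU; NCCL handles the inter-GPU communication.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torchvision.models.resnet18(num_classes=10).cuda(rank)  # placeholder model
    model = DDP(model, device_ids=[rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())

    # DistributedSampler gives each process a different shard of the dataset.
    dataset = torchvision.datasets.CIFAR10(root="./data", train=True, download=False,
                                           transform=torchvision.transforms.ToTensor())
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=256, sampler=sampler,
                                         num_workers=4, pin_memory=True)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently every epoch
        for inputs, targets in loader:
            inputs = inputs.cuda(rank, non_blocking=True)
            targets = targets.cuda(rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # e.g. 8 on your instance
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```

Note that batch_size=256 here is the per-process batch size, so the effective global batch size would be 256 * 8 = 2048 on your instance.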

Thank you again ptrblck.

I only had a vague notion of data loading as an easy task of just feeding data to the GPUs.
Thanks to your explanation, I’m now aware that it involves a lot of hardware aspects.
I also appreciate Ross Wightman’s write-up, which you linked to. The description is very informative, and I learned a lot about the bottleneck.

Now I suspect the CPU power and the storage I/O of my GCP setup. I will also try to learn how to implement DistributedDataParallel in my environment for this program. Thank you.