Execution time does not decrease as batch size increases with GPUs

That’s not easy to answer: scaling depends not only on the code you are using, which might have bottlenecks of its own, but also on the system, especially when multiple GPUs are involved.
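Before hunting for a bottleneck, it helps to measure throughput (samples/sec) rather than raw time per batch, since time per batch is expected to grow with batch size. Below is a minimal, illustrative sketch (not a rigorous benchmark; the model and sizes are arbitrary) that times a forward pass at a few batch sizes, with the `torch.cuda.synchronize()` calls needed because CUDA kernels launch asynchronously:

```python
import time
import torch

def throughput(model, batch_size, device, iters=10):
    """Rough samples/sec for one batch size (illustrative only)."""
    x = torch.randn(batch_size, 128, device=device)
    # Warm-up so one-time initialization doesn't skew the measurement.
    for _ in range(3):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()  # kernels are async; sync before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
).to(device)

for bs in (16, 64, 256):
    print(f"batch {bs}: {throughput(model, bs, device):.0f} samples/sec")
```

If throughput stays flat as the batch size grows, the GPU is likely starved by something upstream (data loading, CPU preprocessing) rather than limited by compute.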

As already mentioned, data loading might be a bottleneck that prevents linear scaling, e.g. if you are loading the data from a network drive or if the CPU is too weak to keep up with the preprocessing pipeline. This post gives a good overview of potential data loading bottlenecks and best practices.
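The usual first steps are `num_workers > 0`, so preprocessing runs in background processes and overlaps with GPU compute, and `pin_memory=True`, which speeds up host-to-device copies. A minimal sketch with a dummy in-memory dataset (the sizes are arbitrary):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for your real one.
dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 10, (1000,)))

# num_workers > 0 loads batches in background worker processes;
# pin_memory=True allocates page-locked host memory for faster GPU copies.
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

x, y = next(iter(loader))
print(x.shape)  # torch.Size([64, 32])
```

The right `num_workers` value depends on your CPU core count and the cost of your preprocessing, so it is worth benchmarking a few settings.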

For multiple GPUs, it also depends on how the GPUs communicate. We generally recommend using DistributedDataParallel with one process per GPU, as it should be the fastest multi-GPU setup. The peer-to-peer connection could use NVLink, if your server supports it, or a slower variant, which would also affect scaling performance. You can check the connectivity via nvidia-smi topo -m.
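For reference, here is a minimal DistributedDataParallel sketch. To keep it runnable anywhere it uses the `gloo` backend with a single CPU process; in a real multi-GPU run you would launch one process per GPU (e.g. via `torchrun`), use the `nccl` backend, and move the model to that process's device:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup for illustration; torchrun would set these for you.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)  # nccl for real GPU runs

model = torch.nn.Linear(10, 1)
ddp_model = DDP(model)  # wraps the model; gradients are all-reduced across ranks

out = ddp_model(torch.randn(4, 10))
print(out.shape)

dist.destroy_process_group()
```

With one process per GPU, each rank computes on its own shard of the batch and DDP overlaps the gradient all-reduce with the backward pass, which is why it typically scales better than DataParallel.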