Linear relation between batch size and inference time per batch


I made the following observation which I cannot explain. I am using the deeplabv3_resnet50 model from torchvision and run it in eval mode with different batch sizes. The runtime per batch, measured including a torch.cuda.synchronize() call, grows almost linearly with the batch size, i.e. the rate of images/second stays almost constant. I also tried both settings of cudnn.benchmark = True/False. I was expecting a roughly constant inference time per batch across all batch sizes, since the images should be processed in parallel on the GPU. Am I getting anything wrong here?
batch size - time per batch:
2 - 0.02s
4 - 0.031s
8 - 0.05s
16 - 0.094s
32 - 0.178s
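For reference, this is roughly how I measure (shown here with a small stand-in model instead of deeplabv3_resnet50 so the snippet runs quickly even on CPU; the warm-up and synchronize pattern is the relevant part):

```python
import time
import torch
import torch.nn as nn

# Small stand-in model; in the real experiment this would be
# torchvision.models.segmentation.deeplabv3_resnet50(...) in eval mode.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

def time_batch(batch_size, iters=5, size=64):
    x = torch.randn(batch_size, 3, size, size, device=device)
    with torch.no_grad():
        model(x)  # warm-up iteration (lazy init, cudnn autotuning)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / iters

for bs in (2, 4, 8, 16, 32):
    t = time_batch(bs)
    print(f"batch {bs:2d}: {t:.4f}s/batch, {bs / t:.1f} images/s")
```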

Thanks for your help.

CUDA Version: 10.0
Tesla V100

If I understand the results correctly, the throughput would increase from 100 images/s to 180 images/s, which is not almost constant.
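The images/s numbers follow directly from the timings you posted:

```python
# time per batch as reported above
timings = {2: 0.02, 4: 0.031, 8: 0.05, 16: 0.094, 32: 0.178}
for bs, t in timings.items():
    print(f"batch {bs:2d}: {bs / t:.0f} images/s")
# batch size 2 gives 2 / 0.02 ≈ 100 images/s,
# batch size 32 gives 32 / 0.178 ≈ 180 images/s
```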

Generally, you should see a better throughput by increasing the batch size.
However, the overall performance also depends on other potential bottlenecks in the model architecture as well as the training routine, such as data loading.
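As an illustration of the data loading point: the usual way to overlap loading with GPU work is via DataLoader workers. A minimal sketch with a dummy in-memory dataset (the dataset and sizes here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for a real image dataset.
images = torch.randn(64, 3, 32, 32)
targets = torch.randint(0, 10, (64,))
dataset = TensorDataset(images, targets)

# num_workers > 0 prepares batches in background processes, so the GPU
# does not have to wait for the next batch; pin_memory=True additionally
# speeds up host-to-GPU copies when a GPU is used.
loader = DataLoader(dataset, batch_size=16, num_workers=2)

for x, y in loader:
    pass  # forward pass would go here
print(f"{len(loader)} batches of size 16")
```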

I would also recommend trying out the latest stable release, as some performance improvements were made. :wink:

You are right, it is not constant, but shouldn’t there still be a larger increase in throughput if the batch size is 16× larger?
I measured the time only for the forward pass through the network, so I think data loading should not be an issue.
I will try the latest release.

It depends on the utilization of the GPU and how/if the operations are already parallelizing in other dimensions than the batch size, such as the spatial dimensions of your input tensors.
If your GPU utilization is already high, you won’t see a linear speedup.

During training you will of course need fewer iterations per epoch, which is another speedup, but that is unrelated to your use case.