Low training throughput when using small batch size

I’m using PyTorch to do regression on binary images using a UNet architecture. I’m using the Adam optimizer, and my training loop is very typical: iterate through a data loader, do a forward pass, compute the loss, do a backward pass, and update the weights. In my experience, when I use a small batch size, my training throughput is significantly lower than when I use a relatively large batch size.
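For reference, the loop looks roughly like this (a simplified sketch; `UNet`, `train_loader`, `num_epochs`, and the learning rate stand in for my actual code and settings):

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
model = UNet().to(device)          # placeholder for my ~20M-parameter UNet
criterion = nn.MSELoss()           # regression loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for images, targets in train_loader:   # 128x128 binary images
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        preds = model(images)              # forward pass
        loss = criterion(preds, targets)   # compute the loss
        loss.backward()                    # backward pass
        optimizer.step()                   # update the weights
```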

To make it more concrete: at a batch size of 256 or greater, the training throughput plateaus at ~500 samples/s. If I use a smaller batch size, say half of that (128), the throughput halves as well to ~250 samples/s, and this trend continues almost linearly, i.e., at a batch size of 64 I get roughly ~125 samples/s, and so on.
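To reproduce the samples/s figures, something like the following sketch can be used (not my exact measurement code; `model` and the data loader are placeholders, and the optimizer step is omitted):

```python
import time
import torch
import torch.nn.functional as F

def measure_throughput(model, loader, device, n_batches=50):
    """Approximate training samples/s over the first n_batches of the loader."""
    model.train()
    n_samples = 0
    torch.cuda.synchronize()                 # make sure no GPU work is pending
    start = time.perf_counter()
    for i, (images, targets) in enumerate(loader):
        if i == n_batches:
            break
        images, targets = images.to(device), targets.to(device)
        model.zero_grad(set_to_none=True)
        loss = F.mse_loss(model(images), targets)
        loss.backward()                      # include the backward pass in the timing
        n_samples += images.size(0)
    torch.cuda.synchronize()                 # wait for queued kernels before stopping the clock
    return n_samples / (time.perf_counter() - start)
```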

In case it’s relevant, my model has roughly 20 million parameters, my inputs are 2D binary images of size 128 by 128, and I’m training on an RTX 8000.

I would greatly appreciate it if you could help me pin down the issue, or, if this is the expected behaviour, explain why.

I would think about it in terms of the hardware a little. With a powerful card like yours, you naturally want to fill it (registers, tensor cores, etc.) with as much data (images) as possible and then do your multiplications, additions, etc. For your card, 256 or more images might be roughly the point at which the hardware saturates (your hardware bottleneck). If you halve the number of images in the batch, you are not taking full advantage of all the resources (think of not all the registers being full). In that case, even if your card is processing the images at full speed, you are not going through your training data as fast as it could (your throughput drops because you lowered the batch size).

Again I am talking purely from the hardware’s perspective.
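One way to see where the card saturates is to sweep the batch size on synthetic inputs and watch where samples/s stops scaling. A rough sketch along those lines (assuming single-channel inputs; `UNet` stands in for the actual model):

```python
import time
import torch
import torch.nn.functional as F

device = torch.device("cuda")
model = UNet().to(device)    # placeholder for the ~20M-parameter UNet

for bs in (32, 64, 128, 256, 512):
    # synthetic "binary" 128x128 images, so the data loader is out of the picture
    x = torch.rand(bs, 1, 128, 128, device=device).round()
    y = torch.rand(bs, 1, 128, 128, device=device)

    for _ in range(3):                       # warm-up (kernel selection, caches)
        F.mse_loss(model(x), y).backward()

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    iters = 20
    for _ in range(iters):
        F.mse_loss(model(x), y).backward()
    torch.cuda.synchronize()                 # GPU work is asynchronous, so sync before timing
    dt = time.perf_counter() - t0
    print(f"batch size {bs:4d}: {bs * iters / dt:8.1f} samples/s")
```

If the samples/s column flattens out at and above a batch size of 256, that matches the saturation picture above.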


@tiramisuNcustard Thank you so much for your detailed reply. It makes total sense, and I’m almost sure this is the case: when I monitor GPU usage with nvtop, I see almost 100% utilization at batch sizes of 256 and beyond, and much lower utilization (I didn’t note the exact numbers) with smaller batch sizes.
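The same utilization figure that nvtop shows can also be logged from inside the script; a sketch using `torch.cuda.utilization()`, which reads the NVML counter and needs the pynvml package installed:

```python
import torch

def log_gpu_stats(step):
    # torch.cuda.utilization() reports the % of time kernels were executing
    # over the last sample period (via NVML / pynvml)
    util = torch.cuda.utilization()
    mem_gib = torch.cuda.memory_allocated() / 2**30   # memory currently held by tensors
    print(f"step {step}: GPU util {util}%, allocated {mem_gib:.2f} GiB")

# e.g. call log_gpu_stats(step) every few hundred iterations in the training loop
```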

The only reason I hoped to get good throughput at smaller batch sizes is that in my experience, or rather in my particular case, I get better training performance (i.e., the model itself learns better) with smaller batch sizes for a fixed number of iterations/epochs.

@ma-sadeghi, I agree with you - model performance is different from hardware (GPU) performance. You might get better model performance with different combinations of the hyperparameters (e.g., batch size, number of epochs, number of nodes in the hidden layers, etc.). Some combinations might even give you better model performance with a larger batch size (a win-win from both the model-performance and hardware-performance perspectives).