Increasing batch size doesn't give much speedup

I have a convolutional model that I’m applying to images, and 11 GB of RAM on the GPU. Somehow, increasing the batch size while still having everything fit in memory doesn’t seem to improve the speed that much.

When I train with batch size 2, it takes something like 1.5s per batch. If I increase it to batch size 8, the training loop takes 4.7s per batch, i.e. 4x the images in roughly 3.1x the time, so only about a 1.3x throughput speedup instead of the 4x I was hoping for.

The same holds for evaluation: batch size 1 takes 0.04s per batch, batch size 4 takes 0.12s, and batch size 8 takes 0.24s, so again roughly 1.3x throughput.
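As a sanity check on the timing itself: since CUDA kernels launch asynchronously, I believe per-batch timings need an explicit synchronize before reading the clock. A minimal sketch of how I understand it should be done (the model and shapes here are just stand-ins for mine):

```python
import time

import torch
from torch import nn

# stand-in conv model and input batch; substitute the real ones
model = nn.Conv2d(3, 64, 3, padding=1).cuda()
images = torch.randn(8, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(3):                # warm-up (kernel launches, cuDNN autotuning)
        model(images)
    torch.cuda.synchronize()          # make sure warm-up work has finished
    start = time.perf_counter()
    for _ in range(10):
        model(images)
    torch.cuda.synchronize()          # wait for all queued kernels before stopping the clock
    print((time.perf_counter() - start) / 10, "s per batch")
```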

Is it reasonable to expect that increasing batch size should allow me to utilize more of the GPU RAM and hence get a speedup almost for free?

What are some things I should look at? For one, I’ve tried to make sure the tensors are contiguous in memory (so I avoid having to call .contiguous()).
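For what it’s worth, that check is cheap to do explicitly, something like this (with a hypothetical input batch):

```python
import torch

x = torch.randn(8, 3, 224, 224, device="cuda")  # hypothetical input batch
print(x.is_contiguous())     # True: freshly allocated tensors are contiguous
y = x.permute(0, 2, 3, 1)    # permute returns a non-contiguous view
print(y.is_contiguous())     # False
y = y.contiguous()           # only copies if the layout actually needs fixing
```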


You can’t expect a linear speedup from increasing the batch size. The code might be parallelizing over other dimensions as well (like the image height/width) and might already be occupying the GPU pretty well.
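One way to check how busy the GPU already is at a small batch size is the built-in profiler, something along these lines (assumes a recent PyTorch with `torch.profiler`; the tiny model here is just a placeholder):

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

# placeholder conv model and batch; substitute the real ones
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).cuda()
batch = torch.randn(8, 3, 224, 224, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(batch)

# kernels sorted by CUDA time: if the top kernels already keep the GPU busy at a
# small batch size, a larger batch mostly just queues more of the same work
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```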

Thanks for the response!

Just to clarify your comment about “occupying the GPU pretty well”: I found that when the batch size is small (e.g. 4), GPU RAM is less than 50% used at steady state. When I double the batch size to 8, GPU RAM becomes almost 100% used, which makes sense since I’m doubling the batch size. I would have thought that this indicated I’d get close to a 2x speedup, but in practice I get more like 1.1x at best.

My guess is that there are other bottlenecks, or that the critical path in my model doesn’t actually benefit from a larger batch size, so while peak memory usage does double, the bottlenecked portion can’t use the extra memory to process things more quickly.
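One experiment I’m thinking of to test this: sweep batch sizes on synthetic data kept on the GPU, so data loading is out of the picture and I only see the raw compute scaling. Roughly like this, with a stand-in model:

```python
import time

import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).cuda()  # stand-in

with torch.no_grad():
    for bs in (1, 2, 4, 8, 16):
        x = torch.randn(bs, 3, 224, 224, device="cuda")  # synthetic batch, no DataLoader
        model(x)                      # warm-up
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(20):
            model(x)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        print(f"batch {bs:2d}: {bs * 20 / elapsed:8.1f} images/s")
```

If images/s plateaus here too, the GPU compute is probably already saturated at small batches; if it scales well here but not in the real training loop, the bottleneck is more likely data loading or other CPU-side work.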