Larger batch size doesn't shorten training time

I'm new to semantic segmentation and to PyTorch's DistributedDataParallel.
I'm running this PSPNet code on my machine (2 x P100).
To my knowledge, a larger batch size usually means shorter training time, but that didn't happen here.
On a single GPU, the overall training time with batch_size=8 is about the same as (or even a little longer than) with batch_size=2. Specifically, the time for 10 iterations with bs=8 is 4 times that with bs=2.
Did I make a mistake somewhere?

GPU kernels are typically launched in blocks whose size is a multiple of the warp size (32), since CUDA executes instructions in SIMD fashion across a warp. So it is possible that for batch sizes of 8 and 2 the underlying CUDA kernel is actually launched with the same effective size (say 32), in which case the unused rows are zero-filled or simply discarded from the result, and you'll see roughly the same runtime.
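
If you want to check whether that kind of plateau shows up on your hardware, a rough micro-benchmark along these lines can help. This is only a sketch: the conv layer shape and input sizes below are made up for illustration, not taken from PSPNet.

```python
import time
import torch

# Illustrative micro-benchmark: forward time of a single conv layer vs. batch size.
conv = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda()
x_full = torch.randn(32, 256, 64, 64, device="cuda")

for bs in (1, 2, 4, 8, 16, 32):
    x = x_full[:bs]
    # Warm-up so one-time CUDA initialization doesn't skew the timing.
    for _ in range(5):
        conv(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):
        conv(x)
    torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    print(f"bs={bs:2d}: {(time.perf_counter() - start) / 20 * 1e3:.2f} ms per forward")
```

If the per-forward time is flat across small batch sizes and only starts growing past some point, the GPU isn't saturated at the small sizes; if it scales roughly linearly from bs=2 onward, the GPU is already busy at bs=2 and a bigger batch won't buy you much per-sample speedup.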

Can you share the runtime differences for iterations 1, 2, 3, …, 10, 11, …? It's possible there are other operations in the pipeline that are causing the inverted performance numbers.
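
For collecting those per-iteration numbers, something like the sketch below works. The torch.cuda.synchronize() calls matter, since CUDA launches are asynchronous and timing without them can be misleading. The tiny model, loss, and random tensors here are placeholders; swap in your actual PSPNet model and dataloader.

```python
import time
import torch

# Stand-in model/data for illustration -- replace with your PSPNet setup.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 21, 1),
).cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batch_size = 8  # run once with 2 and once with 8 to compare
for i in range(20):
    images = torch.randn(batch_size, 3, 256, 256, device="cuda")
    labels = torch.randint(0, 21, (batch_size, 256, 256), device="cuda")

    torch.cuda.synchronize()
    t0 = time.perf_counter()

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()  # make sure this iteration's GPU work is finished
    print(f"iter {i}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
```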