Flip testing with batch size 2 costs 2x the time

Hi,
During testing, it is common practice to apply flip augmentation, which is implemented by concatenating the flipped image along the batch dimension. Usually, the total forward time of the network is expected to be less than 2x that of a single-image forward pass (because of GPU parallelism). However, in my experiments the batch-size-2 forward pass usually takes as long as 2x the single-image forward, or even 1 ms more. GPU utilization is around 90% for both single-image and batch-size-2 testing. Does anyone have any ideas on how to accelerate the batched testing? Thanks!
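
For reference, here is a minimal sketch of what I mean by batched flip testing; the model and input shapes are just placeholders for my actual network and test images:

```python
import torch

# Placeholder model and input; in practice these are the real network and test image.
model = torch.nn.Conv2d(3, 64, 3, padding=1).cuda().eval()
image = torch.randn(1, 3, 512, 512, device="cuda")

with torch.no_grad():
    # Flip along the width dimension and stack original + flipped into a batch of 2.
    flipped = torch.flip(image, dims=[3])
    batch = torch.cat([image, flipped], dim=0)

    output = model(batch)
    # Un-flip the second prediction and average it with the first.
    merged = (output[0:1] + torch.flip(output[1:2], dims=[3])) / 2
```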

Could you post a small executable code snippet so that we could have a look at it?
If you are timing your code, note that CUDA calls are asynchronous, so you would have to synchronize before starting and stopping the timer.
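
Something along these lines (a rough sketch; `model` and `batch` are stand-ins for your own network and input):

```python
import time
import torch

# Placeholder model and input; replace with your own.
model = torch.nn.Conv2d(3, 64, 3, padding=1).cuda().eval()
batch = torch.randn(2, 3, 512, 512, device="cuda")

with torch.no_grad():
    # Warm-up iterations so startup overhead doesn't skew the measurement.
    for _ in range(10):
        model(batch)

    torch.cuda.synchronize()  # wait for pending GPU work before starting the timer
    start = time.perf_counter()
    for _ in range(100):
        model(batch)
    torch.cuda.synchronize()  # wait for the GPU to finish before stopping the timer
    elapsed = (time.perf_counter() - start) / 100
    print(f"avg forward time: {elapsed * 1000:.3f} ms")
```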