Peculiar inference speed test

Hello,

I ran some inference speed tests because I was getting odd results: with batch size 1, the model's per-image inference time was roughly 4 times higher than with batch size 10.
Below is the code for the batch-size-1 test.

import torch
# HRNet (and the commented-out TRTModule) come from my local code;
# their imports are omitted here for brevity

device = torch.device("cuda")

pose = HRNet(32, 17)
pose.load_state_dict(torch.load('pose_hrnet_w32_256x192.pth'))

# pose = TRTModule()
# pose.load_state_dict(torch.load('int8_hrnet.pth'))
pose.cuda().eval()

dummy_input = torch.randn(1, 3, 256, 192, dtype=torch.float, device=device)
repetitions = 1000
total_time = 0.0

with torch.no_grad():
    for rep in range(repetitions):
        # CUDA events time the GPU work itself, not just the Python call
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        starter.record()
        _ = pose(dummy_input)
        ender.record()
        torch.cuda.synchronize()  # wait for the forward pass to finish
        total_time += starter.elapsed_time(ender) / 1000  # elapsed_time() returns ms

throughput = repetitions / total_time  # iterations (= images at batch size 1) per second
print('Final Throughput:', throughput)
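
The batch-10 test was essentially the same, just with a batched input, fewer repetitions, and throughput counted per image rather than per iteration. Roughly like this (the per-image normalization is how I interpreted my numbers; exact variable names are mine):

batch_size = 10
repetitions = 100
dummy_input = torch.randn(batch_size, 3, 256, 192, dtype=torch.float, device=device)

total_time = 0.0
with torch.no_grad():
    for rep in range(repetitions):
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        starter.record()
        _ = pose(dummy_input)
        ender.record()
        torch.cuda.synchronize()
        total_time += starter.elapsed_time(ender) / 1000

# each iteration processes batch_size images, so normalize per image
throughput = (repetitions * batch_size) / total_time
print('Final Throughput:', throughput)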

With batch size 1 the throughput was about 27 fps (~37 ms per image), while with batch size 10 over 100 repetitions it was about 110 fps (~9 ms per image).
Keep in mind that the same test with a different model (ViTPose) yields the same throughput regardless of batch size.
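
One caveat I'm aware of: there is no warm-up before the timed loop, so the first iterations include one-time CUDA context and cuDNN setup costs. If that matters, something like this before the timing loop should rule it out:

# a few untimed passes so one-time CUDA/cuDNN setup costs
# don't count against the timed iterations
with torch.no_grad():
    for _ in range(50):  # 50 is an arbitrary choice
        _ = pose(dummy_input)
torch.cuda.synchronize()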
Is this possible, i.e. can a higher batch size lead to a lower per-image latency?