I had to run some inference speed tests because I was getting odd results: with batch size 1, the model's per-image inference time was roughly 4 times higher than with batch size 10.
Below is the code for the inference test.
```python
import torch

pose = HRNet(32, 17)  # HRNet model class from my project
pose.load_state_dict(torch.load('pose_hrnet_w32_256x192.pth'))
# pose = TRTModule()
# pose.load_state_dict(torch.load('int8_hrnet.pth'))
pose.cuda().eval()

device = torch.device("cuda")
dummy_input = torch.randn(1, 3, 256, 192, dtype=torch.float).to(device)

repetitions = 1000
total_time = 0
with torch.no_grad():
    for rep in range(repetitions):
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        starter.record()
        _ = pose(dummy_input)
        ender.record()
        torch.cuda.synchronize()  # wait for the GPU so the timing is accurate
        total_time += starter.elapsed_time(ender) / 1000  # elapsed_time is in ms

throughput = repetitions / total_time
print('Final Throughput:', throughput)
```
With this setup the throughput was 27 FPS, while the same test with batch size 10 and 100 repetitions gave 110.
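For reference, the batch-10 run differed only in the input shape, the repetition count, and how throughput is read off. Below is a minimal sketch of that variant; the warm-up loop and the images-per-second line are additions for illustration, not part of my original test, but warm-up is generally recommended so one-time CUDA/cuDNN setup costs do not skew the first timed iterations:

```python
import torch

# Assumes `pose` is the same HRNet model as above, already on the GPU and in eval mode.
device = torch.device("cuda")
batch_size = 10
dummy_input = torch.randn(batch_size, 3, 256, 192, dtype=torch.float).to(device)

# Warm-up: a few untimed iterations so initialization is not counted.
with torch.no_grad():
    for _ in range(20):
        _ = pose(dummy_input)
torch.cuda.synchronize()

repetitions = 100
total_time = 0
with torch.no_grad():
    for rep in range(repetitions):
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        starter.record()
        _ = pose(dummy_input)
        ender.record()
        torch.cuda.synchronize()
        total_time += starter.elapsed_time(ender) / 1000  # ms -> s

# repetitions / total_time counts batches per second; multiplying by the
# batch size gives images per second, which is the comparable number.
print('Batches/s:', repetitions / total_time)
print('Images/s:', repetitions * batch_size / total_time)
```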
Keep in mind that the same test with a different model (ViTPose) yields the same result regardless of the batch size.
Is this something that is possible, i.e. can a higher batch size lead to lower per-image latency (higher throughput)?