I had to run some inference speed tests because I was getting odd results: with batch size 1, the model's per-image inference time was roughly 4 times higher than with batch size 10.
Below is the code for the inference test.
```python
import torch

pose = HRNet(32, 17)  # HRNet model class from my project
pose.load_state_dict(torch.load('pose_hrnet_w32_256x192.pth'))
# pose = TRTModule()
# pose.load_state_dict(torch.load('int8_hrnet.pth'))
pose.cuda().eval()

device = torch.device("cuda")
dummy_input = torch.randn(1, 3, 256, 192, dtype=torch.float).to(device)

repetitions = 1000
total_time = 0
with torch.no_grad():
    for rep in range(repetitions):
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        starter.record()
        _ = pose(dummy_input)
        ender.record()
        torch.cuda.synchronize()  # wait for the GPU so the timing is accurate
        total_time += starter.elapsed_time(ender) / 1000  # elapsed_time is in ms

throughput = repetitions / total_time
print('Final Throughput:', throughput)
```
With this setup the throughput was 27 FPS, while the same test with batch size 10 and 100 repetitions gave 110.
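For reference, the batch-10 run differed only in the input shape, the repetition count, and how throughput is read off. Below is a minimal sketch of that variant; the warm-up loop and the images-per-second line are additions for illustration, not part of my original test, but warm-up is generally recommended so one-time CUDA/cuDNN setup costs do not skew the first timed iterations:

```python
import torch

# Assumes `pose` is the same HRNet model as above, already on the GPU and in eval mode.
device = torch.device("cuda")
batch_size = 10
dummy_input = torch.randn(batch_size, 3, 256, 192, dtype=torch.float).to(device)

# Warm-up: a few untimed iterations so initialization is not counted.
with torch.no_grad():
    for _ in range(20):
        _ = pose(dummy_input)
torch.cuda.synchronize()

repetitions = 100
total_time = 0
with torch.no_grad():
    for rep in range(repetitions):
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        starter.record()
        _ = pose(dummy_input)
        ender.record()
        torch.cuda.synchronize()
        total_time += starter.elapsed_time(ender) / 1000  # ms -> s

# repetitions / total_time counts batches per second; multiplying by the
# batch size gives images per second, which is the comparable number.
print('Batches/s:', repetitions / total_time)
print('Images/s:', repetitions * batch_size / total_time)
```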
Keep in mind that the same test with a different model (ViTPose) yields the same result regardless of the batch size.
Is this something that is possible, i.e. can a higher batch size lead to lower per-image latency (higher throughput)?