The inference time should not increase if you lower the batch size, so something looks wrong.
Are you using your GPU? If so, note that CUDA kernels are executed asynchronously, so you would need to synchronize the code via torch.cuda.synchronize() before starting and stopping the host timers.
First of all, thank you for your prompt response. I understand what you said, but using torch.cuda.synchronize() doesn't help speed things up. My GPU appears to be active.
I didn't mean to claim that synchronizations would speed up your code, but that they are needed if you want to profile the actual GPU execution.
I.e., you should add a synchronization via torch.cuda.synchronize() before reading a host timer such as time.perf_counter(); otherwise you are measuring only the kernel launch overhead, not the kernel runtime.
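Something like this minimal sketch, where `model` and `x` are placeholders for your actual workload:

```python
import time
import torch

# placeholder model and input; substitute your own workload here
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# warmup iterations so CUDA context init and allocator caching don't skew the timing
for _ in range(10):
    _ = model(x)

torch.cuda.synchronize()  # wait for all previously queued kernels to finish
start = time.perf_counter()
for _ in range(100):
    _ = model(x)
torch.cuda.synchronize()  # wait for the kernels launched in the loop to finish
end = time.perf_counter()

print(f"avg forward time: {(end - start) / 100 * 1e3:.3f} ms")
```

Without the second synchronize() the timer would stop as soon as the kernels were launched, which is why lowering the batch size appeared to have no effect on your measured time.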