Running predict on the same image 5 times in a loop: the first iteration is much slower than the others

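CUDA operations are executed asynchronously, so you need to synchronize the code to get valid timings. Add a synchronization before starting and before stopping the timers via: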

torch.cuda.synchronize()
t0 = time.perf_counter()

or use torch.utils.benchmark, which synchronizes for you and adds warmup iterations.
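
To make the manual approach concrete, here is a minimal sketch of a per-iteration timing loop. The helper name timed_runs and the placeholders model and image are assumptions for illustration, not taken from your code. The sync argument is where torch.cuda.synchronize would be passed on a GPU.

```python
import time


def timed_runs(fn, n_runs=5, sync=None):
    """Return per-call wall-clock times (seconds) for n_runs calls of fn.

    sync is an optional zero-argument callable (e.g. torch.cuda.synchronize)
    called before each timer start and stop so queued GPU work has finished.
    """
    timings = []
    for _ in range(n_runs):
        if sync is not None:
            sync()  # finish any pending work before starting the timer
        t0 = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() before stopping
        timings.append(time.perf_counter() - t0)
    return timings


# Hypothetical GPU usage (model and image are placeholders):
#   import torch
#   with torch.no_grad():
#       timings = timed_runs(lambda: model(image), sync=torch.cuda.synchronize)
#   print(timings)  # compare timings[0] with the later iterations
```

With the synchronizations in place, any remaining gap between the first timing and the rest is likely one-time startup cost, so it is usually reasonable to discard the first iteration or run a few untimed warmup calls first.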