Will running torch.cuda.synchronize slow down inference performance?

I’m doing inference in a loop. Think:

for batch in dl:
    outputs = model(batch)

I’m trying to benchmark how long it takes for my code to do inference on a single batch.

If I add a torch.cuda.synchronize() before the np.savez line, will this slow down my inference code at all?

No, you should not see any meaningful slowdown from adding torch.cuda.synchronize(), because moving the CUDA tensor to the CPU via outputs.cpu() already forces a synchronization: the copy blocks until every queued GPU kernel that produces outputs has finished. An explicit synchronize() before that point only makes the wait happen earlier, which is exactly what you want when timing a single batch, since CUDA kernel launches are asynchronous and the clock would otherwise stop before the GPU work is done.
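As a concrete illustration, here is a minimal per-batch timing sketch. The model and dataloader are hypothetical stand-ins (a small nn.Linear and a list of random tensors, not your actual setup); the synchronize() calls are guarded so the snippet also runs on a CPU-only machine:

    import time

    import torch
    import torch.nn as nn

    # Hypothetical stand-ins for the model and dataloader from the question.
    model = nn.Linear(16, 4)
    dl = [torch.randn(8, 16) for _ in range(3)]

    timings = []
    with torch.no_grad():
        for batch in dl:
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # drain any pending GPU work before starting the clock
            start = time.perf_counter()
            outputs = model(batch)
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # wait for the forward pass to actually finish
            timings.append(time.perf_counter() - start)

    print(f"mean per-batch time: {sum(timings) / len(timings):.6f}s")

Without the second synchronize() (or an outputs.cpu() call inside the timed region), perf_counter would only measure the time to enqueue the kernels, not to execute them.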