First iteration is 10x faster than subsequent iterations in for loop

I have a for loop in which the first iteration is substantially faster than subsequent iterations. This discrepancy only appears when I use a GPU; on a CPU every iteration takes more or less the same time. I’m wondering what causes this and whether there’s anything I can do to prevent the slowdown in later iterations.

The code is here: https://colab.research.google.com/drive/1TC6khI8T44KcYCrdAkHpldw_ejRG_eO2?usp=sharing

Apologies if this is a common question or there’s a simple explanation, but I haven’t been able to find any answers.

CUDA operations are executed asynchronously, so you would have to synchronize the code before starting and stopping the timer via torch.cuda.synchronize(). Otherwise you’ll only measure the Python overhead and the kernel launch time in the first iterations, until your script encounters a blocking operation that forces it to wait for all the previously queued kernels to finish — which is why the later iterations appear slower.
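To illustrate, here is a minimal sketch of the timing pattern described above. The `timed` helper and the matmul workload are illustrative stand-ins for your own loop body, not the code from the notebook; the synchronize calls are guarded so the sketch also runs on CPU-only machines:

```python
import time
import torch

def timed(fn, *args):
    """Time fn, synchronizing the GPU before starting and before stopping
    the timer so the measurement covers actual kernel execution rather
    than just the (asynchronous) kernel launch."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device)

for i in range(5):
    _, dt = timed(torch.mm, x, w)
    print(f"iter {i}: {dt * 1000:.3f} ms")
```

With the synchronization in place, per-iteration times on the GPU should be roughly uniform (apart from a possible one-time warm-up cost in the very first iteration, e.g. from CUDA context initialization). An alternative is to time with torch.cuda.Event objects and elapsed_time(), which measures on-device without blocking between launches.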

Ah! Thanks, things make sense after this.