First call to torch.rand consistently slower than subsequent calls

I ran a simple experiment, shown below (PyTorch 1.13.1+cu117). The first call to torch.rand is consistently slower than the subsequent calls. I benchmarked on both CUDA (top) and CPU (bottom).
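The benchmark screenshots aren't reproduced here, but a minimal sketch of the timing procedure might look like the following (the `time_calls` helper is hypothetical; it synchronizes after each call so the asynchronous CUDA kernel time is actually included in the measurement):

```python
import time

import torch

def time_calls(make_tensor, n=3):
    # Time n successive calls; torch.cuda.synchronize() ensures the
    # asynchronously launched CUDA kernel has finished before the clock stops.
    times = []
    for _ in range(n):
        start = time.perf_counter()
        make_tensor()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return times

device = "cuda" if torch.cuda.is_available() else "cpu"
times = time_calls(lambda: torch.rand(10_000, 10_000, device=device))
print([f"{t:.4f}s" for t in times])  # on CUDA, the first entry is typically the largest
```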

On CPU, the effect disappears for torch.rand, though the first call to torch.sum is consistently slower than the rest.
I am profiling like so, in case there is something wrong with my profiling procedure:

CUDA_LAUNCH_BLOCKING=1 kernprof -lv && python -m line_profiler

Could someone explain what goes on internally that makes this first call so much slower than the second on CUDA but not on CPU? And why, on CPU, is the first call to torch.sum still much slower than the subsequent calls?

I am told the effect disappears on PyTorch 2.0. Why is that? (I also asked on Twitter here; there is some discussion but no definitive answer.)

I don’t know how kernprof profiles the workload, but CUDA is initialized lazily, so overhead in the first calls is expected. This is due to:

  • the initialization of the CUDA driver and the CUDA context
  • starting with CUDA 11.7, lazy module loading is enabled by default, which loads each needed CUDA kernel into the CUDA context on its first call to save GPU memory
  • memory allocations via cudaMalloc are slow and synchronize the device, which is why PyTorch uses a caching allocator to reuse memory where possible
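Because of these one-time costs, the usual benchmarking practice is to run (and discard) warm-up iterations before timing. A sketch of that pattern (the `bench` helper is illustrative, not a PyTorch API):

```python
import time

import torch

def bench(fn, warmup=2, iters=5):
    # Warm-up calls absorb the one-time costs: CUDA context creation,
    # lazy kernel loading, and the first cudaMalloc round trips.
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
avg = bench(lambda: torch.rand(1_000, 1_000, device=device))
print(f"steady-state average: {avg:.6f}s")
```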

Cool, thanks a lot! Just a few follow-up questions and clarifications:

  • The initialization of the CUDA driver + context should be included in the torch.cuda.init() line, so, per the first image, it accounts for 0.26s of the slowness. Is that right?
  • For the second point, you mean that the kernels needed to create a random 10k x 10k matrix are loaded only on the first call, not on subsequent ones. Can you explain how that saves GPU memory?
  • For your third point, does this only happen on CUDA and not on CPU?

Is there a reasonable way to profile the magnitude of the effects of the second and third points (not necessarily with kernprof), e.g. seeing how much time is spent loading the needed CUDA kernels into the CUDA context?
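For the driver/context part at least, one rough way to isolate it is to time torch.cuda.init() explicitly, assuming nothing else in the process has touched CUDA yet. This does not separate the lazy kernel loading or allocator costs, which land in the first op instead:

```python
import time

import torch

init_time = None
if torch.cuda.is_available():
    start = time.perf_counter()
    torch.cuda.init()  # forces driver + context initialization now
    init_time = time.perf_counter() - start
print(init_time)  # on a cold process, typically a few hundred milliseconds
```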

I don’t know how to interpret your profiler outputs, and since CUDA ops are executed asynchronously, I would be careful with these numbers.
The proper way would be to use e.g. Nsight Systems, as described here, and profile the entire workload, which would show the overhead in the first calls.
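For example, wrapping the calls in NVTX ranges makes the first-call overhead visible as named spans on the Nsight Systems timeline (the script and report names below are illustrative):

```python
import torch

# Profile with Nsight Systems, e.g.:
#   nsys profile -o first_call_report python this_script.py
def main():
    if not torch.cuda.is_available():
        return "no cuda"
    for label in ("first torch.rand", "second torch.rand"):
        torch.cuda.nvtx.range_push(label)  # named span on the nsys timeline
        torch.rand(10_000, 10_000, device="cuda")
        torch.cuda.synchronize()
        torch.cuda.nvtx.range_pop()
    return "done"

result = main()
print(result)
```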