CUDA manual startup

Consider the following snippet:

from time import time
import torch

for run in range(5):
    start = time()
    torch.empty(1, device="cuda")
    end = time()
    print(f"{run}: {end - start:.2g} seconds")

On my machine this results in the following output:

0: 1.6 seconds
1: 3e-05 seconds
2: 7.9e-06 seconds
3: 7.4e-06 seconds
4: 7.2e-06 seconds

It is pretty clear that the first time a tensor is moved to the GPU, some startup or initialization is happening. Is it possible to do this manually up front?

I’m asking because I have multiple tests in my test suite that require CUDA. Whichever test runs first indirectly performs the startup and is therefore slow. I have a flag to skip slow tests, but the test itself is not slow.

The first CUDA call initializes the CUDA context, which is why it is this slow.
You might want to add an init method that calls torch.cuda.init() or creates an arbitrary CUDA tensor.

Also note that you are not timing the tensor creation in this code snippet, since CUDA operations run asynchronously and you are not synchronizing. The ~1e-6 second timings might thus only reflect the Python overhead of the kernel launch.
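As a rough illustration of the synchronization point, here is a minimal sketch of the same loop with a torch.cuda.synchronize() call before the timer is stopped. The helper name timed_empty and the CPU fallback are assumptions added for illustration and are not part of the original snippet.

```python
from time import perf_counter

import torch


def timed_empty(device):
    """Time torch.empty on `device`, waiting for queued GPU work before stopping the clock."""
    start = perf_counter()
    torch.empty(1, device=device)
    if device.type == "cuda":
        # CUDA calls return before the GPU has finished; block until it is done.
        torch.cuda.synchronize(device)
    return perf_counter() - start


# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for run in range(5):
    print(f"{run}: {timed_empty(device):.2g} seconds")
```

The first iteration should still include the one-time startup cost, while the later iterations now measure completed work rather than only the time needed to queue it.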

Facepalm. Sorry, I’m not sure how I missed torch.cuda.init() in the docs.

About the timing: this was just my quick and dirty method to show the difference between the initial startup and the following runs. I do not use this to actually profile my code.

@ptrblck Are you sure torch.cuda.init() is the way to go here? I’ve added it before the loop and this has a negligible effect on the first run.

You are right. I only see the smaller ~3MB allocation, but not the creation of the CUDA context.
In that case you might need to create a dummy tensor in the init method instead.
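A minimal sketch of such an init method, assuming the dummy-tensor approach described above. The function name warm_up_cuda and its boolean return value are hypothetical and chosen only for illustration.

```python
import torch


def warm_up_cuda():
    """Force the CUDA startup by allocating a throwaway tensor.

    Returns True if a CUDA device was warmed up, False if none is available.
    """
    if not torch.cuda.is_available():
        return False
    # The first CUDA allocation triggers the one-time startup cost.
    torch.empty(1, device="cuda")
    # Wait until the device has finished before returning.
    torch.cuda.synchronize()
    return True


warmed = warm_up_cuda()
print(f"CUDA warmed up: {warmed}")
```

If the test suite uses pytest (an assumption, since the thread does not name the test runner), one option would be to call this helper once from a session-scoped, autouse fixture in conftest.py, so that no individual test pays the startup cost.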