Clarification on torch.cuda.synchronize() and CPU-GPU Synchronization

Hello everyone,

I have a question regarding torch.cuda.synchronize() and how the CPU interacts with GPU tasks in PyTorch.

I’ve already read this thread: How to measure execution time in PyTorch?, but I still have some doubts about how synchronization works.

It is commonly stated that CUDA operations are launched asynchronously: when you issue a command for the GPU, the CPU continues executing without waiting for the GPU to finish. I wanted to explore this behavior using two similar code snippets:

Code 1

import torch

device = "cuda:0"
size = 2**22
x = torch.arange(0, size) / size + 1j * torch.arange(0, size) / size
x = x.to(device)
X = 0  # Initial value that should be overwritten by the fft result
X = torch.fft.fft(x)  # This operation is launched asynchronously on the GPU
y = X + 1  # The CPU continues; if it raced ahead, it would still see X = 0
print(y)  # Expected output: 1, because the fft is heavy and takes time

Code 2

import torch

device = "cuda:0"
size = 2**22
x = torch.arange(0, size) / size + 1j * torch.arange(0, size) / size
x = x.to(device)
X = torch.fft.fft(x)  # This operation is launched asynchronously on the GPU
torch.cuda.synchronize()  # Block the CPU until all queued GPU work has finished
y = X + 1  # The fft result is guaranteed to be ready at this point
print(y)  # The result is printed

Based on the explanation in the discussion thread, Code 1 should print y = 1, since the CPU proceeds without waiting for the GPU result, while Code 2 should print y = X + 1, given that synchronization is forced. However, when I run them, both snippets produce the same correct result (i.e., y = X + 1 in both cases).

My question is:

What mechanism ensures that the CPU in Code 1 does not print an incorrect result (like y = 1), even though torch.cuda.synchronize() is not explicitly called? How does the CPU “know” to wait for the GPU result before proceeding to the next line?

No, Code 1 shouldn’t print 1: y = X + 1 schedules another CUDA kernel on the same default CUDA stream, and operations on a stream execute in the order they were enqueued, so the race condition you expect cannot occur. In addition, print(y) has to copy the tensor to the CPU, and that copy itself blocks until all pending work on the stream has finished.
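As a minimal sketch (assuming a CUDA-capable machine; the exact timings are illustrative), you can verify that the fft launch returns almost immediately while the kernel is still running, and that work queued on the same stream stays ordered:

import time
import torch

device = "cuda:0"
x = torch.randn(2**24, dtype=torch.complex64, device=device)
torch.cuda.synchronize()  # make sure setup work is done before timing

t0 = time.perf_counter()
X = torch.fft.fft(x)  # enqueued on the default stream; returns immediately
launch = time.perf_counter() - t0

t0 = time.perf_counter()
torch.cuda.synchronize()  # block the CPU until the fft kernel has finished
wait = time.perf_counter() - t0

print(f"launch returned after {launch * 1e6:.0f} us, kernel finished {wait * 1e6:.0f} us later")

y = X + 1  # enqueued on the same stream, so it runs strictly after the fft

With a large enough input, the launch typically returns within microseconds while the synchronize call waits much longer; that gap is exactly the asynchrony Code 1 exercises, made harmless by stream ordering.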

But the CPU reaches the y = X + 1 line before X = torch.fft.fft(x) has finished. How does the CPU know to launch a CUDA kernel for y = X + 1, when from the CPU’s point of view X = 0, a plain integer rather than a CUDA tensor?

That’s not the case: the call to torch.fft.fft(x) returns immediately, but what it returns is the output tensor, so the name X is rebound to a CUDA tensor before the kernel has finished. Its values may not be computed yet, but the tensor object exists, and PyTorch therefore schedules the launch of the next CUDA operation (X + 1) using this tensor as its input.
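As a quick check (again assuming a CUDA device), you can inspect X right after the call and before any synchronization:

import torch

device = "cuda:0"
x = torch.randn(2**24, dtype=torch.complex64, device=device)

X = 0  # a plain Python int
X = torch.fft.fft(x)  # returns the output tensor handle immediately
print(type(X), X.device)  # <class 'torch.Tensor'> cuda:0
# The values in X may not have been computed yet, but the tensor object
# already exists, so PyTorch knows to enqueue X + 1 as a CUDA kernel on
# the same stream rather than performing integer arithmetic on the CPU.
y = X + 1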