How does torch.cuda.synchronize() behave?

According to the PyTorch documentation, torch.cuda.synchronize() “Waits for all kernels in all streams on a CUDA device to complete.” Questions:

  1. Should this say “Waits for all kernels in all streams initiated by this Python session on a CUDA device to complete”? In other words, if Python session A is running CUDA operations, and I call torch.cuda.synchronize() in Python session B, that won’t care about what’s happening in Python session A right?

  2. Surely if we don’t call torch.cuda.synchronize(), but try to work with any Python code referencing the tensors in the computation graph, then it’s like implicitly calling it, right?

Q2 in code:

output = model(inputs)  # cuda starts working here
a = 1 + 1  # cuda might still be running the previous line. This line can run at the same time
other_model(output) # This implicitly does the same thing as torch.cuda.synchronize() then does a forward pass of other_model
b = a + a  # This line can't happen until cuda is done and the previous line has been executed

  1. Work of independent processes should be serialized (CUDA MPS might be the exception). Process A doesn’t know anything about process B, so a synchronize() (or cudaDeviceSynchronize) call only synchronizes the work of the current process. However, if process B uses the GPU for display output etc., you might see a latency increase depending on when the context switch occurs.

  2. It depends on what you are trying to do in your Python code. E.g. if you are only scheduling work, no synchronizations are needed and none will be added. On the other hand, if you are e.g. printing a value of a CUDA tensor, then an implicit sync is added, since the value has to be computed and copied to the CPU first.
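To make the second point concrete, here is a minimal sketch rather than a definitive benchmark. It assumes a CUDA device may or may not be present and falls back to the CPU, where everything runs synchronously. The helper name `launch_then_read` and the matrix size are arbitrary choices for illustration.

```python
import time

import torch


def launch_then_read(n: int = 1024):
    """Return (seconds to enqueue a matmul, seconds until its value is on the host, value)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(n, n, device=device)
    _ = x @ x  # warm-up so one-time library initialization is not measured
    if device == "cuda":
        torch.cuda.synchronize()  # make sure setup work is finished before timing

    t0 = time.perf_counter()
    y = x @ x  # on CUDA this only enqueues the kernel and returns
    t_launch = time.perf_counter() - t0

    value = y.sum().item()  # .item() copies the result to the host, so it has to wait
    t_read = time.perf_counter() - t0
    return t_launch, t_read, value


t_launch, t_read, value = launch_then_read()
print(f"enqueue: {t_launch * 1e3:.3f} ms, value on host after: {t_read * 1e3:.3f} ms")
```

On a GPU the first number is typically much smaller than the second, because only the `.item()` call forces the host to wait for the result.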

Thanks so much for picking this up, @ptrblck! That clarifies some things for me. So a related follow-up: in various blogs/snippets showing how to time a model forward pass correctly, I’ve seen a pattern like:

starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
with torch.inference_mode():
    starter.record()
    model.inference(inp)
    ender.record()
    torch.cuda.synchronize()
total_time += starter.elapsed_time(ender)

Although, at least once I’ve also seen it with the ender.record() and torch.cuda.synchronize() swapped. I can see why someone might think you should swap them. Maybe they consider that ender.record() might run before the inference is done if synchronization is not done first, following the logic in this snippet. Although I’m wondering if ender.record() behaves differently from time.time(). Long question short: which way is correct and why?

Generally, torch.utils.benchmark is a great tool to profile code, as it adds warmup iterations, synchronizations, etc.
For information about using CUDA events for profiling, take a look at this post, which shows the following example:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

CUDA events are of type cudaEvent_t and are created and destroyed with cudaEventCreate() and cudaEventDestroy(). In the above code cudaEventRecord() places the start and stop events into the default stream, stream 0. The device will record a time stamp for the event when it reaches that event in the stream. The function cudaEventSynchronize() blocks CPU execution until the specified event is recorded. The cudaEventElapsedTime() function returns in the first argument the number of milliseconds time elapsed between the recording of start and stop. This value has a resolution of approximately one half microsecond.

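The TL;DR: record both events first, since recording only enqueues the event in the stream, then synchronize before reading the elapsed time, either directly via the event object (ender.synchronize()) or globally via torch.cuda.synchronize().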
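Translated back to PyTorch, the ordering could be sketched as follows. This is a minimal illustration, not the only valid way to write it: the helper name `time_forward_ms`, the warm-up count, and the small `torch.nn.Linear` model are assumptions made for the example, and the timing only runs when a CUDA device is available.

```python
import torch


def time_forward_ms(fn, *args, warmup: int = 3):
    """Time one call of fn(*args) in milliseconds using CUDA events."""
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA events need a CUDA device")
    for _ in range(warmup):
        fn(*args)  # warm-up calls, excluded from the measurement
    torch.cuda.synchronize()  # drain pending work before timing starts

    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    with torch.inference_mode():
        starter.record()  # enqueued in the stream before the forward pass
        fn(*args)
        ender.record()  # enqueued right after the forward pass
    ender.synchronize()  # block the host until `ender` has been recorded
    return starter.elapsed_time(ender)


if torch.cuda.is_available():
    model = torch.nn.Linear(256, 256).cuda()
    inp = torch.randn(64, 256, device="cuda")
    elapsed_ms = time_forward_ms(model, inp)
    print(f"forward pass: {elapsed_ms:.3f} ms")
else:
    elapsed_ms = None  # nothing to time without a GPU
```

Replacing `ender.synchronize()` with `torch.cuda.synchronize()` gives the same measurement. The global call simply waits for all streams on the device instead of only this event.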

Thanks @ptrblck, this answers my follow-up question.

For interested readers: simple time-based benchmarks also benefit from torch.cuda.synchronize(): Google Colab
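For readers without access to the notebook, a wall-clock version might look like the sketch below. It is not taken from the linked Colab: the helper name `wall_clock_ms`, the iteration count, and the matmul workload are assumptions for illustration. On a machine without CUDA it simply times the CPU, where no synchronization is needed.

```python
import time

import torch


def wall_clock_ms(fn, *args, iters: int = 10):
    """Average wall-clock time per call of fn(*args), in milliseconds."""
    use_cuda = torch.cuda.is_available()
    fn(*args)  # warm-up call, excluded from the measurement
    if use_cuda:
        torch.cuda.synchronize()  # drain pending work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if use_cuda:
        torch.cuda.synchronize()  # wait for the queued kernels before stopping the clock
    return (time.perf_counter() - t0) * 1e3 / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(512, 512, device=device)
avg_ms = wall_clock_ms(torch.matmul, x, x)
print(f"{avg_ms:.3f} ms per matmul")
```

Without the second `torch.cuda.synchronize()`, the clock would stop as soon as the kernels were enqueued, so the reported time would mostly reflect launch overhead rather than the actual computation.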