Generally, torch.utils.benchmark is a great tool to profile code, as it adds warmup iterations, synchronizes, etc.
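As a minimal sketch (the matmul workload and the tensor shape are just placeholders), timing an op with torch.utils.benchmark.Timer could look like this:

import torch
from torch.utils import benchmark

x = torch.randn(1024, 1024, device="cuda")

# Timer performs a warmup run and synchronizes the GPU before
# reading the clock, unlike a naive time.time() loop
timer = benchmark.Timer(
    stmt="x @ x",
    globals={"x": x},
)
print(timer.timeit(100))  # runs the statement 100 times and prints the measurement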
For information about using cudaEvents for profiling, take a look at this post, which shows an example:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// copy the input data to the device
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

// enqueue the start event, the kernel, and the stop event
// into the default stream
cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

// block the host until the stop event has been recorded,
// then read the time elapsed between the two events
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

// release the events once they are no longer needed
cudaEventDestroy(start);
cudaEventDestroy(stop);
CUDA events are of type cudaEvent_t and are created and destroyed with cudaEventCreate() and cudaEventDestroy(). In the above code cudaEventRecord() places the start and stop events into the default stream, stream 0. The device will record a time stamp for the event when it reaches that event in the stream. The function cudaEventSynchronize() blocks CPU execution until the specified event is recorded. The cudaEventElapsedTime() function returns in its first argument the elapsed time in milliseconds between the recording of start and stop. This value has a resolution of approximately one half microsecond.
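The same pattern is available in PyTorch via torch.cuda.Event. A minimal sketch (the matmul workload is again a placeholder) mirroring the CUDA snippet above:

import torch

start = torch.cuda.Event(enable_timing=True)
stop = torch.cuda.Event(enable_timing=True)

x = torch.randn(1024, 1024, device="cuda")

start.record()  # enqueue the start event into the current stream
y = x @ x       # the workload to be timed
stop.record()   # enqueue the stop event

stop.synchronize()  # block the host until stop has been recorded
print(start.elapsed_time(stop))  # elapsed time in milliseconds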
The TL;DR: you have to synchronize on the event, either directly via the event object or globally via torch.cuda.synchronize().
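In the sketch above, stop.synchronize() waits on the stop event only; torch.cuda.synchronize() would instead wait for all pending work on the device, which is the simpler (if coarser) option.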