tl;dr
The recommended profiling methods are:

- torch.profiler: collects detailed performance traces of CPU and GPU activity.
- torch.utils.benchmark: times small code snippets, handling warm-up and CUDA synchronization for you.
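As a minimal sketch of the second option, torch.utils.benchmark.Timer can time a statement directly; the matrix size and repetition count here are arbitrary choices for illustration:

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(256, 256)  # toy workload: a small matmul

timer = benchmark.Timer(
    stmt="x @ x",        # statement to benchmark
    globals={"x": x},    # names made visible to stmt
)
measurement = timer.timeit(100)  # run the statement 100 times
print(measurement.mean)          # mean seconds per run
```

Timer performs its own warm-up run, and on CUDA tensors it synchronizes around the timed region, so the bookkeeping shown later for manual timing is not needed here.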
CPU-only benchmarking
CPU operations are synchronous, so you can use any Python runtime profiling method, such as time.time().
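For example, a plain stdlib timer is enough for CPU-only code; the workload below is a hypothetical stand-in for a model step (time.perf_counter is used since it is a monotonic clock better suited to interval timing than time.time):

```python
import time

def work():
    # toy CPU-bound workload standing in for a model step
    return sum(i * i for i in range(100_000))

start = time.perf_counter()
work()
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.6f} s")
```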
CUDA benchmarking
Using time.time() alone won't be accurate here; it reports the time taken to launch the kernels, not the actual GPU execution time. Calling torch.cuda.synchronize() waits for all outstanding work on the GPU to complete, thereby providing an accurate measure of the time taken to execute.
```python
import time

import torch

# train() and epochs are assumed to be defined elsewhere.
train()                   # run all operations once for CUDA warm-up
torch.cuda.synchronize()  # wait for warm-up to finish

times = []
for e in range(epochs):
    torch.cuda.synchronize()
    start_epoch = time.time()
    train()
    torch.cuda.synchronize()  # wait for the epoch's GPU work to finish
    end_epoch = time.time()
    times.append(end_epoch - start_epoch)

avg_time = sum(times) / epochs
```