The recommended profiling approach:

- CPU operations are synchronous, so you can use any Python runtime profiling method, such as `time.time()`.
- CUDA operations are asynchronous: `time.time()` alone won't be accurate here, because it reports only the time taken to launch the kernels, not their actual GPU execution time. Calling `torch.cuda.synchronize()` waits for all queued tasks on the GPU to complete, so timing after a synchronize provides an accurate measure of execution time.
```python
import time
import torch

train()                   # run all operations once for CUDA warm-up
torch.cuda.synchronize()  # wait for warm-up to finish

times = []
for e in range(epochs):
    start_epoch = time.time()
    train()
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    end_epoch = time.time()
    elapsed = end_epoch - start_epoch
    times.append(elapsed)

avg_time = sum(times) / epochs
```
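The warm-up/synchronize/time pattern above can be wrapped in a reusable helper. The sketch below is illustrative, not part of the original code: `time_fn` is a hypothetical name, and the `torch.cuda.is_available()` guard lets the same helper run on CPU-only machines, where the synchronize calls are simply skipped.

```python
import time
import torch

def time_fn(fn, iters=10, warmup=3):
    # Hypothetical helper: average wall-clock seconds per call of `fn`.
    # Synchronizes the GPU (when present) so the measurement covers
    # actual kernel execution, not just kernel launches.
    use_cuda = torch.cuda.is_available()
    for _ in range(warmup):       # warm-up runs (caching, autotuning, etc.)
        fn()
    if use_cuda:
        torch.cuda.synchronize()  # make sure warm-up work has finished
    start = time.time()
    for _ in range(iters):
        fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for all queued kernels to complete
    return (time.time() - start) / iters
```

With this helper, the epoch loop above reduces to something like `avg_time = time_fn(train, iters=epochs, warmup=1)`.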