What is the correct way to measure the total execution time for a pytorch function running on GPU?

Below is example code showing what I am trying to measure. Here I am using time.perf_counter() to measure time. Is this the correct way to measure execution time in this scenario? If not, what is the correct way? My concern is that GPU evaluations are asynchronous, so the GPU execution might not be complete when ExecTime is computed below.

import torch
import torch.nn.functional as F
import time

Device = torch.device("cuda:0")
ProblemSize = 100
NumChannels = 5
NumFilters = 96
ClassType = torch.float32

X = torch.rand(1, NumChannels, ProblemSize, ProblemSize, dtype=ClassType).to(Device)
weights = torch.rand(NumFilters, NumChannels, 10, 10, dtype=ClassType).to(Device)
    
#warm up
Y = F.conv2d(X, weights)
Y = F.conv2d(X, weights)

#time
t = time.perf_counter()
Y = F.conv2d(X, weights)
ExecTime = time.perf_counter() - t

As you assumed, you would have to synchronize the code before starting and stopping the timer via torch.cuda.synchronize(); otherwise the timer may stop while kernels are still executing on the GPU, and you would only measure the kernel launch overhead.

I would also recommend adding some warm-up iterations, timing the operation in a loop, and reporting the average execution time.
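Putting the pieces together, a minimal sketch of your example with synchronization, warm-up, and loop averaging could look like this (the iteration counts are arbitrary choices, and the CPU fallback is added so the sketch also runs on machines without a GPU):

```python
import time
import torch
import torch.nn.functional as F

# Fall back to CPU if no GPU is available, so the sketch runs anywhere.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.rand(1, 5, 100, 100, dtype=torch.float32, device=device)
weights = torch.rand(96, 5, 10, 10, dtype=torch.float32, device=device)

# Warm-up: covers CUDA context creation, cuDNN algorithm selection, caching.
for _ in range(10):
    y = F.conv2d(x, weights)

# Make sure all queued GPU work has finished before starting the timer.
if device.type == "cuda":
    torch.cuda.synchronize()

n_iters = 100
t0 = time.perf_counter()
for _ in range(n_iters):
    y = F.conv2d(x, weights)
# Wait for the last kernel to finish before stopping the timer.
if device.type == "cuda":
    torch.cuda.synchronize()
avg_time = (time.perf_counter() - t0) / n_iters
print(f"average conv2d time: {avg_time * 1e3:.3f} ms")
```

Without the second synchronize call, the timer would stop as soon as the kernels were enqueued, not when they finished, which is exactly the pitfall you describe.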