Below is example code showing what I am trying to measure. I am using time.perf_counter() to measure the elapsed time. Is this the correct way to measure execution time in this scenario? If not, what is the correct way? My concern is that GPU evaluations are asynchronous, so the GPU kernel might not have finished when ExecTime is computed below.
import torch
import torch.nn.functional as F
import time

Device = torch.device("cuda:0")
ProblemSize = 100
NumChannels = 5
NumFilters = 96
ClassType = torch.float32

X = torch.rand(1, NumChannels, ProblemSize, ProblemSize, dtype=ClassType).to(Device)
weights = torch.rand(NumFilters, NumChannels, 10, 10, dtype=ClassType).to(Device)

# warm up
Y = F.conv2d(X, weights)
Y = F.conv2d(X, weights)

# time
t = time.perf_counter()
Y = F.conv2d(X, weights)
ExecTime = time.perf_counter() - t
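For reference, here is a minimal sketch of how the timed region could be bracketed with torch.cuda.synchronize() so that the host clock measures the finished GPU work rather than just the kernel launch. It mirrors the setup above but falls back to CPU when no GPU is available (the fallback and the loop count are my additions, not part of the original question):

```python
import time

import torch
import torch.nn.functional as F

# Same shapes as in the question; run on CPU if CUDA is absent.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
X = torch.rand(1, 5, 100, 100, dtype=torch.float32, device=device)
weights = torch.rand(96, 5, 10, 10, dtype=torch.float32, device=device)

# warm up (two launches, as in the question)
for _ in range(2):
    F.conv2d(X, weights)

# Synchronize before starting the clock so no earlier work is pending,
# and again before stopping it so the conv2d kernel has finished.
if device.type == "cuda":
    torch.cuda.synchronize()
t = time.perf_counter()
Y = F.conv2d(X, weights)
if device.type == "cuda":
    torch.cuda.synchronize()
ExecTime = time.perf_counter() - t
print(f"conv2d took {ExecTime * 1e3:.3f} ms")
```

An alternative on CUDA devices is torch.cuda.Event(enable_timing=True), recording one event before and one after the op and calling elapsed_time() after a synchronize; that measures time on the GPU's own timeline.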