I’m interested in building a look-up table (LUT) to estimate the latency of a given neural network architecture. The idea is that network architectures consist of discrete operations combined in a variety of ways. If we assume that the total latency of an architecture is the sum of the latencies of its individual operations, we get a simple way to estimate it: look up each operation’s latency in the table and add them up.
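As a minimal sketch of that summation idea (the op keys and the millisecond values below are made-up placeholders, not real measurements):

import operator
from functools import reduce

# Hypothetical LUT mapping an operation descriptor to its measured latency (ms).
lut = {
    ('conv3x3', 64, 64): 0.80,
    ('conv1x1', 64, 128): 0.25,
    ('relu', 128, 128): 0.05,
}

def estimate_latency(ops, lut):
    """Estimate total latency as the sum of the per-op LUT entries."""
    return sum(lut[op] for op in ops)

arch = [('conv3x3', 64, 64), ('conv1x1', 64, 128), ('relu', 128, 128)]
total = estimate_latency(arch, lut)

Whether this additive model is accurate is exactly what the measurements need to validate, since real GPU execution overlaps kernels and adds per-launch overhead.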
My question is: what is the best way to measure latency on a GPU? I use the following code to measure the latency of a complete end-to-end architecture. Is this synchronized timing code also valid for measuring single operations on a GPU?
import numpy as np
import torch

def latency_on_gpu(net, input_v, N=100):
    input_v = input_v.to('cuda:0')
    net.to('cuda:0')
    net.eval()
    timings = np.zeros((N, 1))
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        # Warm-up runs (CUDA context init, cuDNN autotuning, caching allocator)
        for _ in range(10):
            _ = net(input_v)
        # Make sure the warm-up kernels have finished before timing starts
        torch.cuda.synchronize()
        for n in range(N):
            starter.record()
            _ = net(input_v)
            ender.record()
            # Wait for the timed kernels to complete before reading the events
            torch.cuda.synchronize()
            timings[n] = starter.elapsed_time(ender) / 1000  # ms -> s
    return timings