Building a Look-up Table for Latency

I’m interested in building a look-up table (LUT) to estimate the latency of a given neural network architecture. The idea is that architectures are built from a discrete set of operations combined in various ways. If we assume that the total latency of an architecture is the sum of the latencies of its individual operations, we get a simple way of estimating it.
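
To make the additive assumption concrete, here is a minimal sketch of what I have in mind. The operation names and latency values are made up for illustration (they are not measurements), and keying the table by operation name alone is a simplification; a real table would probably also need the input shape in the key.

```python
# Hypothetical per-operation latencies in seconds (illustrative values, not measured).
latency_lut = {
    "conv3x3": 0.0012,
    "conv1x1": 0.0004,
    "maxpool": 0.0002,
}


def estimate_latency(ops, lut):
    """Estimate total latency as the sum of the LUT entries for each operation."""
    return sum(lut[op] for op in ops)


# A hypothetical architecture described as a sequence of operation names.
arch = ["conv3x3", "conv1x1", "conv3x3", "maxpool"]
estimate = estimate_latency(arch, latency_lut)
print(f"estimated latency: {estimate:.4f} s")
```

An operation missing from the table raises a KeyError here, which seems preferable to silently underestimating the total.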

My question is: what is the best way to measure latency on a GPU? I use the following code to measure the latency of a complete end-to-end architecture. Is this synchronous code also valid for timing single operations on a GPU?

import numpy as np
import torch

def latency_on_gpu(net, input_v, N=100):
    input_v = input_v.to('cuda:0')
    net.to('cuda:0')
    timings = np.zeros((N, 1))
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    # Warm-up runs so one-time initialization costs are not timed
    for _ in range(10):
        _ = net(input_v)
    for n in range(N):
        starter.record()
        y = net(input_v)
        ender.record()
        # Wait for the GPU to finish before reading the events
        torch.cuda.synchronize()
        # elapsed_time() returns milliseconds; convert to seconds
        curr_time = starter.elapsed_time(ender) / 1000
        timings[n] = curr_time

    return timings