In this blog the author describes the correct way to measure latency on a GPU using `torch.cuda.Event` and proper synchronization. But that only gives end-to-end latency, not a per-block profile. Is it possible to register forward hooks on blocks or layers to get each block's latency? It seems that hooks don't run inside a module's forward itself, but immediately before or after it, so I'm not sure whether they can be useful for this. Thanks.
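For what it's worth, a forward *pre-hook* fires right before a module's `forward` and a forward hook fires right after it, so the pair does bracket exactly the work of that block. Below is a minimal sketch of the idea (the helper name `attach_latency_hooks` is my own): it records a `torch.cuda.Event` pair around each top-level child when CUDA is available, and falls back to `time.perf_counter` on CPU. Note the per-block `synchronize()` call adds overhead and serializes the stream, so this is for profiling only, not production timing.

```python
import time
import torch
import torch.nn as nn

def attach_latency_hooks(model, timings):
    """Bracket each top-level child module with pre/post forward hooks
    and append its latency in milliseconds to timings[name]."""
    use_cuda = torch.cuda.is_available()
    handles = []
    for name, module in model.named_children():
        state = {}  # shared between the pre- and post-hook of this block

        def pre(mod, inp, state=state):
            if use_cuda:
                state["start"] = torch.cuda.Event(enable_timing=True)
                state["end"] = torch.cuda.Event(enable_timing=True)
                state["start"].record()  # enqueued on the current stream
            else:
                state["t0"] = time.perf_counter()

        def post(mod, inp, out, state=state, name=name):
            if use_cuda:
                state["end"].record()
                # Wait for the kernels between the two events to finish,
                # otherwise elapsed_time() is not valid yet.
                torch.cuda.synchronize()
                ms = state["start"].elapsed_time(state["end"])
            else:
                ms = (time.perf_counter() - state["t0"]) * 1e3
            timings.setdefault(name, []).append(ms)

        handles.append(module.register_forward_pre_hook(pre))
        handles.append(module.register_forward_hook(post))
    return handles

# Usage: time each child of a toy model for one forward pass.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
timings = {}
handles = attach_latency_hooks(model, timings)
with torch.no_grad():
    model(torch.randn(8, 64))
for h in handles:
    h.remove()  # detach the hooks once profiling is done
print(timings)  # e.g. {"0": [...], "1": [...], "2": [...]} in ms
```

One caveat: because GPU execution is asynchronous, the pre-hook's `record()` only enqueues a marker on the stream; the synchronization in the post-hook is what makes the measurement meaningful, at the cost of preventing overlap between blocks.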