Is it possible to use hooks to measure the latency of each block running on CUDA?

In this blog the author describes the correct way to measure latency on the GPU using torch.cuda.Event with proper synchronization. But that only gives the end-to-end latency, not a per-block profile. Is it possible to register forward hooks on blocks or layers to get each block's latency? It seems that hooks do not run alongside the forward computation, but rather before or after it, so I am not sure whether they can be used for this. Thanks.
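For what it's worth, here is one possible sketch (not an official recipe, just an assumption about how this could work): register a `register_forward_pre_hook` and a `register_forward_hook` on each block, record a `torch.cuda.Event` in each, and synchronize the end event before reading `elapsed_time`. The `BlockTimer` class and its method names are my own invention for illustration; it falls back to `time.perf_counter` on CPU so the same code runs without a GPU.

```python
import time

import torch
import torch.nn as nn


class BlockTimer:
    """Illustrative helper (hypothetical, not a PyTorch API): times each
    direct child module via forward pre/post hooks. Uses CUDA events when
    a GPU is available, wall-clock time otherwise."""

    def __init__(self):
        self.latencies_ms = {}   # module name -> latency in milliseconds
        self._starts = {}
        self._handles = []

    def attach(self, model: nn.Module):
        for name, module in model.named_children():
            self._handles.append(module.register_forward_pre_hook(self._pre(name)))
            self._handles.append(module.register_forward_hook(self._post(name)))

    def _pre(self, name):
        def hook(module, inputs):
            if torch.cuda.is_available():
                evt = torch.cuda.Event(enable_timing=True)
                evt.record()          # enqueued on the current CUDA stream
                self._starts[name] = evt
            else:
                self._starts[name] = time.perf_counter()
        return hook

    def _post(self, name):
        def hook(module, inputs, output):
            if torch.cuda.is_available():
                end = torch.cuda.Event(enable_timing=True)
                end.record()
                end.synchronize()     # wait for this block's kernels to finish
                self.latencies_ms[name] = self._starts[name].elapsed_time(end)
            else:
                self.latencies_ms[name] = (time.perf_counter() - self._starts[name]) * 1e3
        return hook

    def detach(self):
        for h in self._handles:
            h.remove()


# Usage on a toy model (children are named "0", "1", "2" by nn.Sequential)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
timer = BlockTimer()
timer.attach(model)
model(torch.randn(2, 8))
timer.detach()
print(timer.latencies_ms)
```

Note that the point about hooks running "before or after" the forward is exactly what makes this work: the pre-hook fires just before the block's kernels are launched and the post-hook just after, so events recorded there bracket the block's work on the stream. The per-block `end.synchronize()` does force the stream to drain after every block, which adds overhead and hides any cross-block overlap; an alternative is to record all events first and compute the elapsed times after a single `torch.cuda.synchronize()` at the end.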