The PyTorch profiler documentation suggests that you can wrap individual sections of your code in a context manager to find the bottleneck.
However, I’m wondering whether the asynchronous nature of CUDA will lead to inaccurate measurements in this case. Should we call torch.cuda.synchronize() before entering and before leaving the context?
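To make the concern concrete, here is a thread-based analogy (not actual CUDA, and not the PyTorch profiler itself): launching work asynchronously returns immediately, so timing only the launch misses the real execution time, while waiting for completion (analogous to torch.cuda.synchronize()) captures it. The helper name async_launch is made up for this sketch.

```python
import threading
import time

def async_launch(duration):
    """Simulate an asynchronous kernel launch: returns a handle
    immediately while the 'work' continues in the background."""
    t = threading.Thread(target=time.sleep, args=(duration,))
    t.start()
    return t

# Timing WITHOUT synchronization: only the launch overhead is measured.
start = time.perf_counter()
handle = async_launch(0.2)
elapsed_no_sync = time.perf_counter() - start

# Timing WITH synchronization: waiting for the work to finish
# (like torch.cuda.synchronize()) measures the actual execution time.
start = time.perf_counter()
handle2 = async_launch(0.2)
handle2.join()
elapsed_sync = time.perf_counter() - start

handle.join()  # clean up the first background "kernel"

print(f"without sync: {elapsed_no_sync * 1000:.1f} ms")
print(f"with sync:    {elapsed_sync * 1000:.1f} ms")
```

If the profiler's context manager records wall-clock time the same naive way, my worry is that it would report only the launch cost for GPU-bound sections.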