I’m trying to understand how to profile CUDA code, especially to find bottlenecks in training loops. When I use something like `%lprun` (the `line_profiler` magic for Jupyter notebooks), I find that a large percentage of the time is attributed to the `torch.cuda.synchronize()` lines. I understand that CUDA kernel launches are asynchronous, so the cost of everything queued earlier piles up at the synchronization point, but that makes it difficult to identify which parts of the code are actually responsible for most of the time.

Furthermore, I’d like to understand how GPU utilization affects the total training time for one epoch. For example, if I have multiple dataloaders that are identical except for batch size, I might notice that the time to complete one epoch is shortest at a batch size of 32, and longer for batch sizes both smaller and greater than 32 — yet GPU utilization during training is 100% for all three loaders.

Lastly, if GPU utilization is not at 100%, then it’s likely that the dataloader is the bottleneck, but since data loading happens in parallel worker processes, I don’t know how to figure out where to improve it.

Would anyone care to share how I might start looking into these things? Maybe tools, or workflows?
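To make the last question concrete, here is a rough sketch of the kind of instrumentation I’ve been experimenting with to separate “waiting on the dataloader” from “running the training step” (all names here are made up: `fetch` and `train_step` stand in for a real loader and model, and the sleeps simulate the work; in real GPU code you would call `torch.cuda.synchronize()` right before each `perf_counter()` reading so queued kernels get charged to the correct bucket):

```python
import time

def profile_epoch(batches, train_step):
    """Accumulate time spent waiting for data vs. running the step.

    NOTE: with a real GPU, insert torch.cuda.synchronize() immediately
    before each time.perf_counter() call, otherwise asynchronous kernel
    time leaks into whichever line happens to synchronize next.
    """
    data_time = 0.0
    step_time = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)        # time blocked waiting on the dataloader
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)           # forward / backward / optimizer step
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time

# Simulated run where the data pipeline is the bottleneck (hypothetical numbers).
def fetch():
    for i in range(3):
        time.sleep(0.02)            # workers can't keep up with the GPU
        yield i

data_t, step_t = profile_epoch(fetch(), lambda b: time.sleep(0.005))
print(data_t > step_t)  # a data-bound loop shows data_time >> step_time
```

This at least tells me *that* the loader is the problem, but not *why* (decoding? augmentation? too few workers?), which is where I’m stuck.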