Optimizing GPU Utilization (Minimizing epoch time)

I’m trying to understand how to profile CUDA code, especially how to find bottlenecks in training loops. When I use something like `%lprun`, which profiles lines of code in Jupyter notebooks, I find that a large percentage of the time is attributed to the `torch.cuda.synchronize()` lines, which makes it difficult to identify which parts of the code are actually responsible for most of the time.

I’d also like to understand how GPU utilization impacts the total training time for one epoch. For example, if I have multiple dataloaders that are identical except for batch size, I might notice that the time to complete one epoch is shortest at a batch size of 32 and longer for batch sizes both smaller and larger than 32, yet GPU utilization for all three loaders during training is 100%.

Lastly, if GPU utilization is not at 100%, then it’s likely that the dataloader is the bottleneck, but since data loading happens in parallel with training, I don’t know how to pinpoint or fix the problem. Would anyone care to share how I might start looking into these things? Maybe tools, or workflows?
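One reason line profilers are misleading here: CUDA kernel launches are asynchronous, so all the queued GPU time gets billed to whichever line happens to synchronize. A common alternative is to bracket the work with CUDA events. Below is a minimal sketch of that idea (the helper name `time_op` and the matmul workload are just illustrations, not from the original post); it falls back to wall-clock timing when no GPU is present:

```python
import time
import torch

def time_op(fn, iters=10):
    """Time fn() on GPU with CUDA events, or on CPU with perf_counter.

    Wall-clock line profilers attribute queued kernel time to whatever
    line synchronizes, so on the GPU we measure with events instead.
    """
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()      # drain previously queued work
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()      # wait for the timed work only
        return start.elapsed_time(end) / iters   # milliseconds
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000.0 / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
ms = time_op(lambda: x @ x)
print(f"matmul: {ms:.3f} ms per call")
```

For a whole training loop, `torch.profiler.profile` (or the older autograd profiler) gives a per-operator breakdown and a Chrome trace, which answers the "which line is actually slow" question more directly than `%lprun`.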


That’s strange; epoch time should usually be shorter with a larger batch size. I assume you already moved all necessary tensors to the GPU before the start of an epoch. Could you provide a minimal example?

Yes, it’s odd. If I just use a normal 224x224 batch loader for ResNets and increase the batch size, I get the expected behavior of increasing efficiency with larger batches. But for my particular problem, which has a multi-task loss (and thus calls `next(dataloader_iterator_i)` multiple times per loop, once for each dataloader) and many image transformations, I’m seeing a slight slowdown, maybe about 4%, as batch size goes from 32 to 256. GPU utilization is still 100% to my knowledge, although I can only tell by sampling nvidia-smi views, so it could be dipping periodically. It would be great, though, to know how to inspect what’s going on under the hood with respect to the parallelism between loading and training.