I’m monitoring my GPU usage while training.
What I observe is that it fluctuates between 93% and 44%. Is this because, at some points, it waits for the CPU to pass data? Or is it because I do a pass over the validation set (which is considerably smaller than the training set) after training?
In any case, do these fluctuations mean I’m under-utilizing my GPU? If that’s the case, what am I doing wrong? For example, I tried increasing num_workers on my data loaders, but it didn’t have any effect on the usage pattern.
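For reference, this is roughly how I’m configuring the loader (the batch size and worker count here are just example values, not my exact settings):

```python
from torch.utils.data import DataLoader

# Roughly what I tried: more workers plus pinned memory.
train_loader = DataLoader(
    train_dataset,           # my existing Dataset
    batch_size=64,           # example value
    shuffle=True,
    num_workers=8,           # bumped this up from the default of 0
    pin_memory=True,         # faster host-to-GPU copies
    persistent_workers=True, # keep workers alive between epochs
)
```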
93% is excellent utilization, and I believe lower GPU utilization during validation is expected: you don’t compute gradients or make parameter updates, so the process is a lot more data-intensive relative to the compute the GPU has to do.
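Concretely, a typical validation pass only runs forward passes under `torch.no_grad()`. A minimal sketch (with `model`, `val_loader`, and `criterion` standing in for your own objects) would look like this, so per batch there is much less GPU work for the same amount of data loading:

```python
import torch

model.eval()
val_loss = 0.0
with torch.no_grad():                        # no gradients, no optimizer step
    for inputs, targets in val_loader:
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        val_loss += criterion(model(inputs), targets).item()
print(f"validation loss: {val_loss / len(val_loader):.4f}")
```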
As a first step, I’d suggest profiling your DataLoader against the training step itself to figure out where the bottleneck actually is.
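A rough timing sketch for that, assuming a standard PyTorch training loop (`model`, `loader`, `criterion`, and `optimizer` are placeholders for your own objects):

```python
import time
import torch

# Separate time spent waiting on the DataLoader from time spent in the
# forward/backward step on the GPU.
data_time, step_time = 0.0, 0.0
end = time.time()
for inputs, targets in loader:
    data_time += time.time() - end            # time blocked waiting for the next batch

    start = time.time()
    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                   # make sure GPU work is finished before timing
    step_time += time.time() - start

    end = time.time()

print(f"data loading: {data_time:.1f}s, train step: {step_time:.1f}s")
```

If `data_time` dominates, the GPU is starved by the input pipeline; if `step_time` dominates, the loader isn’t your problem.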
Thanks for the answer. So it makes sense for GPU usage to dip during validation for the reasons you specified. In that case, since training and validation are independent, is there a way to force the GPU to start training the next epoch while the validation of the previous epoch is still running?