I’m training a small transformer with PyTorch Lightning on 2 GPUs via Slurm. GPU utilization is quite low, and depending on the num_workers I set, each dataloader worker runs at at most (100/num_workers)%: when I set it to 4, I get 4 workers at 25% each. The `__getitem__` method of the underlying dataset takes ~2 ms, and all data comes from RAM. I’ve tested wrapping the dataset in a plain DataLoader locally, and there it runs as expected, with each worker going up to 100% and effectively parallelizing the work. But as soon as I run it via Lightning, with or without Slurm, the picture below unfolds.
Can anyone give some hints on what the problem is? I’m happy to answer clarifying questions or to share code if needed. Thanks!
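For reference, the standalone check described above can be sketched roughly like this (a minimal benchmark with a hypothetical dummy dataset that simulates the ~2 ms `__getitem__`; names and sizes are placeholders, not the actual code):

```python
import time
import torch
from torch.utils.data import Dataset, DataLoader

class DummyDataset(Dataset):
    """Stand-in for the real dataset: ~2 ms per item, data from RAM."""
    def __init__(self, n=200):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        time.sleep(0.002)  # simulate the ~2 ms __getitem__ cost
        return torch.randn(16)

def benchmark(num_workers):
    """Time one full pass over the dataset with the given worker count."""
    loader = DataLoader(DummyDataset(), batch_size=32, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start
```

Comparing `benchmark(0)` against `benchmark(4)` outside of Lightning should show a near-linear speedup if `__getitem__` really is the bottleneck.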
It sounds like you’re mostly concerned about your GPU utilization being low, but you’re debugging the CPU utilization of your dataloader. Have you confirmed that the low GPU utilization is actually caused by data starvation? Otherwise, it may be that the dataloader is already running as fast as it can, so adding more workers doesn’t help and simply drops the per-worker utilization proportionally.
Actually no. How can I do that?
You could try using the PyTorch profiler (PyTorch Profiler — PyTorch Tutorials 2.1.0+cu121 documentation) to look for one class of issue: if there are stretches of time where no kernels are running on the GPU, then you are probably CPU-bound. Data starvation can cause this, but lots of other CPU code can also be responsible (inside PyTorch operators, or in your own model code), so it’s important to find out which is the case before trying to optimize.
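A minimal sketch of what that might look like (the toy model and iteration count are placeholders; add `ProfilerActivity.CUDA` when running on a GPU):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-in for the real training step
model = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)  # also record GPU kernel times

with profile(activities=activities) as prof:
    for _ in range(10):
        y = model(x)

# Compare CPU time vs. GPU kernel time: large gaps with no GPU
# kernels running suggest a CPU-bound pipeline.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

You can also export a trace with `prof.export_chrome_trace("trace.json")` and inspect the timeline in `chrome://tracing` or Perfetto to see the gaps directly.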
It’s also possible that GPU kernels are always ‘running’ but are not utilizing the GPU hardware fully, perhaps due to a small batch size.