CPU and GPU usage low during training


I’m using PyTorch on an AWS Linux machine with 32 CPU cores and 1 GPU.

My data is stored on the machine’s SSD and is loaded in the Dataset during training. I use the DataLoader class to feed data to the model. I notice that neither the GPU nor the CPU is saturated. I use watch -n 0.5 nvidia-smi to monitor GPU usage and htop to monitor CPU usage. GPU utilization averages around 30%. All 32 CPUs are at 100% at the very beginning of training (maybe the first batch, only a few seconds), but then only 4 or 5 of them are used for the rest of the run.

If the GPU were the bottleneck, it should be near 100% all the time, and if the CPUs were the bottleneck, I would expect the same for all 32 of them.

How come CPU usage AND GPU usage are so low simultaneously? How can I speed up training?
I have tried mixed precision and other optimizations, but my question is mostly about understanding why CPU and GPU usage are both so low.

Here is my (train) DataLoader:

DataLoader(dataset=visits_dataset, batch_size=256, collate_fn=pad_collate,
           sampler=SubsetRandomSampler(train_indices), pin_memory=True,
           num_workers=32)

I would be very thankful for any help.

To get a clear answer you could profile the code, e.g. via Nsight Systems or the PyTorch profiler, and check the timeline to see which operations are actually executed and where the time goes.
Based on your description, I would guess that the CPU might be busy with some (single-core) work.
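As a starting point, here is a minimal sketch of wrapping a training loop in torch.profiler; the model, data, and optimizer below are hypothetical stand-ins for the ones in your script:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-ins for the real model and DataLoader from the question.
model = torch.nn.Linear(512, 10)
loader = [(torch.randn(256, 512), torch.randint(0, 10, (256,)))
          for _ in range(10)]
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Profile a few training iterations on the CPU.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

# Print the ops that dominate the timeline; single-core CPU work or
# data-loading overhead showing up here would explain low GPU utilization.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

On a CUDA machine you can add ProfilerActivity.CUDA to the activities list to capture kernel timings as well, and sort the table by CUDA time instead.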

Thank you very much for your answer, I will profile the code.