Low Volatile GPU-util

I am training a small model with roughly 256k parameters, and I set up the DataLoader with num_workers=16, pin_memory=True, and batch_size=128. However, the GPU-util is still very low, at about 0.007%. Could you give any advice on how to improve it?
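For reference, my loading setup looks roughly like this (the dataset below is just a random-tensor stand-in for illustration; my real dataset and training step differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset purely for illustration (random tensors).
data = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    data,
    batch_size=128,
    num_workers=16,   # worker processes for loading
    pin_memory=True,  # page-locked host memory enables async copies
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    # non_blocking=True only overlaps the copy when pin_memory=True
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward/backward pass here ...
```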

Is that batch size the maximum you can fit? Have you profiled or timed parts of the code to see where the bottleneck is?
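For example, a rough timing loop like the one below can show whether the GPU is sitting idle waiting on data. This is a sketch, not a full profiler run: `model`, `loss_fn`, and `optimizer` are placeholders for your own objects, and `torch.profiler` would give a more detailed breakdown.

```python
import time
import torch

device = torch.device("cuda")
data_time = 0.0     # time spent waiting on the DataLoader
compute_time = 0.0  # time spent in forward/backward/step

torch.cuda.synchronize()
t0 = time.perf_counter()
for x, y in loader:
    t1 = time.perf_counter()
    data_time += t1 - t0

    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()  # make the GPU work visible to the timer
    t0 = time.perf_counter()
    compute_time += t0 - t1

print(f"data wait: {data_time:.2f}s  compute: {compute_time:.2f}s")
```

If the data-wait total dominates, the input pipeline is the bottleneck; if compute dominates but utilization is still near zero, the model may simply be too small to keep the GPU busy at that batch size.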