GPU memory usage is normal, but GPU utilization is 0%

Since CUDA operations are executed asynchronously, you have to synchronize before starting and stopping the timer via torch.cuda.synchronize(); otherwise the timer only measures the kernel launch, not its execution, and the reported time will be misleadingly small.
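A minimal sketch of such a synchronized timer (the helper name and tensor sizes are illustrative; the guard lets it fall back to CPU when no GPU is present):

```python
import time
import torch

def timed(fn):
    # Drain any already-queued kernels so the timer starts from an idle device
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    # Wait for the kernels fn() launched, so the timer covers real GPU work
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
out, elapsed = timed(lambda: x @ x)
print(f"matmul took {elapsed:.4f}s")
```

Without the second synchronize, `time.perf_counter()` would return almost immediately because the matmul is still running on the device.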

If data loading is the bottleneck (the GPU sits idle waiting for the next batch, which matches the 0% utilization you are seeing), have a look at this post, which explains common pitfalls and some best practices.
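The two DataLoader settings that most often fix this are worker processes and pinned memory; a minimal sketch with a dummy dataset (the batch size and worker count are placeholder values to tune for your setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for your real one
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=2,   # >0 loads batches in background processes so the GPU is not starved
    pin_memory=torch.cuda.is_available(),  # pinned host memory speeds up CPU-to-GPU copies
)

for inputs, targets in loader:
    pass  # training step would go here
```

If utilization stays low with several workers, profile the transforms themselves; expensive per-sample preprocessing can saturate the CPU regardless of worker count.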