GPU memory is in normal use, but GPU-util is 0%

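A simple test would be to feed random tensors to the model instead of loading and processing the data from your SSD. If GPU utilization goes up with the synthetic inputs, the data pipeline is the bottleneck rather than the model.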
You could also profile the data loading time as shown in the ImageNet example. The measured loading time should approach zero if the workers are fast enough to create the next batch while the GPU is busy with the training step.
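
As a rough sketch of that kind of timer (not the exact code from the ImageNet example), you could split each iteration into "waiting for the next batch" and "running the step". The names `profile_loading`, `step_fn` and `max_batches` are made up for illustration; `loader` can be any iterable, so a `DataLoader` should work as well.

```python
import time


def profile_loading(loader, step_fn, max_batches=None):
    """Split wall-clock time into waiting-for-data and running the step.

    loader      -- any iterable that yields batches (hypothetical name)
    step_fn     -- callable that runs one training step on a batch
    max_batches -- optional cap on the number of batches to time
    """
    data_time = 0.0
    step_time = 0.0
    batches = 0

    end = time.perf_counter()
    for batch in loader:
        fetched = time.perf_counter()
        # Time the loop was blocked until the next batch became available.
        data_time += fetched - end

        step_fn(batch)

        end = time.perf_counter()
        # Time spent inside the training step itself.
        step_time += end - fetched

        batches += 1
        if max_batches is not None and batches >= max_batches:
            break

    total = data_time + step_time
    return {
        "batches": batches,
        "data_time": data_time,
        "step_time": step_time,
        "data_fraction": data_time / total if total > 0 else 0.0,
    }
```

If `data_fraction` stays close to zero, the workers keep up with the GPU. If it is large, the loop is mostly waiting on data, which would match a GPU-util of 0%. Note that CUDA kernels are launched asynchronously, so `step_fn` should end with something that synchronizes (for example `torch.cuda.synchronize()` or `loss.item()`), otherwise the step time is not attributed correctly. Passing a loader that just yields the same pre-allocated random tensors gives you the synthetic baseline from the first suggestion for comparison.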