Why does torch.cuda.empty_cache() make the GPU utilization near 0 and slow down the training time?

Your explanation is correct: once the cache is emptied, PyTorch has to re-allocate memory through synchronizing cudaMalloc calls, which stalls the GPU and slows down training. I would not recommend using it unless you really need to free the cache for some reason (e.g. another process needs the memory).
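For anyone curious what the caching allocator is actually holding on to, here is a minimal sketch (the model, sizes, and loop are made up for illustration) that prints allocated vs. reserved memory during a toy training loop and only releases the cache once at the end:

```python
import torch

def train_step(model, batch):
    # Dummy training step; a real loop would also zero grads and call an optimizer.
    loss = model(batch).sum()
    loss.backward()
    return loss.item()

if torch.cuda.is_available():
    device = torch.device("cuda")
    model = torch.nn.Linear(1024, 1024).to(device)

    for step in range(10):
        batch = torch.randn(64, 1024, device=device)
        train_step(model, batch)

        # allocated: memory currently occupied by live tensors.
        # reserved:  memory held by the caching allocator, including freed blocks
        #            kept around so future allocations avoid slow cudaMalloc calls.
        print(f"step {step}: "
              f"allocated={torch.cuda.memory_allocated() / 1e6:.1f} MB, "
              f"reserved={torch.cuda.memory_reserved() / 1e6:.1f} MB")

    # Returns the cached-but-unused blocks to the driver. Doing this every
    # iteration forces subsequent allocations back through cudaMalloc, which
    # is why utilization drops and training slows down.
    torch.cuda.empty_cache()
```

If you watch the printed numbers, `reserved` stays high even when `allocated` drops between iterations; that gap is the cache, and keeping it is exactly what makes allocations cheap during training.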