Loading data takes about 75% of the total time?

I am training ResNet18 on the ImageNet dataset, following the official example.
As recommended, I split each batch across 4 GPUs, with 128 samples per GPU (512 in total). Surprisingly, loading data takes about 75% (1.7s out of 2.2s) of the total iteration time, and the time per iteration varies hugely, e.g., 17s vs. 0.5s. It seems the GPUs are waiting for new data batches (GPU utilization drops to 0%), even though I use 16 workers to load data.
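For reference, here is roughly what my setup looks like (dataset path and the forward/backward step are simplified; the timing follows the data_time / batch_time pattern from the official example, where everything between the end of one iteration and the arrival of the next batch counts as data-loading time):

```python
import time
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Hypothetical path; standard ImageNet augmentation from the official example.
train_dataset = datasets.ImageFolder(
    '/path/to/imagenet/train',
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ]))

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=512, shuffle=True,
    num_workers=16, pin_memory=True)

end = time.time()
for i, (images, target) in enumerate(train_loader):
    data_time = time.time() - end           # ~1.7s on average for me
    images = images.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    # ... forward / backward / optimizer step ...
    batch_time = time.time() - end          # ~2.2s on average for me
    end = time.time()
```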

So, I am wondering: is there anything wrong with my settings? Is the batch size too large, or are there too many worker processes loading data?