Did you measure the data loading time throughout training, or only for the first iteration?
Note that the first step will spin up all workers, and each will load a complete batch, which might introduce some warmup time.
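To separate the warmup from the steady-state loading time, you could time each batch fetch on its own. The snippet below is only a minimal sketch: `time_batch_waits` and `summarize` are hypothetical helper names, and it assumes your loader is a plain Python iterable (a `DataLoader` would qualify). The timer starts before `iter()` is called, so any startup cost is counted toward the first batch.

```python
import time
from typing import Iterable, List


def time_batch_waits(loader: Iterable, max_batches: int = 50) -> List[float]:
    """Return the seconds spent waiting for each of the first batches."""
    waits = []
    # Start timing before iter() so startup cost lands in the first entry.
    start = time.perf_counter()
    it = iter(loader)
    for _ in range(max_batches):
        try:
            next(it)
        except StopIteration:
            break
        now = time.perf_counter()
        waits.append(now - start)
        start = now
    return waits


def summarize(waits: List[float]) -> dict:
    """Split the first (warmup) wait from the mean of the remaining waits."""
    if not waits:
        return {"first": None, "steady_mean": None}
    rest = waits[1:]
    steady_mean = sum(rest) / len(rest) if rest else None
    return {"first": waits[0], "steady_mean": steady_mean}
```

Calling `summarize(time_batch_waits(loader))` on your own loader should show whether only the first entry is slow or all of them are. Note that this loop does no training work, so it measures the raw loading speed rather than how well loading overlaps with your forward and backward passes.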
Also, where is your data stored? Is it on a local SSD or on a slower drive, such as a spinning hard disk?
Have a look at this post, which gives a good summary of potential bottlenecks.