If you are using multiple workers, each worker process will get its own copy of the Dataset, if I’m not mistaken.
The first iteration would include starting the workers and making these copies, as well as creating the first batch in each process, which might be slow.
However, the following iterations should be faster.
Are you consistently seeing a slowdown using num_workers>=1 compared to num_workers=0?
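To check this, you could time the first batch separately from the remaining ones for a few num_workers settings. Here is a minimal sketch, assuming a small in-memory dataset; `ToyDataset` and `time_loader` are made-up names for illustration and not part of your code or of PyTorch, so swap in your own Dataset:

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the real dataset: random tensors held in memory."""

    def __init__(self, n_samples=2048, n_features=128):
        self.data = torch.randn(n_samples, n_features)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def time_loader(num_workers, batch_size=64):
    """Return (seconds for the first batch, mean seconds per later batch)."""
    loader = DataLoader(ToyDataset(), batch_size=batch_size, num_workers=num_workers)

    start = time.perf_counter()
    it = iter(loader)  # worker processes are started here when num_workers > 0
    next(it)           # first batch: includes worker startup and the dataset copies
    first = time.perf_counter() - start

    start = time.perf_counter()
    n_rest = sum(1 for _ in it)  # drain the remaining batches
    rest = (time.perf_counter() - start) / max(n_rest, 1)
    return first, rest


if __name__ == "__main__":
    for workers in (0, 1, 4):
        first, rest = time_loader(workers)
        print(f"num_workers={workers}: first batch {first:.4f}s, "
              f"later batches {rest:.6f}s each")
```

If only the first batch is slow with num_workers>=1 and the later batches are comparable or faster, the overhead is most likely the worker startup and not the data loading itself.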
Generally, this post is really helpful when it comes to data loading bottlenecks, but your issue seems to be unrelated.
Are you storing the data on a local SSD or somewhere else?