Each worker can add an (initial) overhead by copying the dataset to its process, which could slow down the first iteration. Also, depending on your setup, you would have to find the sweet spot for the number of workers, as increasing it won't necessarily always result in a speedup, as described e.g. here.
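A quick way to check this on your setup is to time one epoch of pure data loading for a few `num_workers` values. This is a minimal sketch using a small dummy `TensorDataset` (the dataset, batch size, and worker counts are placeholders for your own):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for your real one
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

def time_epoch(num_workers):
    loader = DataLoader(dataset, batch_size=32, num_workers=num_workers)
    start = time.perf_counter()
    for data, target in loader:
        pass  # the training step would go here
    return time.perf_counter() - start

if __name__ == "__main__":  # guard needed when workers spawn subprocesses
    for nw in (0, 2, 4):
        print(f"num_workers={nw}: {time_epoch(nw):.4f}s")
```

Note that the first iteration with workers also pays the process startup cost, so timing a full epoch (or a warmup pass) gives a fairer comparison.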
To narrow down the slowdown, you could profile your code using e.g. torch.profiler.
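A minimal profiling sketch could look like this (the model and input are placeholders; on a GPU you could also pass `ProfilerActivity.CUDA`):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(8, 2)
inp = torch.randn(32, 8)

# Profile CPU ops over a few iterations
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        model(inp)

# Print the most expensive ops
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The table output shows which ops dominate the runtime, which should tell you whether the data loading or the actual model execution is the bottleneck.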