I have a pretty basic Dataset and DataLoader and a relatively tiny toy model. Everything works as expected.
First of all, training seems to be dominated by data loading. Given the small model I use, I think that is somewhat plausible. Further, I noticed that keeping the dataset on the GPU (it fits without problems) is faster than keeping it on the CPU and using pin_memory with multiple workers.
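Simplified, the two setups I am comparing look roughly like this (tensor shapes, batch size, and worker count are illustrative, not my actual values):

```python
# Minimal sketch of the two variants I am comparing (illustrative only).
# Variant A keeps X/y resident on the GPU; variant B keeps them on the CPU
# and relies on pin_memory plus multiple worker processes.
import torch
from torch.utils.data import TensorDataset, DataLoader

device = torch.device("cuda")

X = torch.randn(1_000_000, 32)          # toy feature tensor
y = torch.randint(0, 2, (1_000_000,))   # toy labels

# Variant A: data already on the GPU, single-process loading
ds_gpu = TensorDataset(X.to(device), y.to(device))
loader_gpu = DataLoader(ds_gpu, batch_size=1024, shuffle=True, num_workers=0)

# Variant B: data on the CPU, pinned memory, multiple workers
ds_cpu = TensorDataset(X, y)
loader_cpu = DataLoader(ds_cpu, batch_size=1024, shuffle=True,
                        num_workers=4, pin_memory=True)
```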
However, I noticed something I don’t quite understand:
Increasing the size of X_train significantly increases the time per batch, e.g., going from 1M items in the training set to 10M means each batch takes about 3 times as long (and thus each epoch takes about 30 times as long, where I would have expected only a factor of 10).
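For context, this is roughly how I measure the per-batch time (simplified; `loader` and `model` stand in for my actual objects, and the placeholder loss is only there to drive a backward pass):

```python
# Crude per-batch timing sketch; synchronize so asynchronous CUDA
# kernel launches do not distort the measurement.
import time
import torch

def time_batches(loader, model, n_batches=200, device="cuda"):
    model.to(device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    n_done = 0
    for xb, yb in loader:
        # No-op if the dataset already lives on the GPU
        xb = xb.to(device, non_blocking=True)
        yb = yb.to(device, non_blocking=True)
        out = model(xb)
        loss = out.float().mean()   # placeholder loss, just to run backward
        loss.backward()
        model.zero_grad(set_to_none=True)
        n_done += 1
        if n_done == n_batches:
            break
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_done
```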
I would expect some overhead, because shuffling takes longer and more cache misses are to be expected, but such a large factor surprises me.
Is there an explanation? What would be a good way for me to approach and debug/profile this? Is this maybe indicative of something I am doing wrong?