Understanding DataLoader performance

I have a pretty basic Dataset and DataLoader and a relatively tiny toy model. Everything works as expected.

First of all, training seems to be dominated by data loading. Given the small model I use, I think that is plausible. Further, I noticed that keeping the dataset on the GPU (it fits without problems) is faster than keeping it on the CPU and using pin_memory with multiple workers.
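For reference, the two setups I am comparing look roughly like this (the shapes, batch size, and worker count are placeholders, not my real configuration):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder tensors standing in for my real training data.
X_train = torch.randn(1_000_000, 16)
y_train = torch.randn(1_000_000, 1)

# Variant A: dataset lives entirely on the GPU; no workers and no pinning,
# since batches are assembled from GPU memory with no host-to-device copy.
gpu_ds = TensorDataset(X_train.cuda(), y_train.cuda())
gpu_loader = DataLoader(gpu_ds, batch_size=1024, shuffle=True, num_workers=0)

# Variant B: dataset stays on the CPU; pinned memory plus workers, and every
# batch is copied host-to-device during the training loop.
cpu_ds = TensorDataset(X_train, y_train)
cpu_loader = DataLoader(cpu_ds, batch_size=1024, shuffle=True,
                        num_workers=4, pin_memory=True)
```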

However, I noticed something I don’t quite understand:
Increasing the size of X_train increases the time per micro-batch significantly. For example, going from 1m items in train to 10m means I need about 3 times as long per batch (and thus 30 times as long per epoch, whereas I would have expected only a factor of 10).

I would expect some overhead, because shuffling takes longer and more cache misses are to be expected, but such a large factor surprises me.

Is there an explanation? What would be a good way for me to approach and debug/profile this? Is this maybe indicative of something I am doing wrong?
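For concreteness, the kind of crude per-batch timing I have in mind (and that gave me the "dominated by data loading" impression) is something like the following; model, loss_fn, and opt are placeholders for my actual setup:

```python
import time
import torch

def split_batch_time(loader, model, loss_fn, opt, device, n_batches=100):
    """Crudely split wall-clock time into data loading vs. compute per batch."""
    data_t, compute_t = 0.0, 0.0
    it = iter(loader)
    for _ in range(n_batches):
        t0 = time.perf_counter()
        xb, yb = next(it)
        xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
        if device.type == "cuda":
            torch.cuda.synchronize()
        t1 = time.perf_counter()
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        if device.type == "cuda":
            torch.cuda.synchronize()
        t2 = time.perf_counter()
        data_t += t1 - t0
        compute_t += t2 - t1
    print(f"data: {data_t:.3f}s, compute: {compute_t:.3f}s over {n_batches} batches")
```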

I figured this out myself. My targets were stored in a SparseTensor, and the recommendations from here: Dataloader loads data very slow on sparse tensor - #4 by drj3122 helped me resolve the problem.

Essentially, grabbing a batch from the SparseTensor scales with the size of that tensor, so more training data means slower micro-batches.
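Roughly, the issue and one way around it look like the sketch below; the shapes are made up, and densifying the targets once up front is just one option, assuming the dense targets fit in memory (mine do):

```python
import torch

n_items, n_cols, nnz = 1_000_000, 10, 200_000
idx = torch.stack([torch.randint(0, n_items, (nnz,)),
                   torch.randint(0, n_cols, (nnz,))])
y_sparse = torch.sparse_coo_tensor(idx, torch.randn(nnz),
                                   (n_items, n_cols)).coalesce()

batch_idx = torch.randint(0, n_items, (256,))

# Slow: selecting a batch of rows from the sparse COO tensor scales with the
# number of stored elements, i.e. with the overall size of the training set.
slow_batch = y_sparse.index_select(0, batch_idx).to_dense()

# Faster in my case: densify the targets once up front and use plain dense
# indexing per batch, which only scales with the batch size.
y_dense = y_sparse.to_dense()
fast_batch = y_dense[batch_idx]
```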