Why did data loading performance degrade between PyTorch 1.12 and 2.6.0?

I have a test exercise of AlexNet on CIFAR-10 with the standard torch DataLoader and dataset. It seemed to optimize at 12-14 sec per epoch by loading the dataset onto a RAM disk and using pin_memory=True and num_workers=8. nvidia-smi showed very little GPU downtime; utilization was pretty much pegged at 90-100% while running.
That was Python 3.11 with the NVIDIA 535 drivers on an RTX 3090 Ti, Ubuntu 22.04, PyTorch 1.12.1.post201, and torchvision 0.13.0a0+8069656.
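For context, the setup was essentially the stock torchvision/DataLoader pattern; a minimal sketch is below. The RAM-disk path, batch size, and transform are placeholders, not my exact values.

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# CIFAR-10 copied onto a RAM disk beforehand (path is illustrative)
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(
    root="/mnt/ramdisk/cifar10", train=True, download=False, transform=transform
)
train_loader = DataLoader(
    train_set,
    batch_size=256,   # illustrative
    shuffle=True,
    num_workers=8,    # the setting that gave 12-14 sec/epoch under 1.12
    pin_memory=True,
)
```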

Under Python 3.12 and the NVIDIA 560 drivers, with PyTorch 2.6.0+cu124 and torchvision 0.21.0+cu124 on the same hardware and OS, I ran the same program and it takes 49 sec per epoch. That's more than a 3x increase! nvidia-smi shows the GPU mostly idling.
I am not sure whether the cause is the change in Python from 3.11 to 3.12, the NVIDIA driver, or something else, but I cannot seem to get back to 12-14 sec per epoch.

I lowered num_workers from 8 to 2 and got down to 27 sec per epoch, but that is still a 2x increase from 14 sec. The GPU is still mostly idling.

I do notice a difference in how the data are loaded. The first iteration (upon entering the loop) of
`for i, (images, labels) in enumerate(train_loader):`
takes 18 seconds with num_workers=8 (0.02 sec for every subsequent iteration) and 5 seconds with num_workers=2 (0.01 sec for the rest). The same happens in the validation loop.
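The numbers above come from simple wall-clock timing around the batch fetch, roughly like this (the training step itself is elided):

```python
import time

fetch_start = time.time()
for i, (images, labels) in enumerate(train_loader):
    # First iteration includes worker startup + initial reads; the rest are near-instant.
    print(f"batch {i}: {time.time() - fetch_start:.2f} sec to fetch")
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    # ... forward / backward / optimizer step ...
    fetch_start = time.time()
```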

The only other thing I can think of is the CPU utilization: most of the 32 threads sit at 5-7% usage for about 18 seconds at a time, corresponding roughly to that first iteration of each loop. During that period the cache memory also grows, then drops back down after the epoch is trained.

Might anyone know what changed to cause such a drastic performance reduction? And, perhaps, how to get that performance back?

This performance degradation may be due to a change in the way the page cache is handled between torch versions and the OS. Because the data are already stored on a RAM disk, copying them into the page cache causes needless delay.
What may be the cache issue is shown in the attached picture: all of the CPUs are periodically busy during the training periods in which the GPU is idling.
I have been unable to find a way to have DataLoader bypass the page cache. Can anyone help?
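To make concrete what I mean by bypassing it: since CIFAR-10 is small, one workaround I have been considering is decoding the whole set into plain tensors once and handing those to the loader, so that after the first pass no file reads (and no page cache) are involved. A rough, untested sketch, reusing the train_set defined earlier:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Decode every image once; the full training set is 50,000 32x32 images,
# well under 1 GB as float32 tensors.
images = torch.stack([img for img, _ in train_set])
labels = torch.tensor([label for _, label in train_set])

in_memory_train = TensorDataset(images, labels)
train_loader = DataLoader(
    in_memory_train,
    batch_size=256,
    shuffle=True,
    num_workers=0,     # everything is already in process memory
    pin_memory=True,
)
```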

Since the program settings were identical between the two versions, my guess is that the default torch DataLoader's interaction with the OS and its cache system changed between the two versions and I simply had not realized it.

It seems that the DataLoader is now more dependent on the page cache, to the point that keeping the dataset on a RAM disk is redundant with the current version.

Playing around with pin_memory, prefetch_factor, persistent_workers, and num_workers let me get most of the performance back.
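I did not land on one definitive combination, so the values below are only an example of the knobs involved, not the exact settings I ended up with:

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_set,                # the torchvision CIFAR-10 dataset from above
    batch_size=256,
    shuffle=True,
    num_workers=2,            # fewer workers shortened the first-batch stall for me
    pin_memory=True,
    persistent_workers=True,  # keep workers alive across epochs instead of re-spawning them
    prefetch_factor=4,        # number of batches each worker loads ahead
)
```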

NOTE: I still get the caching behavior shown in the performance chart above, but only for the first epoch. Subsequent epochs don't show it, so I am guessing they are reading from the cache.

I could see an advantage in being able to skip the cache and copy directly to the GPU when the data are already in RAM, but at this point those are likely edge cases.
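For completeness, here is roughly what I mean: with a dataset this small, the cache could be skipped entirely by parking the decoded tensors on the GPU once and batching them manually. A sketch only (no augmentation), assuming the images/labels tensors from the earlier in-memory sketch:

```python
import torch

# One-time copy of the pre-decoded tensors to the GPU.
gpu_images = images.cuda()
gpu_labels = labels.cuda()

def gpu_batches(batch_size=256):
    """Yield shuffled mini-batches straight from GPU memory; no DataLoader, no page cache."""
    perm = torch.randperm(gpu_images.size(0), device=gpu_images.device)
    for start in range(0, perm.numel(), batch_size):
        idx = perm[start:start + batch_size]
        yield gpu_images[idx], gpu_labels[idx]

for batch_images, batch_labels in gpu_batches():
    pass  # ... forward / backward / optimizer step ...
```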