Dataloaders and performance

I’m curious to hear whether other people have managed to get satisfactory performance out of the dataloaders, especially for small networks.

Right now I’m testing the dataloader on CIFAR10 with an autoencoder of only 200k parameters. For this test I have every image saved as an individual file on disk, and I can’t find any way of getting good performance out of this setup, even though it seems to be exactly how the dataloader was meant to be used.
I have tried pinning memory and increasing the number of workers, yet nothing gives anywhere near satisfactory performance.
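For reference, this is roughly the kind of setup I’m timing. The paths are placeholders, and ImageFolder is just a stand-in for my actual per-image-file dataset:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Stand-in for my per-file dataset: every CIFAR10 image stored as its own
# file on disk, read and transformed on each __getitem__ call.
dataset = datasets.ImageFolder(
    root="cifar10_images/train",     # placeholder path
    transform=transforms.ToTensor(),
)

# Things I have experimented with: more workers and pinned memory.
loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,    # tried several values
    pin_memory=True,  # tried both True and False
)
```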

The output below shows that the dataloader takes up almost all of the time:

iteration 0, running loss = 1.252, dataloader = 43.533, rest = 1.09625
iteration 1, running loss = 2.524, dataloader = 8.096, rest = 0.01700
iteration 2, running loss = 3.898, dataloader = 0.000, rest = 0.01100
iteration 3, running loss = 5.249, dataloader = 0.000, rest = 0.01200
iteration 4, running loss = 6.518, dataloader = 0.000, rest = 0.01000
iteration 5, running loss = 7.626, dataloader = 0.000, rest = 0.01000
iteration 6, running loss = 8.763, dataloader = 0.000, rest = 0.01100
iteration 7, running loss = 10.082, dataloader = 0.000, rest = 0.01000
iteration 8, running loss = 11.396, dataloader = 0.001, rest = 0.01200
iteration 9, running loss = 12.645, dataloader = 0.000, rest = 0.01100
iteration 10, running loss = 13.782, dataloader = 21.625, rest = 0.02901
iteration 11, running loss = 14.988, dataloader = 8.075, rest = 0.01200
iteration 12, running loss = 16.227, dataloader = 0.000, rest = 0.01200
iteration 13, running loss = 17.452, dataloader = 0.001, rest = 0.01100
iteration 14, running loss = 18.627, dataloader = 0.000, rest = 0.01200
iteration 15, running loss = 19.708, dataloader = 0.001, rest = 0.01200
iteration 16, running loss = 20.866, dataloader = 0.001, rest = 0.01100
iteration 17, running loss = 21.927, dataloader = 0.000, rest = 0.01400
iteration 18, running loss = 23.017, dataloader = 0.000, rest = 0.01200
iteration 19, running loss = 24.100, dataloader = 0.000, rest = 0.01200
iteration 20, running loss = 25.212, dataloader = 24.793, rest = 0.02601
iteration 21, running loss = 26.250, dataloader = 3.453, rest = 0.01100
iteration 22, running loss = 27.317, dataloader = 0.000, rest = 0.01000
iteration 23, running loss = 28.383, dataloader = 0.001, rest = 0.01000
iteration 24, running loss = 29.510, dataloader = 0.000, rest = 0.01100
iteration 25, running loss = 30.518, dataloader = 0.000, rest = 0.01000
iteration 26, running loss = 31.501, dataloader = 0.000, rest = 0.01100
iteration 27, running loss = 32.520, dataloader = 0.000, rest = 0.01100
iteration 28, running loss = 33.505, dataloader = 0.000, rest = 0.01000
iteration 29, running loss = 34.444, dataloader = 0.000, rest = 0.01100
iteration 30, running loss = 35.282, dataloader = 28.226, rest = 0.02601
iteration 31, running loss = 36.117, dataloader = 1.934, rest = 0.01000
iteration 32, running loss = 37.011, dataloader = 0.000, rest = 0.01100
iteration 33, running loss = 37.888, dataloader = 0.000, rest = 0.01000
iteration 34, running loss = 38.872, dataloader = 0.001, rest = 0.01000
iteration 35, running loss = 39.714, dataloader = 0.001, rest = 0.01000
iteration 36, running loss = 40.574, dataloader = 0.000, rest = 0.01000
iteration 37, running loss = 41.512, dataloader = 0.000, rest = 0.01000
iteration 38, running loss = 42.433, dataloader = 0.000, rest = 0.01000
iteration 39, running loss = 43.300, dataloader = 0.001, rest = 0.01000
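For context, the numbers above come from a loop roughly like the following. The names are illustrative (the model is a tiny stand-in autoencoder, and `loader` is the DataLoader from the snippet above), not my exact code:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(3 * 32 * 32, 64),
                      nn.Linear(64, 3 * 32 * 32)).to(device)  # stand-in autoencoder
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

running_loss = 0.0
end = time.time()
for i, (images, _) in enumerate(loader):
    dataloader_time = time.time() - end        # time spent waiting for the next batch

    start = time.time()
    images = images.to(device)
    out = model(images)
    loss = criterion(out, images.view(images.size(0), -1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
    rest_time = time.time() - start            # forward/backward/step time

    print(f"iteration {i}, running loss = {running_loss:.3f}, "
          f"dataloader = {dataloader_time:.3f}, rest = {rest_time:.5f}")
    end = time.time()
```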

Has anyone managed to get decent performance with a setup like this?

My own guess at a solution is to load the entire CIFAR dataset into memory at the start, since it is small enough for that, and then just transform the images from memory. I’m just not sure how to do this with the current dataloader.
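Something along these lines is what I have in mind (a rough sketch with made-up names; the array would be filled once at startup):

```python
from torch.utils.data import Dataset

class InMemoryCIFAR10(Dataset):
    """Hypothetical dataset that keeps every image in RAM and only applies
    the transform in __getitem__, with no per-batch disk reads."""

    def __init__(self, images, transform=None):
        # 'images' is a (N, 32, 32, 3) uint8 numpy array loaded from disk once
        self.images = images
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if self.transform is not None:
            img = self.transform(img)  # e.g. torchvision.transforms.ToTensor()
        return img
```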

If you use the CIFAR10 dataset from torchvision, the data will be loaded into memory as you suggested (line of code).
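A minimal sketch of that usage (adjust the root path, transform, and loader settings to your setup):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CIFAR10 is downloaded once and then held in memory as a single array,
# so __getitem__ only applies the transform instead of reading a file from disk.
train_set = datasets.CIFAR10(
    root="./data",                   # placeholder path
    train=True,
    download=True,
    transform=transforms.ToTensor(),
)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
```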

I’m not sure which disk you are using, but have a look at this post for some information on potential bottlenecks.

The problem was that I was using the approach taken in mean_teacher, where every image is saved as an individual file and loaded from disk each time the dataloader asks for it. That is of course horribly inefficient for small networks. The problem was solved by switching to torchvision as suggested.