Data loader disk and RAM access

Hello all,

Consider an MNIST dataloader with batch size 128:
train_loader = data.DataLoader(datasets.MNIST('./data', train=True, transform=transforms.ToTensor()), batch_size=128, shuffle=True)

After a batch of 128 images has been processed during training, does the data loader have to go to the disk every time to fetch the next batch of 128 images into RAM?

In case it has to go to the disk every time, how can we fetch it all at once (as MNIST is a small dataset where most of it could fit directly into RAM) and keep it in the RAM until all training iterations are complete – assuming that the training still needs batch size 128?

Thanks and Regards,

The data and target tensors in MNIST are loaded directly into RAM, since they are quite small, as seen in these lines of code.
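
To illustrate the in-memory case without downloading anything, here is a minimal sketch that uses random tensors shaped like MNIST as a stand-in. The random data and the variable names are assumptions for illustration, not the torchvision implementation. Once the tensors are in RAM, building a batch is just an indexing operation and involves no disk access.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST (assumption): 1,000 grayscale 28x28 images.
images = torch.rand(1000, 1, 28, 28)
labels = torch.randint(0, 10, (1000,))

# Both tensors already live in RAM; the dataset only indexes into them.
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=128, shuffle=True)

# Iterate once over the loader and record the shape of each image batch.
batch_shapes = [tuple(x.shape) for x, _ in loader]
num_batches = len(batch_shapes)
```

With 1,000 samples and a batch size of 128, the loader yields seven full batches and one smaller final batch.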

Thank you very much for the answer and the code reference.

Just to follow up, how does the dataloader manage the storage of data and targets in RAM when the datasets become bigger, e.g. from CIFAR-100 (medium scale) to ImageNet (large scale)? I mean, what logic does the dataloader use to figure out how much of the data to keep in RAM, considering the available memory and batch size? I sometimes need to use custom datasets (some small scale, some large), so a conceptual clarification and a reference to the relevant code would help me create an efficient dataloader for them based on their size.

It depends on the implementation of the corresponding dataset, which you can check in the source code (as done for MNIST using my link).
E.g. the commonly used ImageFolder dataset uses lazy loading to save memory, while the CIFAR datasets load the data and targets into RAM, as seen here.

Custom Dataset implementations can be written in either way, since you are defining how the data is loaded.
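
As a rough sketch of the two options, the example below defines one dataset that reads everything into RAM in `__init__` and one that keeps only file paths and reads each sample on access. The class names `EagerDataset` and `LazyDataset`, the `.pt` file layout, and the fake samples are all hypothetical and only serve to show the pattern.

```python
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Dataset


class EagerDataset(Dataset):
    """Hypothetical example: reads every sample file into RAM up front."""

    def __init__(self, paths):
        # All disk reads happen once, here.
        self.samples = [torch.load(p) for p in paths]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]  # served from memory


class LazyDataset(Dataset):
    """Hypothetical example: keeps only paths and reads a sample on access."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return torch.load(self.paths[idx])  # one disk read per access


# Write a few small tensors to a temporary directory as fake samples.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(10):
    path = os.path.join(tmp_dir, f"sample_{i}.pt")
    torch.save(torch.full((3,), float(i)), path)
    paths.append(path)

# The DataLoader is used the same way in both cases.
eager_batches = list(DataLoader(EagerDataset(paths), batch_size=4))
lazy_batches = list(DataLoader(LazyDataset(paths), batch_size=4))
```

Both datasets return identical batches. The difference is only where the cost is paid: the eager version uses more RAM but avoids disk reads during training, while the lazy version keeps memory use low and reads from disk on every access.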

Thanks a lot for your fast responses. That was really helpful. Cheers!