Where does DataLoader load the data?

I am new to PyTorch and have this question.

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

I am curious about how the loader loads the dataset. Does it load everything from disk into RAM, shuffle it, and then extract batches from RAM to the GPU during training? Or does it shuffle the dataset in memory, store it back to disk, and extract batches from there?

You can define the data loading procedure in your Dataset:

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, X, y):
        # X and y can be anything indexable: tensors, lists of paths, etc.
        self.data = X
        self.target = y

    def __getitem__(self, index):
        # Called by the DataLoader for each (possibly shuffled) index.
        x = self.data[index]
        y = self.target[index]
        return x, y

    def __len__(self):
        return len(self.data)

You could pass image paths for X and the corresponding targets for y, or already-loaded tensors if you have enough RAM.
In the simplest usage, the DataLoader draws indices and calls __getitem__ on your Dataset.
Depending on how the data is stored, it will be loaded from disk or just fetched from RAM.
Shuffling is basically done by shuffling the indices, so your data won't be loaded beforehand.
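To make the mechanism concrete, here is a pure-Python sketch (no torch required) of roughly what the DataLoader does with shuffle=True: shuffle the index list, then call __getitem__ one index at a time and group the results into batches. The function name iterate_like_dataloader is illustrative, not a PyTorch internal.

```python
import random

class MyDataset:
    """Same interface as the Dataset above, minus the torch base class."""
    def __init__(self, X, y):
        self.data = X
        self.target = y

    def __getitem__(self, index):
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)

def iterate_like_dataloader(dataset, batch_size, shuffle):
    # Only the indices are shuffled; samples are fetched one by one,
    # so nothing is loaded before its index comes up.
    indices = list(range(len(dataset)))
    if shuffle:
        random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        yield [dataset[i] for i in batch_indices]

ds = MyDataset(X=list(range(10)), y=[i * 2 for i in range(10)])
batches = list(iterate_like_dataloader(ds, batch_size=4, shuffle=True))
```

Every sample still appears exactly once per epoch; only the order changes.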

Note that the DataLoader has some more arguments, such as sampler, collate_fn, etc.
My example is just a simple use case.
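As an example of what collate_fn controls: it receives the list of individual samples for a batch and decides how to combine them. A common use is padding variable-length sequences. The sketch below is pure Python (pad_collate is an illustrative name, not a PyTorch API); with a real DataLoader you would pass it as collate_fn=pad_collate.

```python
def pad_collate(batch):
    # batch is a list of (sequence, label) samples of varying length.
    # Pad each sequence to the length of the longest one in the batch,
    # and also return the original lengths.
    xs, ys = zip(*batch)
    max_len = max(len(x) for x in xs)
    padded = [list(x) + [0] * (max_len - len(x)) for x in xs]
    lengths = [len(x) for x in xs]
    return padded, list(ys), lengths

batch = [([1, 2, 3], 0), ([4], 1), ([5, 6], 0)]
padded, ys, lengths = pad_collate(batch)
```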


Thank you for the reply. Let's say I use the ImageVision dataset that comes preloaded with PyTorch. Is it loaded into RAM and then shuffled via indices? Is my understanding correct?

I cannot find any ImageVision dataset. Could you post a link to it?
If it’s loaded with torch.load, it will be stored in RAM. The indices will be shuffled, if specified, and used to fetch the next data samples. Your understanding is correct.

Sorry, that was a typo. I meant a torchvision dataset from PyTorch.
Thanks.

Ah, OK. So yes, some datasets will be loaded directly into RAM (e.g. MNIST), while others will use an ImageFolder (e.g. ImageNet).
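The two patterns can be sketched without torchvision: an eager dataset reads everything into memory in __init__ (the MNIST style, where the whole dataset sits in RAM as one tensor), while a lazy one stores only file paths and reads from disk in __getitem__ (the ImageFolder style). Plain text files stand in for images here; the class names are illustrative, not torchvision APIs.

```python
import tempfile
from pathlib import Path

class EagerDataset:
    """MNIST-style: everything is read into RAM up front."""
    def __init__(self, paths):
        self.samples = [Path(p).read_text() for p in paths]

    def __getitem__(self, index):
        return self.samples[index]  # already in RAM, no disk access

    def __len__(self):
        return len(self.samples)

class LazyDataset:
    """ImageFolder-style: only paths are kept; data is read on access."""
    def __init__(self, paths):
        self.paths = paths

    def __getitem__(self, index):
        return Path(self.paths[index]).read_text()  # disk read happens here

    def __len__(self):
        return len(self.paths)

# Create a few small files to act as the "dataset" on disk.
tmpdir = Path(tempfile.mkdtemp())
paths = []
for i in range(3):
    p = tmpdir / f"sample_{i}.txt"
    p.write_text(f"sample {i}")
    paths.append(p)

eager = EagerDataset(paths)
lazy = LazyDataset(paths)
```

Both return the same samples; they differ only in when the disk is touched, which is exactly the RAM-vs-disk trade-off discussed above.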