I am new to PyTorch and have a question.
```python
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```
I am curious about how the loader loads the dataset. Does it load everything from disk into RAM, shuffle it, and then move batches from RAM to the GPU during training? Or does it shuffle the dataset in memory, store it back to disk, and extract batches from there?
You can define the data loading procedure in your `Dataset`:

```python
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.data = X
        self.target = y

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return x, y

    def __len__(self):
        return len(self.data)
```
You could pass image paths for `X` and the corresponding targets for `y`, or already-loaded tensors if you have enough RAM.
In the simplest usage, the `DataLoader` samples indices and calls `__getitem__` on your `Dataset` for each one.
Depending on how the data is stored, it will be loaded from disk or just fetched from RAM. Shuffling is basically done by shuffling the indices, so your data won't be loaded beforehand.
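To make the index-shuffling mechanism concrete, here is a simplified pure-Python sketch (not PyTorch's actual implementation, and `LazyDataset`/`simple_loader` are made-up names): only the list of indices is permuted, and `__getitem__` is called lazily per sample, so nothing is copied around or written back to disk.

```python
import random

class LazyDataset:
    """Minimal stand-in for a map-style Dataset: the data stays where it is."""
    def __init__(self, data, target):
        self.data = data
        self.target = target

    def __getitem__(self, index):
        # Called only when a batch actually needs this sample.
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)

def simple_loader(dataset, batch_size, shuffle):
    """Sketch of what a DataLoader does: shuffle indices, not the data."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.shuffle(indices)                     # only the index list moves
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield [dataset[i] for i in batch_idx]       # samples fetched on demand

dataset = LazyDataset(list(range(10)), [i * 10 for i in range(10)])
batches = list(simple_loader(dataset, batch_size=4, shuffle=True))
```

Every sample still appears exactly once per epoch; only the order of the indices changes.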
Note that the `DataLoader` has some more arguments, e.g. `num_workers` and `pin_memory`; my example is just a simple use case.
Thank you for the reply. Let's say I use an ImageVision dataset that comes preloaded with PyTorch. Is it loaded into RAM and then shuffled using indices? Is my understanding correct?
I cannot find any ImageVision dataset. Could you post a link to it?
If it's loaded with `torch.load`, it will be stored in RAM. The indices will be shuffled, if specified, and used to get the next data samples. Your understanding is correct.
Sorry, that was a typo. I meant the torchvision datasets from PyTorch.
Ah OK. So, yes, some datasets will be directly loaded into RAM (e.g. MNIST), while others will use an `ImageFolder` (e.g. ImageNet).
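The difference can be sketched without torchvision: an eager dataset (MNIST-style) decodes everything into memory once at construction time, while a lazy one (`ImageFolder`-style) only keeps the file paths and decodes in `__getitem__`. The `FAKE_DISK` dict and `decode` helper below are hypothetical stand-ins for files on disk and image decoding.

```python
# Hypothetical stand-in for files on disk: path -> raw bytes.
FAKE_DISK = {f"img_{i}.png": bytes([i]) for i in range(5)}

def decode(raw):
    # Hypothetical decoder standing in for PIL/torchvision image loading.
    return list(raw)

class EagerDataset:
    """MNIST-style: everything is read and decoded once, up front."""
    def __init__(self, paths):
        self.data = [decode(FAKE_DISK[p]) for p in paths]  # all in RAM now

    def __getitem__(self, index):
        return self.data[index]                            # pure RAM lookup

class LazyFolderDataset:
    """ImageFolder-style: only the paths are kept in memory."""
    def __init__(self, paths):
        self.paths = paths                                 # nothing decoded yet

    def __getitem__(self, index):
        return decode(FAKE_DISK[self.paths[index]])        # read per access

paths = sorted(FAKE_DISK)
eager = EagerDataset(paths)
lazy = LazyFolderDataset(paths)
```

Both return identical samples; they only differ in when the disk read happens, which is why MNIST fits in RAM while ImageNet is read sample by sample.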