I am new to PyTorch and have a question.
```python
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```
I am curious about how the loader loads the dataset. Does it load everything from disk into RAM, shuffle it, and then move batches from RAM to the GPU during training? Or does it shuffle the dataset in memory, store it back to disk, and extract batches from there?
You can define the data loading procedure in your `Dataset`:

```python
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.data = X
        self.target = y

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return x, y

    def __len__(self):
        return len(self.data)
```
You could pass image paths for `X` and the corresponding targets for `y`, or already-loaded tensors if you have enough RAM.
In the simplest usage, the `DataLoader` samples indices and calls `__getitem__` on your `Dataset` for each one.
Depending on how the data is stored, it will be loaded from disk or just fetched from RAM. Shuffling is basically done by shuffling the indices, so your data won't be loaded beforehand.
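To make the index-shuffling mechanism concrete, here is a simplified pure-Python sketch (not PyTorch's actual implementation, and `LazyDataset`/`simple_loader` are made-up names): only the list of indices is permuted, and `__getitem__` is called lazily per sample, so nothing is copied around or written back to disk.

```python
import random

class LazyDataset:
    """Minimal stand-in for a map-style Dataset: the data stays where it is."""
    def __init__(self, data, target):
        self.data = data
        self.target = target

    def __getitem__(self, index):
        # Called only when a batch actually needs this sample.
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)

def simple_loader(dataset, batch_size, shuffle):
    """Sketch of what a DataLoader does: shuffle indices, not the data."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.shuffle(indices)                     # only the index list moves
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield [dataset[i] for i in batch_idx]       # samples fetched on demand

dataset = LazyDataset(list(range(10)), [i * 10 for i in range(10)])
batches = list(simple_loader(dataset, batch_size=4, shuffle=True))
```

Every sample still appears exactly once per epoch; only the order of the indices changes.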
Note that the `DataLoader` has some more arguments, e.g. `num_workers` and `pin_memory`; my example is just a simple use case.
Thank you for the reply. Let's say I use an ImageVision dataset that comes preloaded with PyTorch. Is it loaded into RAM and then shuffled using indices? Is my understanding correct?
I cannot find any ImageVision dataset. Could you post a link to it?
If it's loaded with `torch.load`, it will be stored in RAM. The indices will be shuffled, if specified, and used to get the next data samples. Your understanding is correct.
Sorry, that was a typo. I meant the torchvision datasets from PyTorch.
Ah OK. So, yes, some datasets will be directly loaded into RAM (e.g. MNIST), while others will use an `ImageFolder` (e.g. ImageNet).
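The difference can be sketched without torchvision: an eager dataset (MNIST-style) decodes everything into memory once at construction time, while a lazy one (`ImageFolder`-style) only keeps the file paths and decodes in `__getitem__`. The `FAKE_DISK` dict and `decode` helper below are hypothetical stand-ins for files on disk and image decoding.

```python
# Hypothetical stand-in for files on disk: path -> raw bytes.
FAKE_DISK = {f"img_{i}.png": bytes([i]) for i in range(5)}

def decode(raw):
    # Hypothetical decoder standing in for PIL/torchvision image loading.
    return list(raw)

class EagerDataset:
    """MNIST-style: everything is read and decoded once, up front."""
    def __init__(self, paths):
        self.data = [decode(FAKE_DISK[p]) for p in paths]  # all in RAM now

    def __getitem__(self, index):
        return self.data[index]                            # pure RAM lookup

class LazyFolderDataset:
    """ImageFolder-style: only the paths are kept in memory."""
    def __init__(self, paths):
        self.paths = paths                                 # nothing decoded yet

    def __getitem__(self, index):
        return decode(FAKE_DISK[self.paths[index]])        # read per access

paths = sorted(FAKE_DISK)
eager = EagerDataset(paths)
lazy = LazyFolderDataset(paths)
```

Both return identical samples; they only differ in when the disk read happens, which is why MNIST fits in RAM while ImageNet is read sample by sample.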