Shuffling of the dataset

Is the dataset shuffled by randomizing the index passed to the __getitem__ method, or is the dataset itself shuffled in some way? I doubt the latter, since I only read slices of the data from an HDF5 file.

My question concerns how the different HDF5 datasets are accessed within the __getitem__ method.

The __getitem__ method does exactly what you code it to do.
Good practice is to set up the structure of the data to be loaded in __init__, for example by generating a list of files, and then to do the actual loading work in __getitem__.
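A minimal sketch of that split, under some assumptions: the class name ChunkedFileDataset and the chunk_*.pt files are hypothetical, and the files are written with torch.save rather than HDF5 so the example runs with PyTorch alone. With HDF5, the read inside __getitem__ would be the slice read from the file instead.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset


class ChunkedFileDataset(Dataset):
    """Hypothetical dataset spread over several files, each holding a few samples."""

    def __init__(self, paths):
        # Bookkeeping only: map every global sample index to a
        # (file path, row inside that file) pair.
        # Each file is loaded once here just to count its rows; with HDF5
        # the shape can be read without loading the data.
        self.index = []
        for path in paths:
            n_rows = torch.load(path).shape[0]
            self.index.extend((path, row) for row in range(n_rows))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        # The actual read happens here, once per requested sample.
        path, row = self.index[idx]
        return torch.load(path)[row]


# Build two small files so the sketch is runnable end to end.
tmp_dir = tempfile.mkdtemp()
paths = []
for file_no, n_rows in enumerate([3, 2]):
    path = os.path.join(tmp_dir, f"chunk_{file_no}.pt")
    torch.save(torch.full((n_rows, 4), float(file_no)), path)
    paths.append(path)

dataset = ChunkedFileDataset(paths)
print(len(dataset))  # 5 samples in total
print(dataset[3])    # first row of the second file
```

Reloading the file on every access is deliberately naive; it just keeps the sketch short and makes it obvious that the work happens per idx.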

The PyTorch Dataset class shuffles nothing. The PyTorch DataLoader is the class in charge of all that.
At some point you have to report how many samples your data has; that is what __len__ is for.
If you enable shuffling, the DataLoader varies the order of the idx values it passes to __getitem__, but it is totally agnostic to what each idx points to.
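One way to see this for yourself is a toy dataset that records which idx values it is asked for. This is only a sketch: RecordingDataset and its requested list are made-up names for illustration, not part of PyTorch.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RecordingDataset(Dataset):
    """Toy dataset that remembers the order in which indices are requested."""

    def __init__(self, n_samples):
        self.n_samples = n_samples
        self.requested = []

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # Nothing is stored or moved; we only note which idx was asked for.
        self.requested.append(idx)
        return idx


dataset = RecordingDataset(11)
# num_workers=0 keeps loading in this process, so `requested` is filled here.
loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=0)

for _ in loader:
    pass

print(dataset.requested)          # the indices in the order they were drawn
print(sorted(dataset.requested))  # every index from 0 to 10, each exactly once
```

The dataset never changes; only the order of the requested indices does, and they run from 0 to len(dataset) - 1.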


Thank you very much!

Is this index zero-based, like the Python convention?

Yep
Play with this:

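import torch


class Dataset(torch.utils.data.Dataset):
    def __len__(self):
        # Number of samples; the DataLoader draws indices from 0 to 10.
        return 11

    def __getitem__(self, idx):
        # Return the index itself so the access order is visible.
        return idx


s = Dataset()
loader = torch.utils.data.DataLoader(s,
                                     batch_size=1, shuffle=True,
                                     num_workers=0)

# Each batch is a one-element tensor holding a shuffled index.
for i in loader:
    print(i)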