Shuffling of the dataset

Carsten_Ditzel · February 1, 2019, 4:57pm

Is shuffling of the dataset performed by randomizing the access index passed to the __getitem__ method, or is the dataset itself shuffled in some way? I doubt the latter, since I only slice parts of the data from an HDF5 file.

My question concerns accessing different HDF5 datasets within the __getitem__ method.

JuanFMontesinos · February 1, 2019, 5:12pm

The __getitem__ function does exactly what you code it to do.
Good practice is to set up the structure of the data to be loaded in __init__ (for example, by generating a list of files) and then to do all the actual loading work in __getitem__.
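
To illustrate that split, here is a minimal sketch, not a definitive implementation. The class name MultiSourceDataset and the names "file_a" and "file_b" are made up for this example, and plain Python lists stand in for the HDF5 datasets from the question. __init__ only records the layout, and __getitem__ translates a flat idx into a source and a local row.

```python
import bisect
from torch.utils.data import Dataset


class MultiSourceDataset(Dataset):
    """Hypothetical example: one flat index over several stored arrays."""

    def __init__(self, sources):
        # `sources` is an assumed stand-in for several HDF5 datasets:
        # a dict mapping a name to a sequence. Only the layout is recorded.
        self.sources = sources
        self.names = list(sources)
        # Cumulative end offsets, e.g. lengths 3 and 2 give [3, 5].
        self.offsets = []
        total = 0
        for name in self.names:
            total += len(sources[name])
            self.offsets.append(total)

    def __len__(self):
        # Total number of samples across all sources.
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find the source that the flat, zero-based index falls into.
        which = bisect.bisect_right(self.offsets, idx)
        start = self.offsets[which - 1] if which > 0 else 0
        name = self.names[which]
        # The actual read happens here, one sample at a time.
        return name, self.sources[name][idx - start]


data = MultiSourceDataset({"file_a": [10, 11, 12], "file_b": [20, 21]})
print(len(data), data[0], data[3])
```

With real HDF5 files, the list lookup in __getitem__ would be replaced by a slice read from the matching file; the index bookkeeping stays the same.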

The Dataset class (of PyTorch) shuffles nothing. The DataLoader (of PyTorch) is the class in charge of all that.
At some point you have to return the number of elements your data has, i.e. how many samples, which is what __len__ is for.
If you enable shuffling, the DataLoader varies the order of the idx values; however, it is totally agnostic to what each idx points to.
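
A small sketch of the index side on its own, using the sampler classes from torch.utils.data (SequentialSampler for the unshuffled case, RandomSampler for shuffle=True). The IndexEcho class is a made-up name for this example. The samplers only ever see len(dataset) and produce integers; they never touch the data.

```python
from torch.utils.data import Dataset, RandomSampler, SequentialSampler


class IndexEcho(Dataset):
    """Hypothetical dataset that just reports which idx was requested."""

    def __len__(self):
        return 5

    def __getitem__(self, idx):
        return idx


ds = IndexEcho()

# Without shuffling, indices come in order, starting at zero.
sequential = list(SequentialSampler(ds))

# With shuffling, the same indices come back in a random order.
shuffled = list(RandomSampler(ds))

print(sequential)
print(shuffled)
```

The shuffled list is a permutation of the sequential one, so every sample is still visited exactly once per pass.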

Carsten_Ditzel · February 1, 2019, 5:14pm

thank you very much!

Carsten_Ditzel · February 1, 2019, 5:22pm

Is this index zero-based, like the Python convention?

JuanFMontesinos · February 1, 2019, 5:38pm

Yep.
Play with this:

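import torch

class Dataset(torch.utils.data.Dataset):
    def __len__(self):
        # Number of samples; the DataLoader draws idx from 0 to 10.
        return 11

    def __getitem__(self, idx):
        # Return the index itself so the access order is visible.
        return idx

s = Dataset()
loader = torch.utils.data.DataLoader(s,
                                     batch_size=1, shuffle=True,
                                     num_workers=0)

# Each batch is a one-element tensor holding the idx that was requested.
for i in loader:
    print(i)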