How to customize the dataloader?

Hi all, I want to create a customized dataloader that works as follows:

For each epoch, select a position in a given list without replacement. So, if the list has length 10, then after 10 epochs, every position in the list has been selected.

import numpy as np
import torch.utils.data as data

class My_Dataset(data.Dataset):

    def __init__(self, data_list):
        self.data_list = data_list

    def __getitem__(self, index):
        # This picks a random position, but WITH replacement
        position = np.random.randint(0, len(self.data_list))
        return self.data_list[position]

    def __len__(self):
        return len(self.data_list)

For example, with data_list=[1,4,5,7,8]: if the first epoch selects position = 4, then the second epoch should select any position except 4 (because it was already selected in the first epoch), and so on. We re-permute the list once the epoch count exceeds the size of data_list.
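One way to sketch this epoch-wise selection without replacement (the `set_epoch` method and the seeding are my own additions for illustration, not standard PyTorch API):

```python
import numpy as np
import torch.utils.data as data

class EpochDataset(data.Dataset):
    """Picks one list position per epoch, without replacement across epochs."""

    def __init__(self, data_list, seed=0):
        self.data_list = data_list
        self.rng = np.random.default_rng(seed)
        # Fixed random order of positions for the first len(data_list) epochs.
        self.order = self.rng.permutation(len(data_list))
        self.epoch = 0

    def set_epoch(self, epoch):
        # Call once at the start of each epoch from the training loop.
        if epoch > 0 and epoch % len(self.data_list) == 0:
            # All positions used: re-permute for the next round of epochs.
            self.order = self.rng.permutation(len(self.data_list))
        self.epoch = epoch

    def current_position(self):
        return int(self.order[self.epoch % len(self.data_list)])

    def __getitem__(self, index):
        # Every item fetched during this epoch uses the same chosen position.
        return self.data_list[self.current_position()]

    def __len__(self):
        return len(self.data_list)
```

The training loop would call `dataset.set_epoch(epoch)` once per epoch; the permutation guarantees no position repeats until the list is exhausted, after which it is reshuffled.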

I suppose you want to shuffle your input data, but you should not do that in __getitem__. There are two ways to do it:

  • Shuffle self.data_list in the __init__ function
  • Use the DataLoader and set the parameter shuffle=True; that will take care of the shuffling part.
from torch.utils.data import DataLoader
dataset = MyDataset(...)

data_loader = DataLoader(dataset,
                         batch_size=32,
                         shuffle=True,
                         num_workers=1)

Thanks. But how do I make the selection of positions without replacement? Your code is simple; I can do that part.

Both methods I have suggested will result in random selection without replacement.

Sorry, but you may have misunderstood my question. I want to select a position in the data_list in each epoch, such that the position does not repeat in the next epoch.

One epoch goes through all of the data samples, right? This position that you are talking about: is it the same as the index of the samples, or different?

Yes, each epoch goes through the whole set of data samples, but the data list is not the set of samples. The data list is more like an arbitrary array.

Ok, one more question! Is the size of the data list the same as the number of samples? It has to be that way, because you want sampling without replacement. Is that right?

No, the length of the data list is often bigger than the number of data samples. It may be 1000, while the number of samples is 100. It stores positions of ROIs in the images. Based on an ROI, I can crop the image into a smaller image.
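For readers following along, the ROI-based cropping mentioned here could look roughly like this (the `(top, left, height, width)` ROI layout is my assumption, not stated in the thread):

```python
import numpy as np

def crop_roi(image, roi):
    # roi = (top, left, height, width); assumed layout, adapt to your format
    top, left, h, w = roi
    return image[top:top + h, left:left + w]

image = np.arange(100).reshape(10, 10)
patch = crop_roi(image, (2, 3, 4, 5))
# patch has shape (4, 5)
```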

I see. I was worried about the case where it is smaller, but if the size is larger, then it works.

So, what you can do is, in the __init__ function, create a random array of these positions like below:

    def __init__(self):
        ....
        # Each of the 1000 positions appears exactly once, in random order
        self.position_arry = np.random.choice(1000, 1000, replace=False)

Then, in the __getitem__ function you can take a value from this array. But that would only use the first 100 elements and ignore the rest, since the input index is always in 0 <= index < 100. So, to fix this, we can change the __len__ function to return 1000 instead of 100, and then in __getitem__ we do the following:

    def __getitem__(self, index):
        # Map the (0..999) index down to a valid sample index (0..99)
        indx_data = index % 100

        position = self.position_arry[index]
        data = ...  # use indx_data to retrieve the correct sample

So, we use the index to retrieve the position, and indx_data to retrieve the sample. Also, note that the meaning of an epoch changes as well: one epoch like this corresponds to 10 epochs before.
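Putting the suggestion together, here is a runnable sketch (the sizes 1000/100 come from the discussion; the class name and returning a `(sample, position)` pair are my own choices):

```python
import numpy as np
import torch.utils.data as data

NUM_POSITIONS = 1000  # length of the position list (from the discussion)
NUM_SAMPLES = 100     # number of actual data samples (from the discussion)

class RoiDataset(data.Dataset):
    def __init__(self, samples):
        self.samples = samples
        # Each of the 1000 positions appears exactly once per pass
        self.position_arry = np.random.choice(NUM_POSITIONS, NUM_POSITIONS,
                                              replace=False)

    def __len__(self):
        # Report the number of positions so one pass visits each one once
        return NUM_POSITIONS

    def __getitem__(self, index):
        indx_data = index % NUM_SAMPLES       # sample index, cycles 0..99
        position = self.position_arry[index]  # ROI position, never repeats
        return self.samples[indx_data], position
```

With shuffle=True in the DataLoader, each "extended" epoch of 1000 iterations visits every ROI position exactly once, while each sample is used about 10 times.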
