How to load a dataset more efficiently

I created my own dataset class, but I need to change the labels of the data points at different iterations. Using a for loop to go through each point is quite slow, since it has to be repeated several times during training.

I tried to use dataset[:]['labels'] to revalue the labels, but I got TypeError: join() argument must be str or bytes, not 'Series'. The DataLoader also cannot extract all the labels without iterating over them. Do you have a more efficient solution? Thanks

If you created your own Dataset, then you could do this lazily.

You could define a variable inside your Dataset that keeps track of the current iteration, and use it to decide what label is returned. You can also add methods to update this variable, or change it directly when your epoch/iteration is done.

Here is a small example of how you could do it. When __getitem__ is called, it returns the corresponding label. In this example the label is simply self.iteration, but you would change that to whatever logic you need.

# Dataset example

import torch


class MyCustomDataset(torch.utils.data.Dataset):
    def __init__(self, *args, **kwargs):
        self.data = list(range(10))
        self.iteration = 0

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data_point = self.data[idx]

        # Here comes your logic to change the label according to the iteration
        label = self.iteration

        return data_point, label

    def change_iteration(self, iteration):
        self.iteration = iteration
      
    def reset_iteration(self):
        self.iteration = 0
# Small usage example

ds = MyCustomDataset()
dl = torch.utils.data.DataLoader(ds, batch_size=10)

for epoch in range(3):
    ds.change_iteration(epoch)
    for i, (data, lbl) in enumerate(dl):
        print(data, lbl)
# Output

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Thanks, it is helpful. One question: __getitem__ will be called in every epoch to change all the labels. Is that more efficient than changing the labels outside the dataset class, where we would only need to go through all the labels in the first epoch (if "iteration" means a full training run of the model rather than an epoch)? Thanks

If you have created a Dataset class similar to mine, then every time you iterate through it with a DataLoader you will get the items through the __getitem__ function.

This means that adding this little bit of logic should not really affect performance. On the other hand, if every time you want to change the labels you iterate through your entire dataset, you end up doing double the work (one pass to change the labels and another to fetch the items).

What we are doing at the beginning of each epoch (ds.change_iteration(epoch)) only changes one number inside the Dataset; it does not go through every data point and change its label. The label is determined later, when each item is fetched.
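If you would rather precompute all labels up front (closer to what you originally tried), you can still avoid a Python for loop by keeping the labels in a single tensor and rewriting them with one vectorized operation. Below is only a minimal sketch of that idea; the class and method names (PrecomputedLabelDataset, relabel) are placeholders I made up, and the rule "set every label to the iteration number" stands in for whatever relabeling you actually need.

# Vectorized relabel example

import torch


class PrecomputedLabelDataset(torch.utils.data.Dataset):
    def __init__(self):
        self.data = torch.arange(10)
        # Keep all labels in one tensor so they can be rewritten in bulk
        self.labels = torch.zeros(10, dtype=torch.long)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

    def relabel(self, iteration):
        # One vectorized assignment instead of a Python loop over every point;
        # replace this with whatever rule maps old labels to new ones
        self.labels = torch.full_like(self.labels, iteration)


ds = PrecomputedLabelDataset()
dl = torch.utils.data.DataLoader(ds, batch_size=10)

for epoch in range(3):
    ds.relabel(epoch)
    for data, lbl in dl:
        print(data, lbl)

This produces the same output as the example above; the difference is that relabel rewrites the whole label tensor at once, whereas change_iteration just stores a number and defers the label computation to __getitem__.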

If there is something not clear please let me know.

But it might help if you share the relevant part of your code or a reproducible example of what you want.