How do I define the __len__ method for a PyTorch DataLoader when my datasets have different lengths?

I’m currently loading my data with a single Dataset class and splitting the train, validation, and test data inside it. For example:

from torch.utils.data import Dataset

class Data(Dataset):
    def __init__(self):
        self.load()

    def load(self):
        with open(file=file_name, mode='r') as f:
            self.data = f.readlines()

        # checkpoint and halfway mark where the splits begin
        self.train = self.data[:checkpoint]
        self.valid = self.data[checkpoint:halfway]
        self.test = self.data[halfway:]

Many of the details have been omitted for the sake of readability. Basically, I read in one big dataset and make the splits manually.

My question is: how should I override the __len__ method when the lengths of my train, valid, and test data all differ?

I want to keep the split data in one single class, but I also want to create a separate DataLoader for each split, so something like:

def __len__(self):
    return len(self.train)

wouldn’t be appropriate for self.test and self.valid.
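To make the problem concrete, here is roughly how I’d like to construct the loaders (the batch size is a placeholder):

from torch.utils.data import DataLoader

data = Data()
# all three loaders wrap the same Dataset object, so they would all
# see the same __len__, even though the splits have different sizes
train_loader = DataLoader(data, batch_size=32, shuffle=True)
valid_loader = DataLoader(data, batch_size=32)
test_loader = DataLoader(data, batch_size=32)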

Perhaps I’m fundamentally misunderstanding the DataLoader, but how should I approach this issue? Thanks in advance.

I’m not sure there is a clean way of handling different subsets within a single Dataset class.
If you want to handle the split yourself, I would rather create a custom function or class (if stateful) and return the corresponding dataset using e.g. TensorDataset or Subset.

E.g., this might be a usable starting point:


from torch.utils.data import TensorDataset


class MyDatasetSplitter(object):
    def __init__(self, file_name, checkpoint, halfway):
        # store the file name and split indices before calling
        # load(), which uses them
        self.file_name = file_name
        self.checkpoint = checkpoint
        self.halfway = halfway
        self.datasets = {}
        self.load()

    def load(self):
        with open(file=self.file_name, mode='r') as f:
            self.data = f.readlines()

        # convert the lines to tensors in whatever way fits your data
        # before wrapping them; TensorDataset expects tensor arguments
        self.datasets['train'] = TensorDataset(
            self.data[:self.checkpoint])
        self.datasets['valid'] = TensorDataset(
            self.data[self.checkpoint:self.halfway])
        self.datasets['test'] = TensorDataset(
            self.data[self.halfway:])

    def get_dataset(self, split):
        return self.datasets[split]
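For completeness, each split could then be wrapped in its own DataLoader like this (the file name and split indices are made up for illustration):

from torch.utils.data import DataLoader

splitter = MyDatasetSplitter('data.txt', checkpoint=800, halfway=900)
train_loader = DataLoader(splitter.get_dataset('train'), batch_size=32, shuffle=True)
valid_loader = DataLoader(splitter.get_dataset('valid'), batch_size=32)
test_loader = DataLoader(splitter.get_dataset('test'), batch_size=32)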

Would that work for your use case?
In this way, each split is a custom Dataset and can be wrapped by a DataLoader separately.
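If you already have the full data in a single map-style Dataset, Subset (mentioned above) avoids the manual wrapping. A minimal sketch, assuming full_dataset is such a Dataset and checkpoint and halfway are the split indices from your code:

from torch.utils.data import DataLoader, Subset

# carve the splits out of one Dataset by index;
# each Subset reports its own __len__
train_set = Subset(full_dataset, range(0, checkpoint))
valid_set = Subset(full_dataset, range(checkpoint, halfway))
test_set = Subset(full_dataset, range(halfway, len(full_dataset)))

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)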


Yes, that approach would work. I could also just create separate Dataset classes, but I was wondering if there was a clean way to handle this in one class. It seems not, unfortunately. 🙁