I’m currently loading in my data with one single dataset class. Within the dataset, I split the train, test, and validation data separately. For example:
class Data():
    def __init__(self):
        self.load()

    def load(self):
        with open(file=file_name, mode='r') as f:
            self.data = f.readlines()
        self.train = self.data[:checkpoint]
        self.valid = self.data[checkpoint:halfway]
        self.test = self.data[halfway:]
Many of the details have been omitted for the sake of readability. Basically, I read in one big dataset and make the splits manually.
My question is how to override the __len__ method when the lengths of my train, valid, and test data all differ.
I want to keep the split data in a single class, but I also want to create a separate DataLoader for each split, so something like:
def __len__(self):
    return len(self.train)
wouldn’t be appropriate for self.test and self.valid.
Perhaps I’m fundamentally misunderstanding the Dataloader, but how should I approach this issue? Thanks in advance.
I’m not sure there is a clean way of handling different subsets within a single Dataset class.
If you want to handle the split yourself, I would rather create a custom function (or a class, if it needs to be stateful) that returns the corresponding datasets using e.g. TensorDataset or Subset.
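A minimal sketch of the Subset idea, assuming the samples are already in memory. `LineDataset`, `split_dataset`, the toy data, and the split points 6 and 8 are hypothetical stand-ins for the omitted details in the question:

```python
from torch.utils.data import DataLoader, Dataset, Subset


class LineDataset(Dataset):
    """Holds the full, unsplit list of samples (hypothetical helper)."""

    def __init__(self, lines):
        self.lines = list(lines)

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        return self.lines[idx]


def split_dataset(dataset, checkpoint, halfway):
    """Return (train, valid, test) Subsets; each one reports its own length."""
    n = len(dataset)
    train = Subset(dataset, range(0, checkpoint))
    valid = Subset(dataset, range(checkpoint, halfway))
    test = Subset(dataset, range(halfway, n))
    return train, valid, test


# Toy data standing in for f.readlines(); the split points are assumptions.
full = LineDataset([f"line {i}" for i in range(10)])
train, valid, test = split_dataset(full, checkpoint=6, halfway=8)

# One DataLoader per split; len() of each Subset is the size of that split.
train_loader = DataLoader(train, batch_size=2, shuffle=True)
valid_loader = DataLoader(valid, batch_size=2)
test_loader = DataLoader(test, batch_size=2)
```

Because each Subset has its own __len__, the underlying dataset only needs to report the length of the full data, and no per-split override is required.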
Yes, this is also a method that would work. I could also just create separate dataset classes, but was wondering if there was a clean way to handle this issue. It seems not, unfortunately.
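For reference, the separate-dataset alternative could look roughly like the sketch below: one class, instantiated once per split, so each instance's __len__ matches its own data. `SplitData`, its `split` argument, and the toy `lines` are hypothetical. In real code the class would normally subclass torch.utils.data.Dataset; that is left out here only so the snippet runs without PyTorch installed.

```python
class SplitData:
    """One split of the data per instance (hypothetical sketch)."""

    def __init__(self, lines, split, checkpoint, halfway):
        # Map each split name to the slice of the full data it covers.
        bounds = {
            "train": slice(0, checkpoint),
            "valid": slice(checkpoint, halfway),
            "test": slice(halfway, None),
        }
        if split not in bounds:
            raise ValueError(f"unknown split: {split!r}")
        self.samples = lines[bounds[split]]

    def __len__(self):
        # Length of this instance's split only, not of the full data.
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


# Toy data standing in for f.readlines(); the split points are assumptions.
lines = [f"line {i}" for i in range(10)]
train = SplitData(lines, "train", checkpoint=6, halfway=8)
valid = SplitData(lines, "valid", checkpoint=6, halfway=8)
test = SplitData(lines, "test", checkpoint=6, halfway=8)
```

Each instance can then be handed to its own DataLoader, at the cost of constructing the object three times.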