Question regarding dataloading class

Hi! I was wondering if there is a way for the data loading class to grab data from a somewhat complex folder structure. The main issue is that __getitem__ takes a single input, idx, and __len__ returns a single length value. For example, a folder structure like this seems to fail under that design:


The file count under each date folder varies. Is there a way to iterate through all the data without resorting to piling all the files into one large folder?

Although __getitem__ receives an index, you can do whatever you want inside it: you can ignore the index and maintain your own indexing logic, or use it if it helps. See the example below (I didn't test it, and it probably needs more work for multiple workers, but it should give you the idea for your needs):

import os

from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, root_folder):
        self.root_folder = root_folder
        self.idx_to_date, self.idx_to_len = self._get_idx_to_date_folder_name()
        self.next_index = 0  # or any random int

    def _get_idx_to_date_folder_name(self):
        # walk the tree under the root folder and collect metadata for each date folder
        prefix = 'Date'
        idx_to_date_folder = {}
        idx_to_len = {}
        idx = 0
        for dir_path, dir_names, _ in os.walk(self.root_folder):
            for dir_name in dir_names:
                if not dir_name.startswith(prefix):
                    continue  # skip folders that are not date folders
                # store the full path so nested date folders stay unambiguous
                folder_path = os.path.join(dir_path, dir_name)
                idx_to_date_folder[idx] = folder_path
                # count the entries inside the date folder itself
                idx_to_len[idx] = len(os.listdir(folder_path))
                idx += 1
        return idx_to_date_folder, idx_to_len

    def _load_data_from_folder(self, folder_name):
        raise NotImplementedError('your loading logic')

    def __len__(self):
        # next data from __getitem__ will have this len (for single worker mode)
        # the modulo keeps the lookup valid when next_index exceeds the folder count
        return self.idx_to_len[self.next_index % len(self.idx_to_len)]

    def __getitem__(self, index):
        # ignore index for now, use self.next_index, it will help you return the correct "__len__"
        date_index = self.next_index % len(self.idx_to_date)
        folder_name = self.idx_to_date[date_index]

        X, y = self._load_data_from_folder(folder_name)

        # get ready for next time you request __len__ or __getitem__
        self.next_index = index

        return X, y
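
A different way to fit nested folders into a single idx and a single length is to scan the folders once and keep a flat list of files. This is only a rough sketch of that alternative, not the approach above. It assumes the date folders sit directly under the root and start with 'Date'; NestedFolderDataset is a made-up name, and __getitem__ returns the file path as a placeholder for real loading logic. It is written as a plain class so it runs without PyTorch; in practice you would probably subclass torch.utils.data.Dataset.

```python
import os


class NestedFolderDataset:
    """Map-style dataset sketch: one flat index over files in per-date subfolders."""

    def __init__(self, root_folder, prefix="Date"):
        # Assumption: date folders are direct children of root_folder.
        self.samples = []  # flat list of file paths, one entry per sample
        for dir_name in sorted(os.listdir(root_folder)):
            dir_path = os.path.join(root_folder, dir_name)
            if not (os.path.isdir(dir_path) and dir_name.startswith(prefix)):
                continue  # skip anything that is not a date folder
            for file_name in sorted(os.listdir(dir_path)):
                self.samples.append(os.path.join(dir_path, file_name))

    def __len__(self):
        # total number of files across all date folders
        return len(self.samples)

    def __getitem__(self, idx):
        # Placeholder: real code would load the file and return (X, y).
        return self.samples[idx]
```

With this layout the varying file count per date folder no longer matters, because the length is just the total number of files.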

Also worth mentioning: you don't have to use DataLoader or Dataset. You can write your own Python generator (using yield) if that is easier for your situation.
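
To make the generator idea concrete, here is a minimal sketch. It assumes the date folders sit directly under the root and start with 'Date'; iter_date_folder_files is a hypothetical name, and it yields file paths rather than loaded data.

```python
import os


def iter_date_folder_files(root_folder, prefix="Date"):
    """Yield (date_folder_name, file_path) for every file in every date folder."""
    # Assumption: date folders are direct children of root_folder.
    for dir_name in sorted(os.listdir(root_folder)):
        dir_path = os.path.join(root_folder, dir_name)
        if not (os.path.isdir(dir_path) and dir_name.startswith(prefix)):
            continue  # skip anything that is not a date folder
        for file_name in sorted(os.listdir(dir_path)):
            yield dir_name, os.path.join(dir_path, file_name)
```

You would then loop over it with something like `for date, path in iter_date_folder_files(root): ...` and do the loading inside the loop. You lose batching and shuffling this way, so it fits best when you only need a simple pass over the files.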



Ah I see. So does the PyTorch DataLoader generate a fresh list of indices every time an epoch finishes? Say I have a dataset of 6k unique samples: would PyTorch generate 6k indices to enumerate, so that no input and ground truth pair is repeated?

Hmm, I believe the official DataLoader has some indexing logic (like you mentioned), which may also depend on whether you set shuffle=True/False, pass a sampler, etc., but I always add an extra guard with the modulo operation ("%") even if it is redundant.
I recommend reading the official guide to get a better understanding of what you can do with Dataset and DataLoader, here is a link:
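
For what it's worth, with the default samplers each index in range(len(dataset)) is produced exactly once per epoch, and shuffle=True draws a new order for every epoch. A small check, assuming a toy dataset (IndexEcho is a made-up name) that just returns the index it was asked for:

```python
from torch.utils.data import DataLoader, Dataset


class IndexEcho(Dataset):
    """Toy dataset that returns the index it receives."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx


def epoch_indices(loader):
    # Flatten all batches of one pass over the loader into a list of ints.
    return [int(i) for batch in loader for i in batch]


loader = DataLoader(IndexEcho(10), batch_size=4, shuffle=True)
first_epoch = epoch_indices(loader)   # one full pass
second_epoch = epoch_indices(loader)  # a second pass, reshuffled

# Every index appears exactly once per epoch, whatever the order.
print(sorted(first_epoch) == list(range(10)))
print(sorted(second_epoch) == list(range(10)))
```

If you pass a custom sampler, or a RandomSampler with replacement=True, this no longer has to hold, which is one reason a modulo guard can still be useful.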
