I have a huge dataset that needs to be loaded into Colab in segments. I managed to load the whole dataset as a one-off in order to do the necessary processing/cleaning and put it into a giant pandas DataFrame, but I don't want to keep it in RAM while I do my training. So I then used to_pickle to save 10 separate DataFrames (each obviously containing many samples) into files.
I want to use DataLoader because I've got a collate function, and it requires a Dataset.
I designed a custom Dataset to load a single DataFrame at a time. The relevant bits look like:
self.dataframes = os.listdir('/content/drive/MyDrive/processed_data')
def __getitem__(self, idx):
Is there any way I can design a custom Dataset that will load one file at a time, and also one sample from that file at a time? The one I've got at the moment obviously loads the entire DataFrame at once.
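One possible sketch of that idea (not your actual class; the class name, directory handling, and length bookkeeping here are my assumptions): a map-style dataset that records a cumulative sample count per pickle file at init, then in __getitem__ uses bisect to find which file an index falls in, keeping only the most recently used DataFrame in RAM. Each file is still read once at init to learn its length, but only one lives in memory at any moment.

```python
import bisect
import os

import pandas as pd


class ChunkedDataFrameDataset:
    # In practice you would subclass torch.utils.data.Dataset; a map-style
    # dataset only needs __len__ and __getitem__, which DataLoader accepts.

    def __init__(self, data_dir):
        # Sort so the index -> file mapping is stable across runs.
        self.paths = sorted(
            os.path.join(data_dir, f) for f in os.listdir(data_dir)
        )
        # Load each file once (sequentially, not simultaneously) just to
        # record its length; only one DataFrame is in RAM at a time.
        self.cum = []
        total = 0
        for p in self.paths:
            total += len(pd.read_pickle(p))
            self.cum.append(total)
        self._cached_idx = None   # which file is currently loaded
        self._cached_df = None

    def __len__(self):
        return self.cum[-1] if self.cum else 0

    def __getitem__(self, idx):
        if idx < 0 or idx >= len(self):
            raise IndexError(idx)
        # Find the file whose cumulative range contains idx.
        file_idx = bisect.bisect_right(self.cum, idx)
        if file_idx != self._cached_idx:
            # Swap the cached DataFrame for the one we need.
            self._cached_df = pd.read_pickle(self.paths[file_idx])
            self._cached_idx = file_idx
        # Offset of idx within this file's DataFrame.
        offset = idx - (self.cum[file_idx - 1] if file_idx else 0)
        return self._cached_df.iloc[offset]
```

With random shuffling this will thrash (almost every sample lands in a different file), so if you go this route you'd pair it with sequential access or a sampler that shuffles within a file before moving to the next; otherwise the collate_fn/DataLoader side works as usual.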