I have a huge dataset that needs to be loaded into Colab in segments. I managed to load the whole dataset as a one-off in order to do the necessary processing/cleaning and put it into a giant pandas DataFrame, but I don't want to keep it in RAM while I do my training. So I then used to_pickle to save 10 separate DataFrames (each obviously containing many samples) into files.
I want to use DataLoader because I've got a collate function, and it requires a Dataset.
I designed a custom Dataset to load a single DataFrame at a time. The relevant bits look like:
self.dataframes = os.listdir('/content/drive/MyDrive/processed_data')
def __getitem__(self, idx):
Is there any way I can design a custom Dataset that will load one file at a time, and also one sample from that file at a time? The one I've got at the moment obviously loads the entire DataFrame at once.
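One possible sketch of that idea (not your actual class; the class name, directory handling, and length bookkeeping here are my assumptions): a map-style dataset that records a cumulative sample count per pickle file at init, then in __getitem__ uses bisect to find which file an index falls in, keeping only the most recently used DataFrame in RAM. Each file is still read once at init to learn its length, but only one lives in memory at any moment.

```python
import bisect
import os

import pandas as pd


class ChunkedDataFrameDataset:
    # In practice you would subclass torch.utils.data.Dataset; a map-style
    # dataset only needs __len__ and __getitem__, which DataLoader accepts.

    def __init__(self, data_dir):
        # Sort so the index -> file mapping is stable across runs.
        self.paths = sorted(
            os.path.join(data_dir, f) for f in os.listdir(data_dir)
        )
        # Load each file once (sequentially, not simultaneously) just to
        # record its length; only one DataFrame is in RAM at a time.
        self.cum = []
        total = 0
        for p in self.paths:
            total += len(pd.read_pickle(p))
            self.cum.append(total)
        self._cached_idx = None   # which file is currently loaded
        self._cached_df = None

    def __len__(self):
        return self.cum[-1] if self.cum else 0

    def __getitem__(self, idx):
        if idx < 0 or idx >= len(self):
            raise IndexError(idx)
        # Find the file whose cumulative range contains idx.
        file_idx = bisect.bisect_right(self.cum, idx)
        if file_idx != self._cached_idx:
            # Swap the cached DataFrame for the one we need.
            self._cached_df = pd.read_pickle(self.paths[file_idx])
            self._cached_idx = file_idx
        # Offset of idx within this file's DataFrame.
        offset = idx - (self.cum[file_idx - 1] if file_idx else 0)
        return self._cached_df.iloc[offset]
```

With random shuffling this will thrash (almost every sample lands in a different file), so if you go this route you'd pair it with sequential access or a sampler that shuffles within a file before moving to the next; otherwise the collate_fn/DataLoader side works as usual.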