Train from multiple csv files each containing many samples

matthewleigh · March 5, 2021, 1:30pm

Hello,

I am trying to set up my dataloader to sample from multiple csv files within a directory, each of which contains a variable number of samples.

Each sample is a row in one of the csv files
Each file is too big to load in as a single batch (~50k)
Each file contains a different number of samples (between 30k and 60k)
There are several thousand csv files in the folder
The entire training set is too large to hold in memory (around 200M samples)

I have looked at some of the examples on this forum, and they use torchvision.datasets.DatasetFolder. However, I think that approach assumes that each csv file contains a single sample, which does not apply to my situation.
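
One possible way to meet the constraints above is to stream rows file by file and mix them through a small shuffle buffer, so that neither a whole file nor the whole dataset has to be in memory at once. The sketch below is only an illustration under stated assumptions: the function name stream_rows, the buffer_size and has_header parameters, and the assumption that every column is numeric are all hypothetical and not taken from the post. It uses only the Python standard library.

```python
import csv
import glob
import os
import random


def stream_rows(csv_dir, buffer_size=10_000, seed=0, has_header=True):
    """Yield samples (one per CSV row) from every *.csv file in csv_dir.

    Only one file is open at a time and at most `buffer_size` rows are
    held in memory, so the full dataset is never loaded.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")

    rng = random.Random(seed)
    paths = sorted(glob.glob(os.path.join(csv_dir, "*.csv")))
    rng.shuffle(paths)  # visit the files in a random order

    buffer = []
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            if has_header:
                next(reader, None)  # skip header row (assumed present)
            for row in reader:
                sample = [float(x) for x in row]  # assumes numeric columns
                if len(buffer) < buffer_size:
                    buffer.append(sample)
                    continue
                # Buffer is full: emit a random buffered sample and
                # store the new one in its place.
                i = rng.randrange(buffer_size)
                yield buffer[i]
                buffer[i] = sample

    # All files read: emit whatever is still buffered, in random order.
    rng.shuffle(buffer)
    yield from buffer
```

The same generator body could serve as the `__iter__` method of a `torch.utils.data.IterableDataset` subclass, with `DataLoader(dataset, batch_size=...)` handling the batching. Note that the shuffle is only approximate (rows mix within the buffer, not across the whole dataset), and that with `num_workers > 0` each worker would need its own subset of the files, for example chosen via `torch.utils.data.get_worker_info()`, to avoid yielding duplicates.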