Train using many input files without overloading memory

I am trying to set up my dataloader to sample from multiple csv files within a directory, each of which contains a variable number of samples.

  • Each sample is a row in one of the csv files
  • Each file is too big to load as a single batch (~50k rows per file)
  • Each file contains a different number of samples (between 30k and 60k)
  • There are several thousand csv files in the folder
  • The entire training set is too large to hold in memory (around 200M samples)

I have looked at some of the examples on this forum, including ones based on torchvision.datasets.DatasetFolder, but I think that approach assumes each csv file contains a single sample, which does not apply to my situation.