I am trying to set up my DataLoader to sample from multiple CSV files within a directory, each of which contains a variable number of samples.
- Each sample is a row in one of the csv files
- Each file is too big to load in as a single batch (~50k samples)
- Each file contains a different number of samples (between 30k-60k)
- There are several thousand csv files in the folder
- The entire training set is too large to hold in memory (around 200M samples)
I have looked at some of the examples on this forum, and several of them use torchvision.datasets.DatasetFolder; however, I think that assumes each file contains a single sample, which does not apply to my situation.
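For concreteness, here is a minimal sketch of the kind of thing I have in mind: a custom `IterableDataset` that streams rows lazily from each CSV file in turn, so no file (and certainly not the whole dataset) is ever held in memory at once. The directory layout, column format (features followed by a label), and class name here are just placeholders for illustration:

```python
import csv
import glob
import os
import tempfile

import torch
from torch.utils.data import IterableDataset, DataLoader


class CSVStreamDataset(IterableDataset):
    """Stream samples row by row from many CSV files.

    Assumes each row is `feature_1, ..., feature_n, label`
    (a hypothetical layout for this sketch).
    """

    def __init__(self, csv_dir):
        self.files = sorted(glob.glob(os.path.join(csv_dir, "*.csv")))

    def __iter__(self):
        # When num_workers > 0, split the file list across workers
        # so each worker reads a disjoint subset of files.
        info = torch.utils.data.get_worker_info()
        files = (self.files if info is None
                 else self.files[info.id::info.num_workers])
        for path in files:
            with open(path, newline="") as f:
                for row in csv.reader(f):
                    values = [float(v) for v in row]
                    # All columns but the last are features; last is the label.
                    yield torch.tensor(values[:-1]), torch.tensor(values[-1])


# Tiny demo with two throwaway CSV files (2 samples + 1 sample).
tmp = tempfile.mkdtemp()
for name, rows in [("a.csv", [[1, 2, 0], [3, 4, 1]]),
                   ("b.csv", [[5, 6, 1]])]:
    with open(os.path.join(tmp, name), "w", newline="") as f:
        csv.writer(f).writerows(rows)

loader = DataLoader(CSVStreamDataset(tmp), batch_size=2)
batches = list(loader)
```

One caveat with this approach: samples are yielded in file order, so any shuffling would have to be approximated (e.g. shuffling the file list per epoch and/or keeping a small in-memory shuffle buffer of rows).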