Best Practices for Storing/Accessing Preprocessed Data

Hey guys,

I’m looking for suggestions on how to speed up processing times when storing/accessing large amounts of data for training. Presently, I am using Pandas DataFrames saved to CSV, and then I load just the part of the CSV I need at any given time in my CustomDataset.
But this has been quite slow, since the CSV file is over 300 MB (and it might get bigger later, too). Loading the entire file and accessing the parts I need via .iloc is even slower. So I wanted to check what you all use. I see there are a few options:

  1. Pandas DataFrames
  2. Numpy Arrays
  3. Tensors saved to .pt files
  4. ???

What have you found to work best for performance?
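For reference, here is a minimal sketch of what saving and loading looks like for options 1–3; the array shape and filenames are made up purely for illustration, and actual timings will depend on your data.

```python
import numpy as np
import pandas as pd
import torch

# Placeholder preprocessed data: 10,000 samples with 180 features each
data = np.random.rand(10_000, 180).astype(np.float32)

# 1. Pandas DataFrame -> CSV (text format, largest on disk, slowest to parse)
pd.DataFrame(data).to_csv("data.csv", index=False)
df = pd.read_csv("data.csv")

# 2. NumPy array -> .npy (binary, can be memory-mapped so rows are read on access)
np.save("data.npy", data)
arr = np.load("data.npy", mmap_mode="r")

# 3. Tensor -> .pt (binary, loads straight back into torch)
torch.save(torch.from_numpy(data), "data.pt")
t = torch.load("data.pt")
```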

It likely depends on the particular bottlenecks of your system. If there is a large amount of system memory but limited file I/O, you might want to keep the entire file in memory at all times. If there isn’t much system memory but sufficient file I/O, you could split the file into smaller chunks. The actual file format may not matter as much as the basic loading algorithm design.

What I ended up doing, which worked quite well, was to split the data across files of 1000 training samples each, simply by setting the filename to 'data-'+str(i//1000)+'.csv'. Since each training sample has 180 data points, the CSV files are only a couple of MB each. If your training samples are larger, the same approach works by changing 1000 to something smaller, e.g. 100.
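As a rough sketch of that preprocessing loop (here `samples` is just a stand-in for whatever iterable your real preprocessing produces, each item being one row of 180 values):

```python
import csv

CHUNK_SIZE = 1000  # drop to e.g. 100 if individual samples are large

# Stand-in for the real preprocessing output: 2500 samples of 180 values each
samples = ([0.0] * 180 for _ in range(2500))

writer, f = None, None
for i, sample in enumerate(samples):
    if i % CHUNK_SIZE == 0:  # start a new chunk file every CHUNK_SIZE samples
        if f is not None:
            f.close()
        f = open('data-' + str(i // CHUNK_SIZE) + '.csv', 'w', newline='')
        writer = csv.writer(f)
    writer.writerow(sample)
if f is not None:
    f.close()

print(i)  # last index, so you know how many samples you wrote
```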

Just make sure to print `i` at the end of your preprocessing script so you know what to set `__len__` to in your CustomDataset.
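And here is a hedged sketch of what the matching CustomDataset could look like, assuming the chunked files from above; `total_samples` is the count you noted down, and caching the most recently read chunk avoids re-reading the same file for consecutive indices.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, total_samples, chunk_size=1000):
        self.total_samples = total_samples
        self.chunk_size = chunk_size
        self._cached_chunk_id = None
        self._cached_chunk = None

    def __len__(self):
        return self.total_samples

    def __getitem__(self, idx):
        chunk_id = idx // self.chunk_size
        if chunk_id != self._cached_chunk_id:
            # Load the 1000-sample file that contains this index
            self._cached_chunk = pd.read_csv(
                'data-' + str(chunk_id) + '.csv', header=None
            ).values
            self._cached_chunk_id = chunk_id
        row = self._cached_chunk[idx % self.chunk_size]
        return torch.as_tensor(row, dtype=torch.float32)
```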