What is the best data storage option for PyTorch?

I’ve been using PyTorch to estimate discrete choice models with large choice sets. The data are “sessions” where each session consists of ~10k rows corresponding to the choice set.

Currently, I’m storing the data in an HDF5 (.h5) file indexed by a unique session key, then reading sessions in batches with a DataLoader. This works, but it is very slow: IO time is ~10x the computation time. On disk, the file is about 500 GB.

Is there a better option? I’ve seen WebDataset, which I’m planning to try next barring any better suggestions!

Some quick benchmarks on a large sample of the data (~83 million rows, not the full dataset), reading subsets of sessions using a query for HDF5 and the filters kwarg for Parquet.

HDF5 with session id as index:
– On disk: 5.05 GB
– Read time, 1 session: 0.018 seconds
– Read time, 5000 sessions: 26.9 seconds

Parquet (pyarrow engine) with session id partitions:
– On disk: 490 MB
– Read time, pandas, 1 session: 2.6 seconds
– Read time, pandas, 5000 sessions: 7.6 seconds
– Read time, dask, 1 session: 2.2 seconds
– Read time, dask, 5000 sessions: 47.1 seconds

I’m somewhat surprised that dask was so much slower than pandas when reading larger numbers of sessions (47.1 vs 7.6 seconds for 5000 sessions). I’m also surprised by the difference in disk space (490 MB for Parquet vs about 5 GB for HDF5). For now, I think Parquet + pandas will be sufficient for me, but I’d love to hear if anyone has other options that are worth a shot!

OK, just for fun I also tried loading it into BigQuery with session id partitions. Having the data in BQ would be useful for other reasons, so I was curious how much of a hit it would be to compute time:
– On disk: 5.5 GB
– Read time, 1 session: 2.5 seconds
– Read time, 5000 sessions: 26.4 seconds
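
To put the options on the same footing, here is the per-session cost implied by the 5000-session reads, next to the single-session read time. The numbers are copied from the benchmarks above; the script only does the division.

```python
# (seconds for 1 session, seconds for 5000 sessions), from the benchmarks above.
timings = {
    "HDF5": (0.018, 26.9),
    "Parquet + pandas": (2.6, 7.6),
    "Parquet + dask": (2.2, 47.1),
    "BigQuery": (2.5, 26.4),
}
N_SESSIONS = 5000

# Amortized cost per session when reading 5000 sessions at once, in milliseconds.
per_session_ms = {
    name: 1000 * bulk / N_SESSIONS for name, (_, bulk) in timings.items()
}

for name, ms in sorted(per_session_ms.items(), key=lambda kv: kv[1]):
    single_ms = 1000 * timings[name][0]
    print(f"{name:18s} bulk: {ms:5.2f} ms/session   single read: {single_ms:7.1f} ms")

fastest = min(per_session_ms, key=per_session_ms.get)
```

Read this way, Parquet + pandas comes out at roughly 1.5 ms per session in bulk versus about 5.4 ms for HDF5, but a one-off Parquet read costs seconds rather than milliseconds, so it only pays off when many sessions are fetched per read.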