I’ve been using PyTorch to estimate discrete choice models with large choice sets. The data are “sessions” where each session consists of ~10k rows corresponding to the choice set.
Currently, I’m storing the data in an HDF5 file indexed by a unique session key, then reading sessions in batches with a DataLoader. This works, but it’s very slow (the IO time is ~10x the compute time). On disk, the HDF5 file is about 500 GB.
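For context, the pipeline looks roughly like this (a simplified sketch, not my exact code; it assumes the HDF5 store was written in table format with session_id as a data column, and that the feature frame is all numeric):

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class SessionDataset(Dataset):
    """One item = one session's choice set (~10k rows) pulled from the HDF5 store."""

    def __init__(self, h5_path, session_ids):
        self.h5_path = h5_path
        self.session_ids = list(session_ids)

    def __len__(self):
        return len(self.session_ids)

    def __getitem__(self, idx):
        sid = self.session_ids[idx]
        # one where-query per session; this random read is where the IO time goes
        df = pd.read_hdf(self.h5_path, key="data", where=f"session_id == {sid}")
        return torch.as_tensor(df.to_numpy(), dtype=torch.float32)


# choice sets can differ in size, so collate a batch as a plain list of tensors
loader = DataLoader(
    SessionDataset("choices.h5", session_ids=range(100_000)),  # placeholder path/ids
    batch_size=32,
    num_workers=4,
    collate_fn=list,
)
```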
Is there a better option? I’ve seen WebDataset, which I’m planning to try next barring any better suggestions!
Some quick benchmarks on a large (but not the full) sample (~83 million rows), reading in subsets of the data using a where query for HDF5 and the filters kwarg for Parquet.
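The two read paths were roughly the following (a sketch; the file paths and the session_id column name are placeholders):

```python
import pandas as pd

ids = [101, 102, 103]  # placeholder session ids

# HDF5: where-style query against the session id column (table-format store)
h5_df = pd.read_hdf("sample.h5", key="data", where=f"session_id in {ids}")

# Parquet (pyarrow engine), partitioned on session_id: predicate pushdown via filters
pq_df = pd.read_parquet(
    "sample_parquet/",
    engine="pyarrow",
    filters=[("session_id", "in", ids)],
)
```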
HDF5 with session id as index:
– On disk: 5.05 GB
– Read in time, 1 session: 0.018 seconds
– Read in time, 5000 sessions: 26.9 seconds
Parquet (pyarrow engine) with session id partitions:
– On disk: 490 MB
– Read in time, pandas, 1 session: 2.6 seconds
– Read in time, pandas, 5000 sessions: 7.6 seconds
– Read in time, Dask, 1 session: 2.2 seconds
– Read in time, Dask, 5000 sessions: 47.1 seconds
I’m somewhat surprised that Dask was far slower than pandas when reading larger numbers of sessions, and also surprised at the differences in on-disk size. For now, I think Parquet + pandas will be sufficient for me, but I’d love to hear if anyone has other options worth a shot!
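Swapping the loader over to Parquet is basically just a change to the per-session read. A minimal sketch, assuming a hive-style session_id=<id>/ partition layout (read_session is just an illustrative helper):

```python
import pandas as pd


def read_session(root: str, sid: int) -> pd.DataFrame:
    # reading the partition directory directly skips discovering the whole dataset;
    # pd.read_parquet(root, filters=[("session_id", "==", sid)]) also works but
    # pays the dataset-discovery overhead on every call
    return pd.read_parquet(f"{root}/session_id={sid}", engine="pyarrow")
```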
Edit:
Ok, just for fun I also tried loading it into BigQuery with session id partitions (for other reasons, having it in BQ would be useful, so I was curious how much of a hit it’d be to compute time). Numbers below, plus a rough sketch of the read path.
– On disk: 5.5 GB
– Read in time, 1 session: 2.5 seconds
– Read in time, 5000 sessions: 26.4 seconds
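Roughly how a per-batch read looks with the Python client (a sketch; the project/table names are placeholders, and to_dataframe() needs the client’s pandas extras installed):

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT *
    FROM `my-project.choices.sessions`  -- placeholder table
    WHERE session_id IN UNNEST(@ids)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter("ids", "INT64", [101, 102, 103])]
)
df = client.query(sql, job_config=job_config).to_dataframe()
```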