Parquet data loading strategy

Hello everyone, I hope my question isn't one of those that appear too frequently on these forums. I'm new to PyTorch and struggling with a tabular dataset. We're building a classification network trained on a 200M-row dataset of fully tabular, numeric features. Our current production pipeline pre-writes all the data stored in a Delta table to Parquet files, with every feature already transformed and ready for training.

Our issue starts when reading these files: the cluster metrics show a severe CPU bottleneck and only ~7% GPU utilization. The training pipeline uses a custom dataset that reads one file per worker with PyArrow, converts each column into a tensor, and yields rows; each file is ~150 MB. We're not using prefetching because it exhausts the cluster's RAM (we have 64 GB, and the full dataset is 120 GB).
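For context, the reader looks roughly like the minimal sketch below (simplified; `ParquetRowDataset`, `feature_cols`, and `label_col` are placeholder names, and the real code has more bookkeeping):

```python
import pyarrow.parquet as pq
import torch
from torch.utils.data import IterableDataset, get_worker_info

class ParquetRowDataset(IterableDataset):
    def __init__(self, file_paths, feature_cols, label_col):
        self.file_paths = file_paths
        self.feature_cols = feature_cols
        self.label_col = label_col

    def __iter__(self):
        worker = get_worker_info()
        # Shard the file list so each DataLoader worker reads its own files.
        if worker is None:
            files = self.file_paths
        else:
            files = self.file_paths[worker.id::worker.num_workers]

        for path in files:
            table = pq.read_table(path, columns=self.feature_cols + [self.label_col])
            # Column-to-tensor conversion happens per file on the CPU,
            # which is where the bottleneck shows up in our metrics.
            features = torch.stack(
                [torch.tensor(table[c].to_numpy(), dtype=torch.float32)
                 for c in self.feature_cols],
                dim=1,
            )
            labels = torch.tensor(table[self.label_col].to_numpy(), dtype=torch.long)
            for i in range(features.shape[0]):
                yield features[i], labels[i]
```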

Has anyone faced a similar scenario? I've read about using NVTabular or Dask, and I also noticed that the Torch Parquet dataset package has been deprecated.