NLP model dataset loader

I’m trying to train a BERT-style model using PyTorch. I’m looking for recommendations on how to store the pretraining data (e.g. HDF5, Parquet, TFExample, Apache Arrow…) and how to load it for training in PyTorch.

It would be great if there were a dataset loader that already supported multi-process loading, training with multiple workers, and mixing different datasets, without having to write much code.

I can highly recommend 🤗 Datasets, which uses the very fast and efficient Apache Arrow data format under the hood: GitHub - huggingface/datasets: 🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools