I’m trying to train a BERT-style model in PyTorch. I’m looking for recommendations on how to store the pretraining data (e.g. HDF5, Parquet, TFRecord/tf.Example, Apache Arrow…) and how to load it efficiently for training.
It would be great if there were a dataset loader that already supports multi-process loading, training with multiple workers, and mixing different datasets, without having to write much code.