When building a multimodal LLM from scratch (the dataset includes audio or image modalities, not only text), there is a lot of training data to consume, and most of it is stored as many small feature files. This can make the dataloader a bottleneck, especially with large batch sizes.
My question is: which feature storage format gives the best loading speed during training? A NumPy file (.npy), a PyTorch file (.pt), or a joblib-serialized file?
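For context, here is a minimal sketch of how the three formats could be timed against each other on many small files. The function names (`build_formats`, `benchmark_loading`), the file count, and the feature shape are placeholders I made up for illustration, not taken from any real pipeline. The torch and joblib branches only run if those packages are installed.

```python
import tempfile
import time
from pathlib import Path

import numpy as np


def build_formats():
    """Return {name: (extension, save_fn, load_fn)} for whichever libraries are installed."""
    formats = {
        "numpy": (".npy", lambda path, arr: np.save(path, arr), lambda path: np.load(path)),
    }
    try:
        import torch

        formats["torch"] = (
            ".pt",
            lambda path, arr: torch.save(torch.from_numpy(arr), path),
            lambda path: torch.load(path),
        )
    except ImportError:
        pass  # torch not installed, skip this format
    try:
        import joblib

        formats["joblib"] = (
            ".joblib",
            lambda path, arr: joblib.dump(arr, path),
            lambda path: joblib.load(path),
        )
    except ImportError:
        pass  # joblib not installed, skip this format
    return formats


def benchmark_loading(n_files=200, shape=(100, 80), seed=0):
    """Write n_files small float32 arrays per format, then time loading them all back.

    The shape is a hypothetical feature size (frames x feature dims).
    Returns {format_name: seconds spent loading}.
    """
    rng = np.random.default_rng(seed)
    arrays = [rng.standard_normal(shape).astype(np.float32) for _ in range(n_files)]
    timings = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, (ext, save, load) in build_formats().items():
            paths = [str(Path(tmp) / f"{name}_{i}{ext}") for i in range(n_files)]
            for path, arr in zip(paths, arrays):
                save(path, arr)
            start = time.perf_counter()
            for path in paths:
                load(path)
            timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for name, seconds in benchmark_loading().items():
        print(f"{name}: {seconds:.4f} s")
```

Note that the files are read right after being written, so they most likely come from the OS page cache. This measures deserialization overhead more than disk I/O, and the numbers on a real dataset read from cold storage may look different.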