Any advice for storing and loading datasets with too many small files?

mushan · December 17, 2022, 5:29am

Hi there,

I’m currently training a model on a large dataset made up of a huge number of small files (~1.5 TB in total; every file is a raw tensor of 50-200 KB).

Is there any advice on how to store and load this kind of dataset so that I/O is as little of a bottleneck as possible?

I have tried LMDB, but it does not work well in distributed training (multi-process reads). Other people suggest SQLite. Is that a good choice?
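
For concreteness, here is a minimal sketch of what the SQLite option might look like. Everything in it is an assumption made for illustration: the `samples` table, the `write_blobs` helper, and the `BlobReader` class are hypothetical names, tensors are treated as opaque bytes (serialization is left out), and the one-read-only-connection-per-process pattern is just one plausible way to handle multi-process reads. Whether this holds up at 1.5 TB is exactly the open question.

```python
import os
import sqlite3
from pathlib import Path


def write_blobs(db_path, items):
    """Pack (key, bytes) pairs into a single SQLite file (hypothetical schema)."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS samples (key TEXT PRIMARY KEY, data BLOB)"
    )
    con.executemany("INSERT OR REPLACE INTO samples VALUES (?, ?)", items)
    con.commit()
    con.close()


class BlobReader:
    """Reads blobs by key, opening a separate connection in each process.

    The connection is created lazily and re-created if the process ID
    changes, so a reader built in the parent is not shared with forked
    worker processes.
    """

    def __init__(self, db_path):
        self.db_path = db_path
        self._con = None
        self._pid = None

    def _connection(self):
        if self._con is None or self._pid != os.getpid():
            # Open read-only via a file: URI so workers never take write locks.
            uri = Path(self.db_path).resolve().as_uri() + "?mode=ro"
            self._con = sqlite3.connect(uri, uri=True)
            self._pid = os.getpid()
        return self._con

    def get(self, key):
        row = self._connection().execute(
            "SELECT data FROM samples WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else row[0]
```

In a training setup, something like `BlobReader.get` would presumably be called from a dataset's `__getitem__`, with the returned bytes decoded back into a tensor.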

Thanks in advance.