I’m currently training a model on a large dataset made up of a huge number of small files (~1.5 TB in total; every file is a raw tensor of 50–200 KB).
Does anyone have advice on how to store and load this kind of dataset so that I/O is as small a bottleneck as possible?
I have tried LMDB, but it does not work well in distributed training (multi-process reads). Others have suggested SQLite; is that a good choice?
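For context, my current layout is roughly the following (a minimal sketch; the `sample_%08d.npy` naming and NumPy `.npy` storage here are stand-ins for the actual format, not the exact code I use):

```python
import pathlib
import tempfile

import numpy as np

root = pathlib.Path(tempfile.mkdtemp())

# Write a few small tensors, one file each (the real dataset has millions).
for i in range(4):
    np.save(root / f"sample_{i:08d}.npy",
            np.random.rand(128, 128).astype(np.float32))

# Every sample access pays one filesystem open + read of a tiny file,
# which is where the I/O bottleneck shows up at scale.
def load_sample(idx: int) -> np.ndarray:
    return np.load(root / f"sample_{idx:08d}.npy")

x = load_sample(2)
print(x.shape, x.dtype)  # each file here is ~64 KB, in the 50-200 KB range
```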
Thanks in advance.