I am working on a project in which many small (~300K) .npy files must be combined for training a deep learning model. Unfortunately, each file does not represent a single sample but rather a variable number of samples, so the files have to be merged into a unified dataset. As a result, there is a significant bottleneck during data loading: each file must be opened and appended to a unified array before the data is in a suitable format for the DataLoader. I understand this is not ideal, since iterating over many small files incurs significant per-file overhead, and my data loading can take anywhere from 30 minutes to an hour depending on the size of the dataset.
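For reference, my loading procedure currently looks roughly like this (simplified; the directory path and function name are just placeholders):

```python
import numpy as np
from pathlib import Path

def load_dataset(data_dir):
    """Iterate over every small .npy file and stack them into one array.

    Each file holds a variable number of samples (shape (n_i, ...)),
    so the result has sum(n_i) rows. Every file costs one open/read,
    which is where the overhead comes from.
    """
    arrays = []
    for path in sorted(Path(data_dir).glob("*.npy")):
        arrays.append(np.load(path))  # one disk open + read per file
    return np.concatenate(arrays, axis=0)
```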
I was wondering if anyone has run into a similar situation and would be willing to explain how you dealt with it. One idea I have is to merge ranges of small files into single, larger files, so that instead of many small files there are only a few large ones. I assume this would reduce overhead, since there would be far fewer files to iterate through.
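In code, the consolidation idea would look something like this sketch (the shard size and naming scheme are just placeholders I made up):

```python
import numpy as np
from pathlib import Path

def build_shards(data_dir, out_dir, files_per_shard=1000):
    """Merge many small .npy files into a few large 'shard' files.

    Done once offline; at training time only the shards are loaded,
    so the per-file open/read overhead drops by ~files_per_shard.
    """
    paths = sorted(Path(data_dir).glob("*.npy"))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(paths), files_per_shard):
        chunk = paths[start:start + files_per_shard]
        # Concatenate the variable-length sample arrays along axis 0.
        merged = np.concatenate([np.load(p) for p in chunk], axis=0)
        np.save(out_dir / f"shard_{start // files_per_shard:05d}.npy", merged)
```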
Does anyone have any other advice that could help speed up my data loading procedure in this scenario? Any help is much appreciated.