Loading many small variable-length numpy array files


I am loading many small .npy array files that each contain a variable-length sequence of shape T×N, where T is the sequence length of that particular file and N is the feature size (the same for all arrays).

I am currently loading each .npy array file, converting it to a tensor, and then appending it to a Python list. This all happens in the __init__ of a Dataset class.
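For reference, the eager-loading setup described here might look something like this (the class and variable names are just placeholders, not from the original post):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class SequenceDataset(Dataset):
    """Eagerly loads every .npy file into memory in __init__."""

    def __init__(self, file_paths):
        # Each file holds a (T_i, N) array; T_i varies per file.
        self.sequences = [torch.from_numpy(np.load(p)) for p in file_paths]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]
```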

Is there a more efficient way to be loading the data in my scenario? If my sequence lengths were constant, I could pre-allocate a tensor of the appropriate size and then do in-memory slicing on them directly from the loaded data. I have also heard about loading the data ‘lazily’ in the __getitem__ in cases where the data may be too large to fit in memory, but would this make a difference in terms of the time it takes to load the data? Also, is my current approach amenable to multi-processing using num_workers?

You are facing a trade-off between RAM usage and speed.
If you have enough RAM, your current approach is the fastest: the arrays are already loaded and allocated in memory. As you mention, you could load the files lazily in __getitem__ and optimize things a bit, but you aren't going to find a real performance improvement by spending time on this.
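If the data ever stops fitting in memory, the lazy variant is a small change: keep only the file paths in __init__ and defer np.load to __getitem__. This also answers the num_workers question, since each DataLoader worker then performs its own file I/O in parallel. A minimal sketch (names are placeholders):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LazySequenceDataset(Dataset):
    """Defers file I/O to __getitem__; works with num_workers > 0,
    where each worker process loads its assigned files independently."""

    def __init__(self, file_paths):
        self.file_paths = list(file_paths)  # only paths live in memory

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # File is read here, in the worker process, on demand.
        return torch.from_numpy(np.load(self.file_paths[idx]))
```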

The only other interesting thing you can do is pad the sequences and create a single tensor allocated on the GPU (in case you have space), as that would save you some time.
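A sketch of the padding idea using torch.nn.utils.rnn.pad_sequence (the sequence lengths and feature size below are made up for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical variable-length sequences, each of shape (T_i, N) with N = 4
seqs = [torch.randn(t, 4) for t in (3, 5, 2)]

# Pad to a single (B, T_max, N) tensor; batch_first=True puts batch dim first
padded = pad_sequence(seqs, batch_first=True)
lengths = torch.tensor([s.size(0) for s in seqs])  # keep true lengths for masking

# Move the whole tensor to the GPU once, if there is room
if torch.cuda.is_available():
    padded = padded.to("cuda")
```

Keeping the original lengths around lets you mask out the padding later (e.g. with pack_padded_sequence for RNNs).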