Hi,
I am loading many small .npy array files that each contain a variable-length sequence of elements of shape T x N, where T is the sequence length (different for each array) and N is the feature size (the same for all arrays). I am currently loading each .npy file, converting it to a tensor, and appending it to a Python list. This all happens in the __init__ of a Dataset class.
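For reference, here is a minimal sketch of what I'm doing now (the directory path and file-matching pattern are just placeholders):

```python
import glob

import numpy as np
import torch
from torch.utils.data import Dataset


class SequenceDataset(Dataset):
    def __init__(self, data_dir):
        # Eagerly load every .npy file into memory as a tensor.
        self.sequences = []
        for path in sorted(glob.glob(f"{data_dir}/*.npy")):
            arr = np.load(path)  # shape (T, N), T varies per file
            self.sequences.append(torch.from_numpy(arr))

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]
```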
Is there a more efficient way to load the data in my scenario? If my sequence lengths were constant, I could pre-allocate a single tensor of the appropriate size and slice into it in memory directly from the loaded data. I have also heard about loading the data "lazily" in __getitem__ (a rough sketch of what I mean is below) for cases where the data is too large to fit in memory, but would this make any difference in the time it takes to load the data? Also, is my current approach amenable to multi-processing using num_workers?
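For clarity, this is roughly what I imagine by "lazy" loading (again, the path handling is just a placeholder):

```python
import glob

import numpy as np
import torch
from torch.utils.data import Dataset


class LazySequenceDataset(Dataset):
    def __init__(self, data_dir):
        # Only store the file paths; nothing is read into memory yet.
        self.paths = sorted(glob.glob(f"{data_dir}/*.npy"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Load the requested file on demand (e.g. inside a DataLoader worker).
        arr = np.load(self.paths[idx])
        return torch.from_numpy(arr)
```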