Hi, I want to know the most efficient Dataset/DataLoader setup to lazy-load a large .npy array dataset. I've already tried the following methods and found them to be needlessly inefficient (rough sketches of each setup follow the list):
• Loading the .npy file as a persistent numpy mmap (via np.load(mmap_mode='r')) → The caching is great, but unfortunately it doesn't play nice with parallel data loaders and occasionally makes __getitem__() take roughly 4-6× longer than it does with num_workers=0 (e.g. 30-60 seconds vs 4 seconds).
• Loading the .npy file as a persistent torch mmap (via torch.from_file) → This is sometimes faster than numpy, but it has the same problem when num_workers > 0, and it sometimes OOMs from too much shared memory (i.e. greater cache usage).
• Lazily opening (and closing) temporary mmaps inside __getitem__() → There is a strange problem on Linux where opening multiple mmaps on the same file adds extra overhead, and in any case I found this method slower than the other options.
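For concreteness, here are rough sketches of the three setups I've been trying. The file paths, dataset length, sample shape, and dtype are all placeholders, and the torch.from_file variant assumes a headerless raw binary copy of the data rather than the actual .npy file, since from_file has no way to skip the .npy header:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

DATA_PATH = "data.npy"      # placeholder path
RAW_PATH = "data.bin"       # placeholder raw (headerless) copy for torch.from_file
N_SAMPLES = 1_000_000       # placeholder dataset length
SAMPLE_SHAPE = (128,)       # placeholder per-sample shape


class PersistentNumpyMmapDataset(Dataset):
    """Setup 1: a single numpy memmap kept open for the dataset's lifetime."""
    def __init__(self, path=DATA_PATH):
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # np.array() copies the sample out of the memmap into ordinary memory.
        return torch.from_numpy(np.array(self.data[idx]))


class PersistentTorchMmapDataset(Dataset):
    """Setup 2: the file mapped once via torch.from_file (shared mapping)."""
    def __init__(self, path=RAW_PATH, n_samples=N_SAMPLES):
        numel = n_samples * int(np.prod(SAMPLE_SHAPE))
        self.data = torch.from_file(
            path, shared=True, size=numel, dtype=torch.float32
        ).view(n_samples, *SAMPLE_SHAPE)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # .clone() detaches the sample from the shared mapping.
        return self.data[idx].clone()


class TemporaryMmapDataset(Dataset):
    """Setup 3: a fresh memmap opened on every single access."""
    def __init__(self, path=DATA_PATH):
        self.path = path
        self.length = np.load(path, mmap_mode="r").shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        data = np.load(self.path, mmap_mode="r")
        return torch.from_numpy(np.array(data[idx]))
```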
How can I achieve the ideal scenario where data loading/preprocessing costs are hidden by the DataLoader() in this situation?
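For reference, the loading loop I'm trying to make work looks roughly like this (batch size, worker count, and which Dataset variant to use are placeholders):

```python
from torch.utils.data import DataLoader

# Goal: have the workers prefetch batches in the background so that the mmap
# page faults / copies overlap with the training step instead of stalling it.
loader = DataLoader(
    PersistentNumpyMmapDataset(),   # any of the variants sketched above
    batch_size=256,                 # placeholder
    num_workers=4,                  # the setting that currently blows up __getitem__ time
    pin_memory=True,
    persistent_workers=True,        # requires num_workers > 0
    prefetch_factor=2,              # batches prefetched per worker
)

for batch in loader:
    pass  # training step would go here
```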
P.S. Once, by chance (I cannot reproduce this), I got the dataloaders and the mmap working together and achieved roughly 80 ms per batch.