Efficient (lazy) dataset loading of a large .npy file?

Hi, I want to know the most efficient Dataset/DataLoader setup for lazily loading a large .npy array dataset. I've already tried the methods below and found them to be needlessly inefficient:

• Loading the .npy file as a persistent numpy mmap (via np.load(mmap_mode='r')) → The caching is great, but it doesn't play nicely with parallel data loaders and occasionally makes __getitem__() take roughly 4-6x longer than it does with num_workers=0 (e.g. 30-60 seconds vs. 4 seconds). See the first sketch after this list.
• Loading the .npy file as a persistent torch mmap (via torch.from_file) → This is sometimes faster than numpy, but it has the same problem when num_workers>0, and it sometimes OOMs from too much shared memory (i.e. greater cache usage). See the second sketch after this list.
• Lazily opening (and closing) temporary mmaps inside __getitem__() → There is a strange problem on Linux where opening multiple mmaps on the same file adds extra overhead, and in any case I found this method slower than the other options. See the third sketch after this list.
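For reference, here is a minimal sketch of the first approach (persistent numpy mmap). The class name and the assumption that the .npy file holds a single array indexed along its first dimension are mine, not part of the original setup:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class PersistentNumpyMmapDataset(Dataset):
    """Keeps one numpy memmap open for the lifetime of the dataset."""

    def __init__(self, path):
        # mmap_mode='r' maps the file read-only instead of loading it into RAM
        self.data = np.load(path, mmap_mode='r')

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Copy the row out of the mmap so the returned tensor owns its memory
        return torch.from_numpy(np.array(self.data[idx]))
```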
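A sketch of the second approach (persistent torch mmap via torch.from_file). Since torch.from_file maps from the start of the file and takes no offset argument, this sketch assumes a headerless float32 binary with a known shape; num_rows and row_shape are placeholder parameters I introduced:

```python
import math

import torch
from torch.utils.data import Dataset

class PersistentTorchMmapDataset(Dataset):
    """Maps the whole file once with torch.from_file and keeps the tensor around."""

    def __init__(self, path, num_rows, row_shape):
        numel = num_rows * math.prod(row_shape)
        # shared=True memory-maps the file instead of copying it into memory
        flat = torch.from_file(path, shared=True, size=numel, dtype=torch.float32)
        self.data = flat.view(num_rows, *row_shape)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # clone() so the returned sample does not alias the shared mapping
        return self.data[idx].clone()
```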
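And a sketch of the third approach (a temporary mmap opened per item). The length parameter is an assumption; in practice it would come from the .npy header or be passed in once:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyMmapDataset(Dataset):
    """Re-opens the memmap on every access, so no mapping is shared across workers."""

    def __init__(self, path, length):
        self.path = path
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Fresh mapping per item; it is released when `data` goes out of scope
        data = np.load(self.path, mmap_mode='r')
        return torch.from_numpy(np.array(data[idx]))
```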

How can I achieve the ideal scenario where data loading/preprocessing costs are hidden by the DataLoader() in this situation?

P.S. Once, by chance (I cannot reproduce it), I got the data loaders and the mmap working together and achieved about 80 ms per batch.