Dataloader best practices for large sparse matrices

My data is currently stored in 100 .npz files, each with shape (275000, 60664) as (samples, features). My current solution is a custom data loader: it spins off worker processes that load the chunks in random order, the chunk order is shuffled, and the main thread pulls in a chunk to slice batches from whenever it needs a new one. The problem is that the data is passed through a Queue, which means my GPU utilization tanks every time a chunk gets exhausted. I believe one solution would be to memory-map the chunks, but I have not gone down that path yet and am wondering whether there is a way to leverage the DataLoader to get this behaviour. I would like to use the DataLoader and DistributedDataParallel with torchrun, but without implementing my own DataLoader I do not know of a way to keep the chunks in memory whenever they are needed.
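One way to get continuous prefetching without a hand-rolled Queue is to wrap the chunk iteration in an IterableDataset and let the DataLoader's own workers do the loading. Below is a minimal sketch, assuming each .npz holds a scipy CSR matrix written with scipy.sparse.save_npz; the glob pattern and loader settings are placeholders, and under torchrun/DDP you would additionally slice the file list by rank before slicing it by worker.

```python
# A minimal sketch, assuming each .npz was written with scipy.sparse.save_npz.
# Swap load_npz for np.load(path)["x"] if the files hold dense arrays instead.
import glob
import random

import numpy as np
import scipy.sparse
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ChunkedNpzDataset(IterableDataset):
    def __init__(self, pattern="data/chunk_*.npz", seed=0):
        self.files = sorted(glob.glob(pattern))
        self.seed = seed

    def __iter__(self):
        files = list(self.files)
        random.Random(self.seed).shuffle(files)  # chunk order for this epoch

        # Shard chunk files across DataLoader workers so each file is read once.
        info = get_worker_info()
        if info is not None:
            files = files[info.id::info.num_workers]

        for path in files:
            chunk = scipy.sparse.load_npz(path).tocsr()
            for i in np.random.permutation(chunk.shape[0]):
                # Densify one row at a time; the chunk itself stays sparse.
                yield torch.from_numpy(chunk[i].toarray().squeeze(0)).float()


# With num_workers > 0 the workers load and decode upcoming chunks while the
# main process trains on the current batches, so the GPU no longer stalls at
# chunk boundaries; prefetch_factor controls how far ahead the workers run.
loader = DataLoader(ChunkedNpzDataset(), batch_size=256, num_workers=4,
                    prefetch_factor=4, pin_memory=True)
```

Note that shuffling here is only within each chunk plus a shuffled chunk order, which matches what your current loader does rather than a true global shuffle.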

I think a neat solution would be to treat the loaders like Datasets, where you can stack loaders on top of each other and each one only holds the context of the loader iteration below it. The same idea could also work in parallel across multiple data loaders.
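For what it's worth, you can approximate that stacking today by letting an inner IterableDataset consume the iterator of an outer, chunk-level DataLoader, so each level only holds the item currently coming out of the level below it. This is a hedged sketch; ChunkPathDataset, RowsFromChunks, and load_chunk are hypothetical names for illustration, not existing PyTorch APIs.

```python
# Sketch of the "stacked loaders" idea: the inner dataset consumes an outer
# chunk-level DataLoader, so each level only holds the current item from the
# level below it. Paths and helper names are illustrative.
import glob

import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset


def load_chunk(path):
    # Placeholder: swap in scipy.sparse.load_npz(...) or memory-mapped reads.
    return torch.from_numpy(np.load(path)["x"]).float()


class ChunkPathDataset(IterableDataset):
    """Outer level: yields whole chunks, loaded in the outer loader's worker."""
    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            yield load_chunk(path)


class RowsFromChunks(IterableDataset):
    """Inner level: flattens chunks from the outer loader into single rows."""
    def __init__(self, chunk_loader):
        self.chunk_loader = chunk_loader

    def __iter__(self):
        for chunk in self.chunk_loader:
            for row in chunk:
                yield row


paths = sorted(glob.glob("data/chunk_*.npz"))
# batch_size=None passes chunks through uncollated; num_workers=1 already
# prefetches the next chunk (more workers would need per-worker sharding).
chunk_loader = DataLoader(ChunkPathDataset(paths), batch_size=None, num_workers=1)
row_loader = DataLoader(RowsFromChunks(chunk_loader), batch_size=256)
```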