Storing large datasets on disk in a format that allows indexing without loading?

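You could use WebDataset (the `webdataset` package) to store your dataset as .tar files. The resulting dataset can then be passed to a PyTorch DataLoader like any other dataset, and samples are read from the archives as they are needed rather than being loaded into memory up front. It also works with multiple DataLoader worker processes during training.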
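
To make the .tar layout concrete, here is a minimal sketch that uses only Python's standard `tarfile` module, not the `webdataset` API itself. It assumes the usual WebDataset convention that files sharing a basename (e.g. `sample000000.bin` and `sample000000.json`) belong to one sample. The helper names `write_shard` and `iter_shard`, the toy data, and the shard filename are all made up for illustration.

```python
import io
import json
import os
import tarfile
import tempfile


def write_shard(path, samples):
    """Write samples to a .tar shard; files sharing a basename form one sample."""
    with tarfile.open(path, "w") as tar:
        for key, sample in samples:
            for ext, payload in sample.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def iter_shard(path):
    """Stream samples from a shard one at a time instead of loading it all."""
    current_key, sample = None, {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # "sample000000.json" -> key "sample000000", extension "json"
            key, ext = member.name.split(".", 1)
            if key != current_key and sample:
                yield current_key, sample
                sample = {}
            current_key = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield current_key, sample


# Hypothetical toy data: three samples, each with raw bytes and a JSON label.
toy = [
    (
        f"sample{i:06d}",
        {"bin": bytes([i]) * 4, "json": json.dumps({"label": i}).encode()},
    )
    for i in range(3)
]
shard_path = os.path.join(tempfile.mkdtemp(), "shard-000000.tar")
write_shard(shard_path, toy)
loaded = list(iter_shard(shard_path))
```

With the `webdataset` package installed, a shard laid out this way should be readable with something like `webdataset.WebDataset(shard_path)`, and that object can be handed to `torch.utils.data.DataLoader`. Treat that as a pointer to check against the library's documentation rather than tested code.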