Dataloader trend

Hi,
I have the following questions.

  1. What is currently the most popular way to store a large dataset (more than 30 GB) for use with PyTorch?
  2. Why do I see people store a large dataset in multiple HDF5 files instead of just one? Does it increase efficiency?
  3. And how do I load multiple HDF5 files efficiently in a DataLoader? Basically, how should I write __init__() and __getitem__()? I have seen the post DataLoader, when num_worker >0, there is bug, but it only discusses a single HDF5 file. A rough sketch of what I have in mind is shown below.
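
Here is a minimal sketch of the kind of Dataset I have in mind. The shard file names, the dataset keys ("images", "labels"), and the lazy per-worker file opening are all my own assumptions, not something taken from the linked post:

```python
import bisect

import h5py
import torch
from torch.utils.data import DataLoader, Dataset


class MultiHDF5Dataset(Dataset):
    def __init__(self, file_paths):
        self.file_paths = list(file_paths)
        # Read only the lengths up front; keep no open handles here,
        # so the dataset object can be pickled and sent to worker processes.
        lengths = []
        for path in self.file_paths:
            with h5py.File(path, "r") as f:
                lengths.append(len(f["images"]))
        # Cumulative sizes map a global index to (file index, local index).
        self.cumulative = []
        total = 0
        for n in lengths:
            total += n
            self.cumulative.append(total)
        self._files = None  # opened lazily, once per worker

    def __len__(self):
        return self.cumulative[-1]

    def __getitem__(self, idx):
        if self._files is None:
            # Open the files lazily inside each worker process to avoid
            # sharing h5py handles across forked DataLoader workers.
            self._files = [h5py.File(p, "r") for p in self.file_paths]
        file_idx = bisect.bisect_right(self.cumulative, idx)
        offset = self.cumulative[file_idx - 1] if file_idx > 0 else 0
        local_idx = idx - offset
        f = self._files[file_idx]
        image = torch.from_numpy(f["images"][local_idx])
        label = int(f["labels"][local_idx])
        return image, label


# Hypothetical usage:
# dataset = MultiHDF5Dataset(["shard_0.h5", "shard_1.h5"])
# loader = DataLoader(dataset, batch_size=64, num_workers=4)
```

Is this the right pattern, or is there a more idiomatic way to handle multiple files with num_workers > 0?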
Best regards.