Large dataset (HDF5) with concurrent reads - and buffer?

Dear all,

I’m currently building a large textual dataset which will soon grow to tens of millions of text objects (i.e., rows in my dataset). My goal, among other things, is to apply neural topic modelling. I have access to GPUs; however, the whole dataset won’t fit into memory… so I need to come up with an efficient and effective setup for training.

My current plan:

  • Store the data in a single HDF5 file.

  • HDF5 allows concurrent reads, so I can use PyTorch’s DataLoader with multiple workers to split the workload (see the first sketch after this list).

  • I’d also like to load random batches from the dataset, which should be possible with HDF5… I’ll still have to evaluate the read-speed implications, though.

  • There has been some discussion in this forum around this topic (e.g., “DataLoader, when num_worker >0, there is bug”), which is already pretty helpful!

  • However, what I haven’t really figured out so far is how I could combine this with some sort of buffer: in principle, my idea would be to load x batches (of random samples from the dataset) into memory (the buffer) to minimize reads from the file. I think the Sampler DataPipe from PyTorch goes in this direction… but I don’t really need an iterable solution (I want to be able to draw random samples while making sure that I don’t sample the same items across workers…). A rough attempt at this is the second sketch after this list.
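
To make the multi-worker reading part concrete, here is a minimal sketch of the pattern I’m planning to use: open the HDF5 file lazily inside each worker so that no h5py handle is shared across processes (which, as far as I understand, is what the bug in the linked thread boils down to). The file name "corpus.h5", the dataset key "embeddings", and the assumption that every row is a fixed-size numeric array (e.g., pre-tokenized ids or document vectors) are just placeholders for my actual data.

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader


class H5TextDataset(Dataset):
    """Map-style dataset that opens the HDF5 file lazily, once per worker."""

    def __init__(self, path, key):
        self.path = path
        self.key = key
        self.file = None  # opened lazily so every DataLoader worker gets its own handle
        with h5py.File(path, "r") as f:
            self.length = f[key].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Opening the file here, i.e. inside the worker process, avoids sharing
        # one h5py handle across forked workers.
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        row = self.file[self.key][idx]  # one random read from the file
        return torch.from_numpy(row)


# "corpus.h5" / "embeddings" are placeholders for the real file and dataset key.
loader = DataLoader(H5TextDataset("corpus.h5", "embeddings"),
                    batch_size=256, shuffle=True, num_workers=4)
```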
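
Even though I wrote above that I don’t strictly need an iterable solution, the closest thing I’ve been able to sketch for the buffer idea is a chunk-level shuffle buffer: read a handful of contiguous chunks from the HDF5 file (cheap sequential reads), keep them in memory, shuffle within that buffer, and hand out samples from there. Chunks are partitioned across workers by worker id, so no sample is drawn twice within an epoch. Again, the file name, key, chunk size and buffer size are made-up placeholders, and the randomness is only chunk-level, not fully uniform over the whole dataset.

```python
import h5py
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info


class H5ShuffleBuffer(IterableDataset):
    """Yields samples in random order from an in-memory buffer of HDF5 chunks."""

    def __init__(self, path, key, chunk_size=4096, buffer_chunks=8, seed=0):
        self.path = path
        self.key = key
        self.chunk_size = chunk_size        # rows read per contiguous chunk
        self.buffer_chunks = buffer_chunks  # chunks kept in memory at once
        self.seed = seed
        with h5py.File(path, "r") as f:
            self.length = f[key].shape[0]

    def __iter__(self):
        worker = get_worker_info()
        worker_id = worker.id if worker else 0
        num_workers = worker.num_workers if worker else 1

        # All workers build the same global permutation of chunk starts (same
        # seed), then each takes a disjoint stride -> no duplicates across workers.
        rng = np.random.default_rng(self.seed)
        starts = np.arange(0, self.length, self.chunk_size)
        rng.shuffle(starts)
        my_starts = starts[worker_id::num_workers]

        with h5py.File(self.path, "r") as f:
            data = f[self.key]
            for i in range(0, len(my_starts), self.buffer_chunks):
                # Fill the buffer with a few contiguous chunks (cheap reads).
                buf = np.concatenate([
                    data[s:s + self.chunk_size]
                    for s in my_starts[i:i + self.buffer_chunks]
                ])
                # Shuffle within the buffer before yielding samples.
                for j in rng.permutation(len(buf)):
                    yield torch.from_numpy(buf[j])


loader = DataLoader(
    H5ShuffleBuffer("corpus.h5", "embeddings", chunk_size=4096, buffer_chunks=8),
    batch_size=256,
    num_workers=4,
)
```

The obvious trade-off is that shuffling only happens within the buffer, so I’d probably need a chunk size small enough (or a buffer large enough) that batches still look sufficiently random.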

Has anyone ever implemented something like that? Possibly with public code? Any thoughts are appreciated, thanks!