Sample a batch from multiple text files

Hey,
I am training a language model on several text files (“books”).
In each training iteration, I want the batch to be composed of equally-sized “minibatches” drawn from all of the books, and I wonder what the correct way to implement this is. If I understand correctly, there’s no way to access multiple datasets from the same dataloader while maintaining the ‘identity’ of each dataset. Are multiple datasets the way to go? Should I implement a new Sampler and BatchSampler?
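
Here is a minimal sketch of what I have in mind, in case it clarifies the question: one DataLoader per book, zipped together so that every step yields one equally-sized minibatch from each book (the TensorDatasets are just random placeholders for the real tokenized text):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: three "books" of 100 samples each.
books = [TensorDataset(torch.randn(100, 16)) for _ in range(3)]

# One loader per book, each yielding an equally-sized minibatch.
loaders = [DataLoader(b, batch_size=8, shuffle=True) for b in books]

# zip() stops at the shortest book; each step gives one minibatch per book.
for minibatches in zip(*loaders):
    batch = torch.cat([mb[0] for mb in minibatches], dim=0)
    # batch now holds 8 samples from each of the 3 books (24 total).
```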

An important constraint is that the entire training set doesn’t fit in my computer’s memory, so I have to read it in chunks for each batch.
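
For the streaming part, I was thinking of something like an IterableDataset per book; this is just a sketch, and the BookStream class, file paths, and chunk size are made up for illustration (a real version would tokenize instead of yielding raw character chunks):

```python
from torch.utils.data import DataLoader, IterableDataset

class BookStream(IterableDataset):
    """Streams one text file in fixed-size character chunks so the
    whole book never has to sit in memory at once."""

    def __init__(self, path, chunk_chars=4096):
        self.path = path
        self.chunk_chars = chunk_chars

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            while True:
                chunk = f.read(self.chunk_chars)
                if not chunk:  # end of file
                    break
                yield chunk  # a real version would tokenize here

# Placeholder paths; combined with one loader per book, as above.
paths = ["book0.txt", "book1.txt", "book2.txt"]
loaders = [DataLoader(BookStream(p), batch_size=8) for p in paths]
```

Does something like this make sense, or is there a more idiomatic way?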

Thanks for the help!