I have a folder of 500 text files, each containing about 200 million lines, so it is not wise to load them into memory all at once.
I want to batch 64 lines per iteration, simply in the order they appear in the files, but I haven't found an efficient way to do this in PyTorch.
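For concreteness, here is roughly the behaviour I'm after, written as plain Python (the `*.txt` glob and the folder layout are my assumptions):

```python
from pathlib import Path

def iter_batches(folder, batch_size=64):
    """Yield consecutive batches of lines in file order, streaming
    line by line so only one batch is ever held in memory."""
    batch = []
    for path in sorted(Path(folder).glob("*.txt")):  # assumed file layout
        with open(path) as f:
            for line in f:
                batch.append(line.rstrip("\n"))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:  # final, possibly smaller batch
        yield batch
```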
It seems to me that in order to support random sampling, data.Dataset demands implementing __getitem__ and __len__; these two methods are unnecessary in my case and hold me back from creating my customized Dataset.
One solution I came up with is splitting these files into smaller ones, creating a Dataset for each of them, and using
ConcatDataset to iterate over them, but it doesn't seem graceful enough.
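Roughly what I have in mind for that workaround is sketched below. The `chunks/` directory, `ChunkDataset`, `read_chunk`, and the fixed lines-per-chunk count are all hypothetical; the actual splitting would happen beforehand (e.g. with the Unix `split -l` command):

```python
from functools import lru_cache
from pathlib import Path

from torch.utils.data import ConcatDataset, DataLoader, Dataset

LINES_PER_CHUNK = 100_000  # assumed chunk size used when splitting the big files

@lru_cache(maxsize=1)  # keep only the most recently used chunk in memory
def read_chunk(path):
    with open(path) as f:
        return f.read().splitlines()

class ChunkDataset(Dataset):
    """Map-style Dataset over one small chunk file produced by splitting."""
    def __init__(self, path, num_lines):
        self.path = str(path)
        self.num_lines = num_lines

    def __len__(self):
        return self.num_lines

    def __getitem__(self, idx):
        return read_chunk(self.path)[idx]

chunk_paths = sorted(Path("chunks").glob("*.txt"))  # hypothetical split output
# note: in practice the last chunk may hold fewer than LINES_PER_CHUNK lines
datasets = [ChunkDataset(p, LINES_PER_CHUNK) for p in chunk_paths]
loader = DataLoader(ConcatDataset(datasets), batch_size=64, shuffle=False)
```

Since shuffle=False iterates the chunks in order, the maxsize=1 cache keeps only the current chunk in memory, but the extra splitting step and the bookkeeping are exactly what feels ungraceful to me.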
Any idea on how to do this, just like in TensorFlow?