If you have a very large corpus of documents, one document per line (more precisely, a TSV file: the label in one column, the document text, which contains no tabs, in another, with all fields separated by tabs), and there is no way to fit this data, or any numeric representation derived from it, into memory, how does one go about creating a Dataset subclass for it?
A PyTorch Dataset subclass must implement the __getitem__(self, index) method, but that presupposes random access to each instance, as you would have with all data loaded into memory.
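For example, this is the kind of map-style Dataset I know how to write, but it only works because everything is read into memory up front (corpus.tsv and the two-column layout are just placeholders for my real data):

```python
import csv
from torch.utils.data import Dataset

class InMemoryTsvDataset(Dataset):
    """Map-style dataset: __getitem__ works only because all rows are pre-loaded."""
    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            # Hypothetical layout: label in column 0, document text in column 1.
            self.samples = [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]
```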
But some data is just so big that it can only be accessed sequentially, so how is this done properly in PyTorch? I would still want the DataLoader to give me batches of data, but for data which cannot easily be accessed randomly.
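I have seen that torch.utils.data.IterableDataset supports purely sequential access; a minimal sketch of how I imagine applying it to my TSV (again, the file name and column layout are placeholders):

```python
import csv
from torch.utils.data import IterableDataset, DataLoader

class TsvIterableDataset(IterableDataset):
    """Streams one (label, text) pair per line; never holds the file in memory."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                yield row[0], row[1]  # (label, document text)

# The DataLoader can still batch an iterable dataset:
loader = DataLoader(TsvIterableDataset("corpus.tsv"), batch_size=32)
for labels, texts in loader:
    ...  # labels and texts are lists of up to 32 strings each
```

(I realize that with num_workers > 0 every worker would iterate the whole stream, so I would presumably have to shard it myself using torch.utils.data.get_worker_info().)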
How do other people do this for large datasets of text?
I am reluctant to put every document into its own file, since that would mean creating millions of files, which is a massive overhead on the disk in many ways (wasted storage from per-file slack space, heavy fragmentation slowing down access, etc.).
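One workaround I have been considering instead of one-file-per-document: scan the TSV once, record the byte offset where each line starts, and then seek on demand in __getitem__, which restores random access without ever loading the text itself. A rough sketch (untested; the offset index, one integer per line, is assumed to fit in memory):

```python
from torch.utils.data import Dataset

class OffsetIndexedTsvDataset(Dataset):
    """Random access into a huge TSV via a precomputed line-offset index."""
    def __init__(self, path):
        self.path = path
        self.offsets = []
        # One pass over the file: remember the byte offset of every line.
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Opening per call is slow, but keeps the dataset safe to use
        # from multiple DataLoader workers.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            line = f.readline().decode("utf-8").rstrip("\n")
        label, text = line.split("\t", 1)  # assumes label first, text second
        return label, text
```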
So what is the correct way to do this in PyTorch, or are there other libraries which support this?