How to use a huge line corpus (text) with Dataset/DataLoader?

If you have a very large corpus of text documents, one document per line (it is actually a TSV file: the label is in one column, the document text, which contains no tabs, is in another, and all fields are separated by tabs), and there is no way to fit this data, or any numeric representation created from it, into memory, how does one go about creating a Dataset subclass for it?

The PyTorch Dataset interface requires that the __getitem__(self, index) method is implemented, but this assumes random access to each instance, as you would have with all data loaded into memory.
Some data is simply so big that it can only be read sequentially; how is this handled properly in PyTorch?
I would still want the DataLoader to give me batches of data etc., but for data that cannot easily be accessed by index.
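
For reference, a minimal sketch of the map-style interface in question (class and variable names are made up for illustration); it assumes all rows already fit in memory, which is exactly what is not possible here:

```python
from torch.utils.data import Dataset

class InMemoryTsvDataset(Dataset):
    """Illustrative only: assumes all (label, text) rows fit in memory."""

    def __init__(self, rows):
        # rows: a list of (label, text) tuples, fully loaded into memory
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        # random access by index is assumed to be cheap
        return self.rows[index]
```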

How do other people do this for large datasets of text?

I am reluctant to put every document into its own file, since that would mean creating millions of files, which is a massive overhead on the hard disk in many ways (unused storage per file, heavy fragmentation slowing down access, etc.).

So what is the correct way to do this in PyTorch, or are there other libraries that support this?

Have a look at this small example using a pd.DataFrame to read small chunks of data.
Let me know if that works for you.
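
In case the link goes stale, here is a rough sketch of that idea (not the linked example verbatim; the class name, the assumed header row, and the column layout are placeholders): pd.read_csv skips everything before the requested chunk and parses only chunksize rows.

```python
import pandas as pd
from torch.utils.data import Dataset

class TsvChunkDataset(Dataset):
    """Rough sketch: parse only a small chunk of the TSV per __getitem__."""

    def __init__(self, path, chunksize, nb_samples):
        self.path = path
        self.chunksize = chunksize
        self.len = nb_samples // chunksize  # number of chunks

    def __len__(self):
        return self.len

    def __getitem__(self, index):
        # skiprows jumps over an assumed header row plus all rows that come
        # before this chunk; nrows limits parsing to the chunk itself
        chunk = pd.read_csv(
            self.path,
            sep="\t",
            header=None,
            skiprows=index * self.chunksize + 1,
            nrows=self.chunksize,
        )
        labels = chunk.iloc[:, 0].values  # assumed column order: label, text
        texts = chunk.iloc[:, 1].values
        return labels, texts
```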

Thank you for that suggestion, but this approach would make the CSV parser read ALL the rows before the desired one every time. That means that over N rows there would be roughly N*N/2 full line parses in total, so it would be extremely slow.

I had been thinking that maybe it would be possible to skip directly to the correct position inside the file using seek, but I think this is very difficult to get right if the file can be in an arbitrary encoding, or even just UTF-8.

Oh, I didn’t know that. I assumed skiprows=index * self.chunksize + 1 would indeed skip the rows without loading them.

I am sorry; thinking about this again, I realize I only assumed this and do not actually know. I could not find any documentation about it, and when I tried to look at the pandas code it turned out that this parameter gets passed down to the C code of the parser, so they may do something clever there.
Thank you for the hint, maybe I can find out more!

As another update, I found that indexing a UTF-8 file is actually possible when the lines are read in binary mode as bytes. So this may be one way to do it, but I am still wondering whether anyone knows of a more “polished” or standard way to do this.

OK, I tested this with my file and tried to get the 10000th row (which is probably about 10 GB into the file).
Pandas with skiprows=9999, chunksize=1 took 67.771 seconds; the approach I have now implemented, which indexes the start-of-line byte offsets and then skips directly to the requested line, takes 0.016 seconds (not counting the indexing, which is only done once).
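
A minimal sketch of this offset-indexing approach (class and variable names are illustrative and the actual implementation may differ), assuming a UTF-8 TSV with the label in the first column and the document text in the second:

```python
from torch.utils.data import Dataset

class LineOffsetTsvDataset(Dataset):
    """Sketch: index start-of-line byte offsets once, then seek per item."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding
        # One sequential pass in binary mode records the byte offset at which
        # every line starts; this is done only once.
        self.offsets = []
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Seek directly to the start of the requested line and read it.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            line = f.readline()
        label, text = line.decode(self.encoding).rstrip("\n").split("\t", 1)
        return label, text
```

With this, __getitem__ is one seek plus one readline, so a standard DataLoader can batch and shuffle by index after the single indexing pass.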

I am interested in your solution. Would you be able to post your method of indexing and reading in bytes?

I would also appreciate it if you could share the ideas behind your solution.

Also, I found a pretty fast solution here, in case anyone has the same problem.

One solution is to build a list of byte offsets and construct sub-datasets based on those offsets. An example is here.

Another solution is the IterableDataset supported by torch.utils.data in PyTorch. A relevant example is here.
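
For completeness, a minimal sketch of the IterableDataset variant (class name and file name are placeholders), assuming one tab-separated (label, text) record per line; it streams the file sequentially instead of providing random access, so there is no index-based shuffling and num_workers > 0 needs per-worker sharding:

```python
from torch.utils.data import DataLoader, IterableDataset

class TsvIterableDataset(IterableDataset):
    """Sketch: stream (label, text) records from a TSV file."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        with open(self.path, encoding=self.encoding) as f:
            for line in f:
                # assumed column order: label, then document text
                label, text = line.rstrip("\n").split("\t", 1)
                yield label, text

# The DataLoader still yields batches (of 32 records here), streamed in file order.
loader = DataLoader(TsvIterableDataset("corpus.tsv"), batch_size=32)
```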