Using Dataset to iterate over lines of large files


I have a folder of 500 text files, each containing about 200 million lines, so it isn't wise to load them all into memory at once.

I want to fetch a batch of 64 lines per iteration, simply in the order they appear in the files, but I haven't found an efficient way to do this in PyTorch.

It seems to me that, in order to support random sampling, data.Dataset requires implementing __getitem__ and __len__. These two methods are unnecessary in my case and hold me back from creating my customized Dataset.

I came up with one solution: split the files into smaller ones, create a Dataset for each, and use ConcatDataset to iterate over them. But it doesn't seem graceful enough.

Any idea on how to do this just like in TensorFlow?

I would recommend a Dataset along the following lines:

  1. In __init__, scan the files and compute the full data length; this is required by __len__ for the iterator.
    Don't confuse this with the amount of data actually held in memory.
    I would go ahead and cache more than 64 lines per read to minimize disk/file access, e.g. read 4096 lines each time and store them in a buffer. The dataset size is the overall line count, and we maintain a pointer modulo 4096 that iterates until the last chunk.

  2. In __getitem__: if the requested line is still in the current chunk, directly return what you already have and increment the local pointer.
    If the local index reaches the end of the buffer (64 × 64 = 4096), load the next 4096 lines.

Now, tune 4096 (the number of lines to cache) depending on how much RAM you want to spend on the buffer (it's a tradeoff). You will also have to maintain a list of file names to iterate over.
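The two steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class and helper names are made up, and it assumes sequential access (shuffle=False) — with random access, nearly every lookup would miss the buffer and reload a chunk.

```python
import bisect

# Fall back to a plain class if torch isn't installed; with torch,
# this subclasses torch.utils.data.Dataset.
try:
    from torch.utils.data import Dataset
except ImportError:
    Dataset = object

class ChunkedLineDataset(Dataset):
    """Keeps only one chunk of `chunk_size` lines in memory at a time."""

    def __init__(self, file_paths, chunk_size=4096):
        self.file_paths = list(file_paths)
        self.chunk_size = chunk_size
        # Step 1: count lines per file once -- __len__ must report the
        # full data length, which is NOT the amount held in memory.
        self.counts = []
        for path in self.file_paths:
            with open(path) as f:
                self.counts.append(sum(1 for _ in f))
        # Cumulative offsets map a global index to (file, local line).
        self.offsets = [0]
        for c in self.counts:
            self.offsets.append(self.offsets[-1] + c)
        self._cache = []          # the currently buffered chunk
        self._cache_start = None  # global index of the first cached line

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, idx):
        # Step 2: serve from the buffer when possible; on a miss,
        # load the chunk that contains idx.
        in_cache = (self._cache_start is not None
                    and self._cache_start <= idx
                    < self._cache_start + len(self._cache))
        if not in_cache:
            self._load_chunk(idx)
        return self._cache[idx - self._cache_start]

    def _load_chunk(self, idx):
        file_idx = bisect.bisect_right(self.offsets, idx) - 1
        local = idx - self.offsets[file_idx]
        start = (local // self.chunk_size) * self.chunk_size
        # Re-open and skip to the chunk start; keeping the file handle
        # open between calls would be faster, but this keeps it short.
        with open(self.file_paths[file_idx]) as f:
            for _ in range(start):
                f.readline()
            n = min(self.chunk_size, self.counts[file_idx] - start)
            self._cache = [f.readline().rstrip("\n") for _ in range(n)]
        self._cache_start = self.offsets[file_idx] + start
```

Wrapping this in `DataLoader(dataset, batch_size=64, shuffle=False)` then yields 64 lines per iteration while touching the disk only once per 4096 lines.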


Refer to the thread How to use dataset larger than memory?. The ChunkDataset API might help you.
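If your PyTorch version is 1.2 or newer, `torch.utils.data.IterableDataset` supports this streaming pattern directly, with no `__getitem__` or `__len__` at all. A minimal sketch (class name is illustrative):

```python
# Sketch assuming PyTorch >= 1.2, which added IterableDataset.
try:
    from torch.utils.data import IterableDataset
except ImportError:
    IterableDataset = object  # fallback so the sketch runs without torch

class LineStreamDataset(IterableDataset):
    """Streams lines file by file, in order, one line in memory at a time."""

    def __init__(self, file_paths):
        self.file_paths = list(file_paths)

    def __iter__(self):
        for path in self.file_paths:
            with open(path) as f:
                for line in f:
                    yield line.rstrip("\n")
```

Wrapping it in `DataLoader(dataset, batch_size=64)` yields batches of 64 lines in file order. Note that random shuffling is not supported with this pattern, and with `num_workers > 0` each worker would need its own shard of the file list to avoid duplicated lines.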