Configuring Dataset to load a very large file whilst supporting multiple workers

I’m having trouble creating a Dataset class that fully meets my specifications. I can imagine that others have also had this problem, but I have been unable to find a proper solution on the internet.

I am working with a very large (200GB+) data file which I want to dynamically load into a PyTorch Dataset class. Ideally, I need the following process to happen when __getitem__(idx) is called (a rough skeleton follows the list):

  1. By some mechanism, the idx’th line of the datafile (in my case jsonl) is loaded into memory.
  2. Some reasonably computationally expensive data augmentation is applied to the loaded data.
  3. The augmented data is encoded into a torch tensor and returned.
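In code, the skeleton I'm after looks roughly like this, where load_line, augment and encode are just placeholder hooks for steps 1-3 (step 1 is the part I don't know how to implement):

```python
from torch.utils.data import Dataset

class AugmentedJsonlDataset(Dataset):
    """Skeleton of the behaviour I want; how to implement load_line is the open question."""

    def __init__(self, load_line, augment, encode, num_lines):
        self.load_line = load_line   # idx -> dict parsed from the idx'th jsonl line (step 1)
        self.augment = augment       # dict -> augmented dict (step 2, the expensive part)
        self.encode = encode         # augmented dict -> torch tensor (step 3)
        self.num_lines = num_lines

    def __len__(self):
        return self.num_lines

    def __getitem__(self, idx):
        return self.encode(self.augment(self.load_line(idx)))
```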

Because of (2), and because I am ultimately using multiple GPUs to train my model, I need the Dataset to support multiple workers (in the DataLoader). I cannot overstate how huge the data files I'm working with are. I am trying to find a simple solution for (1) that ideally does not involve putting my data into a proper DB and streaming it from there. Here is what I have tried:

  1. Using an IterableDataset. On the surface this sounds like a good idea - I could directly use the readline method in the jsonlines package to create an iterator. The downside is that configuring this to work with multiple workers sounds like a real pain (each worker would be given its own instance of the iterator).

  2. Using mmap or something similar. One idea I had is to index the data file, recording the byte offset of the beginning of each line. I could then use mmap.seek and mmap.readline to load a particular line, looking up its offset in the index (a rough sketch of the mechanism follows this list). This would probably work, but I have never worked with mmap before.
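For concreteness, the mechanism I have in mind for option 2 looks roughly like this (untested; the file path is a placeholder):

```python
import json
import mmap

PATH = "data/train.jsonl"  # placeholder path

# One linear pass over the file to record the byte offset at which each line starts.
offsets = []
with open(PATH, "rb") as f:
    pos = 0
    for line in f:
        offsets.append(pos)
        pos += len(line)

# Random access to the idx'th line via mmap.seek + mmap.readline.
with open(PATH, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

idx = 12345
mm.seek(offsets[idx])
record = json.loads(mm.readline())
```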

Is the industry standard approach to use (2)? If so, are there any Python packages that will do the heavy lifting for me, or should I rely on mmap? I would have assumed that people working with big datasets in NLP have come across this problem.

Thanks in advance.

Is the industry standard approach to use (2)?

I think the most common approach is simply to break the file up into a larger number of smaller partitions. Then each worker can load from a different subset of the underlying files, along the lines of the sketch below.
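Sketching what I mean (untested): with an IterableDataset you can use get_worker_info() so that each worker only iterates over its own subset of the shard files:

```python
import glob
import json
from torch.utils.data import IterableDataset, get_worker_info

class ShardedJsonlDataset(IterableDataset):
    """Each DataLoader worker iterates over a disjoint subset of the shard files."""

    def __init__(self, pattern):
        self.paths = sorted(glob.glob(pattern))

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            paths = self.paths                             # single-process data loading
        else:
            paths = self.paths[info.id::info.num_workers]  # round-robin shards across workers
        for path in paths:
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

# e.g. dataset = ShardedJsonlDataset("data/shards/*.jsonl")
```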

I’d say option 2 is more standard.
With option 1, skipping to the last element of the dataset will be very slow.

I found the following relevant Stack Exchange post.

I’m going to implement it using mmap and jsonl.
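Something along these lines (untested; the augmentation and encoding are placeholders for my real pipeline, record["features"] is just an assumed field name, and the mmap is opened lazily so each DataLoader worker process creates its own):

```python
import json
import mmap
import torch
from torch.utils.data import Dataset

class MmapJsonlDataset(Dataset):
    def __init__(self, path):
        self.path = path
        self._mm = None
        # Index the byte offset of the start of every line (one linear pass at startup).
        self.offsets = []
        with open(path, "rb") as f:
            pos = 0
            for line in f:
                self.offsets.append(pos)
                pos += len(line)

    def _mmap(self):
        # Opened lazily on first use so every worker process maps the file itself.
        if self._mm is None:
            with open(self.path, "rb") as f:
                self._mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        return self._mm

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        mm = self._mmap()
        mm.seek(self.offsets[idx])
        record = json.loads(mm.readline())
        # Placeholder augmentation + encoding; record["features"] is an assumed field.
        features = [v + 0.01 for v in record["features"]]
        return torch.tensor(features, dtype=torch.float32)
```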