Torchtext - dataset from multiple files

I’ve been looking at the torchtext library, which seems like it is very useful. If I understood the example datasets correctly (which is likely not the case!), all of them are loaded completely into memory before being used. Is there any support (or planned support) for loading data from files only as they are required?


You can implement a custom dataset that does lazy loading by writing the data loader as a Python generator (with yield) and using an iterator that doesn’t sort or shuffle the data except within small buckets. That way the iterator will request a bucket worth of data, the loader will load it into memory, and the iterator will sort/shuffle it without loading the whole dataset.
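
Here is a minimal sketch of that pattern in plain Python rather than torchtext’s built-in classes; the file pattern, tokenisation, bucket size, and batch size are all placeholder assumptions, and you would plug the resulting batches into your own field/vocab and training code.

```python
import glob
import random

def lazy_examples(pattern):
    """Yield one example at a time, opening files only as they are needed."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line.split()  # whitespace tokenisation as a stand-in

def _batches_from_bucket(bucket, batch_size):
    """Sort one in-memory bucket by length, cut it into batches, shuffle the batches."""
    bucket.sort(key=len)
    batches = [bucket[i:i + batch_size] for i in range(0, len(bucket), batch_size)]
    random.shuffle(batches)
    yield from batches

def bucketed_batches(examples, batch_size=32, bucket_size=32 * 100):
    """Pull one bucket worth of examples into memory at a time, sort/shuffle
    only within that bucket, and emit batches; the rest stays on disk."""
    bucket = []
    for ex in examples:
        bucket.append(ex)
        if len(bucket) == bucket_size:
            yield from _batches_from_bucket(bucket, batch_size)
            bucket = []
    if bucket:
        yield from _batches_from_bucket(bucket, batch_size)

# Usage: iterate without ever holding the full dataset in memory.
for batch in bucketed_batches(lazy_examples("data/train-*.txt")):
    pass  # numericalise, pad, and build tensors for the training step here
```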

Have there been any additions to torchtext that do lazy loading?
