I have hundreds of CSV files, each containing hundreds of megabytes of data. To create a class that inherits from PyTorch's `Dataset`, the `__getitem__` method must access a single sample at a time, where the `i` parameter of the function indicates the index of the sample. However, to perform lazy loading, my class just saves the name of each file instead of loading all the data into memory.
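For context, here is a minimal skeleton of the class so far; the glob pattern and class name are just placeholders:

```python
import glob

from torch.utils.data import Dataset


class LazyCsvDataset(Dataset):
    """Lazy loading: store only the CSV file names, never the full data."""

    def __init__(self, pattern="data/*.csv"):  # hypothetical path pattern
        self.csv_paths = sorted(glob.glob(pattern))
```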
So far so good, but I'm having doubts about how to access a single sample in the `__getitem__` method. In my searches I found the following strategies:

1. In `__init__`, save the number of samples each file has; then in `__getitem__`, load the corresponding file with pandas and access the respective row. The index received as a parameter tells us which file to access. For example, with 2 files of 100 samples each, index 100 would be the first sample of the second file, index 101 the second sample of the second file, and so on. However, this approach has an obvious problem: the number of I/O operations needed.
2. Perform the same process as in 1, but keep loaded files in a cache as long as they fit in memory. This seems better than the first approach, since it reduces the number of I/O operations once some files are in memory. The problem here is how to carry out the caching policy. I thought of using something like `dask` to handle this, but `iloc` is quite inefficient according to its documentation. (A sketch of both strategies combined follows below.)
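To make the question concrete, here is a rough sketch that fills in the skeleton above, implementing strategy 1 (a global-index-to-file mapping) with strategy 2 layered on top via `functools.lru_cache`; the LRU policy and `maxsize=8` are just one possibility I'm considering, and the row counting assumes each CSV has a single header line:

```python
import bisect
import glob
from functools import lru_cache

import pandas as pd
from torch.utils.data import Dataset


class LazyCsvDataset(Dataset):
    def __init__(self, pattern="data/*.csv"):  # hypothetical path pattern
        self.csv_paths = sorted(glob.glob(pattern))
        # Strategy 1: count samples per file once, up front
        # (assumes each file has exactly one header line).
        lengths = []
        for p in self.csv_paths:
            with open(p) as f:
                lengths.append(sum(1 for _ in f) - 1)
        # Cumulative counts map a global index to a (file, row) pair.
        self.cum_lengths = []
        total = 0
        for n in lengths:
            total += n
            self.cum_lengths.append(total)

    def __len__(self):
        return self.cum_lengths[-1]

    # Strategy 2: keep the most recently used files in memory;
    # maxsize bounds how many whole DataFrames are cached at once.
    @lru_cache(maxsize=8)
    def _load_file(self, file_idx):
        return pd.read_csv(self.csv_paths[file_idx])

    def __getitem__(self, i):
        # Binary search for the file containing global index i,
        # e.g. with two 100-row files, i=100 -> second file, row 0.
        file_idx = bisect.bisect_right(self.cum_lengths, i)
        row_idx = i if file_idx == 0 else i - self.cum_lengths[file_idx - 1]
        return self._load_file(file_idx).iloc[row_idx].to_numpy()
```

With this, the caching-policy question from strategy 2 reduces to picking `maxsize` (or replacing LRU with something smarter), which is exactly the part I'm unsure about.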
Does anyone have a better idea, or a pointer to some relevant material?