Practical method for loading a HUGE dataset

Hi There,

I'm facing a practical issue using PyTorch to train on text data. My training data is about 100 GB and cannot be loaded into memory. I couldn't create a custom Dataset over my training data because the implementations of __getitem__ and __len__ could be very time-intensive.

I was wondering if there is any practical solution for loading HUGE text data with PyTorch?

Does your dataset have a fixed size? If so, __len__ should be quite fast if you simply precompute the size beforehand.
Why is your __getitem__ so slow? How do you usually use this dataset if you cannot access its elements?

Also remember that the DataLoader will use multiple worker processes to load the data in your dataset, so even if getting each element is not blazing fast, it is OK. For example, the ImageFolder dataset reads each image individually from disk when it needs to be accessed, and that is fast enough not to slow down training. A sketch of the same pattern for text follows below.
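Here is a minimal sketch of that idea, assuming a single file with one training example per line; the file name, batch size, and number of workers are placeholders, and the tokenization step is left out. __len__ is precomputed once by counting lines, __getitem__ reads a single line on demand, and the DataLoader's worker processes overlap the disk reads with training:

```python
import linecache
from torch.utils.data import Dataset, DataLoader

class LazyTextDataset(Dataset):
    """Map-style dataset that reads one line from disk per __getitem__."""

    def __init__(self, path):
        self.path = path
        # Precompute the number of lines once, so __len__ is O(1) afterwards.
        with open(path, 'rb') as f:
            self.num_lines = sum(1 for _ in f)

    def __len__(self):
        return self.num_lines

    def __getitem__(self, idx):
        # linecache line numbers are 1-based.
        line = linecache.getline(self.path, idx + 1).rstrip('\n')
        return line  # tokenize / convert to tensors here in a real pipeline

dataset = LazyTextDataset('train.txt')  # placeholder path
loader = DataLoader(dataset, batch_size=32,
                    shuffle=True, num_workers=4)  # workers overlap the disk I/O
for batch in loader:
    pass  # training step goes here
```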

It could be a fixed size. I'm using linecache for __getitem__, so does it need to read all the lines the first time? As you describe it, the reading process could be random access rather than sequential access, right?
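For reference, linecache does read and cache the entire file in memory the first time a line is requested, so it may not help much at 100 GB. One common alternative (just a hedged sketch here, not anything PyTorch-specific) is to build an index of byte offsets in a single sequential pass, then seek straight to the requested line in __getitem__, which gives true random access without keeping the file in memory:

```python
from torch.utils.data import Dataset

class OffsetIndexedTextDataset(Dataset):
    """Random access by seeking to precomputed byte offsets (one sample per line)."""

    def __init__(self, path):
        self.path = path
        self.offsets = [0]
        # One sequential pass to record where every line starts.
        with open(path, 'rb') as f:
            for line in f:
                self.offsets.append(self.offsets[-1] + len(line))
        self.offsets.pop()  # drop the offset past the last line

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Open per call so each DataLoader worker gets its own file handle.
        with open(self.path, 'rb') as f:
            f.seek(self.offsets[idx])  # jump straight to the line
            return f.readline().decode('utf-8').rstrip('\n')
```

The offset list is tiny compared to the data itself, and it can be saved to disk (e.g. pickled) so the indexing pass only ever runs once.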