I am facing a practical issue using PyTorch to train over text data. My training data is about 100 GB and cannot be loaded into memory. I couldn't create a custom dataset over my training data because the implementations of `__getitem__` and `__len__` could be very time-intensive.
I was wondering if there is any practical solution for loading HUGE text data with PyTorch?
Does your dataset have a fixed size? If so, the `__len__` method should be quite fast, since you can just precompute the size beforehand.
Why is your `__getitem__` so slow? How do you usually use this dataset if you cannot access its elements?
Also, remember that the DataLoader will use multiple processes to load the data from your dataset, so even if getting each element is not blazing fast, it is OK. For example, the `ImageFolder` dataset reads each image individually from disk when it needs to be accessed, and that is fast enough not to slow down training.
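To make this concrete, here is a minimal sketch (not the poster's actual code) of a map-style dataset that precomputes the number of lines and the byte offset of each line once, so `__len__` is O(1) and `__getitem__` is a single `seek` + `readline` instead of a scan. The class name and file path are hypothetical.

```python
try:
    from torch.utils.data import Dataset
except ImportError:
    # Fallback so the sketch runs without torch installed; a map-style
    # Dataset only needs __len__ and __getitem__ anyway.
    Dataset = object


class LineDataset(Dataset):
    """Hypothetical dataset: one text line per item, read lazily from disk."""

    def __init__(self, path):
        self.path = path
        # One sequential pass to record where each line starts.
        # This index is small even for a huge file (one int per line).
        self.offsets = [0]
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(self.offsets[-1] + len(line))
        self.offsets.pop()  # the last entry is EOF, not a line start

    def __len__(self):
        # O(1): the size was precomputed in __init__.
        return len(self.offsets)

    def __getitem__(self, idx):
        # Random access: jump straight to the line's byte offset.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8").rstrip("\n")


# With multiple workers, each process does its own disk reads, which
# overlap with training on the main process:
# loader = DataLoader(LineDataset("train.txt"), batch_size=32, num_workers=4)
```

Opening the file inside `__getitem__` (rather than keeping one shared handle) keeps the dataset safe to use with `num_workers > 0`, since each worker process gets its own file handle.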
It could be a fixed size. I'm using `linecache` for `__getitem__`, so it needs to read all the lines the first time? As you describe it, the reading process could be random access rather than sequential access, right?