I am facing a practical issue using PyTorch to train over text data. My training data is about 100 GB and cannot be loaded into memory. I couldn't create a custom dataset over my training data because the implementations of `__getitem__` and `__len__` could be very time-intensive.
I was wondering if there is any practical solution for loading HUGE text data with PyTorch?
Does your dataset have a fixed size? If so, the `__len__` method should be quite fast, since you can just precompute the size beforehand.
Why is your `__getitem__` so slow? How do you usually use this dataset if you cannot access its elements?
Also, remember that the DataLoader will use multiple processes to load the data from your dataset, so even if getting each element is not blazing fast, it is OK. For example, the `ImageFolder` dataset reads each image individually from disk when it needs to be accessed, and that is fast enough not to slow down training.
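To make this concrete, here is a minimal sketch (not the poster's actual code) of a map-style dataset that precomputes the number of lines and the byte offset of each line once, so `__len__` is O(1) and `__getitem__` is a single `seek` + `readline` instead of a scan. The class name and file path are hypothetical.

```python
try:
    from torch.utils.data import Dataset
except ImportError:
    # Fallback so the sketch runs without torch installed; a map-style
    # Dataset only needs __len__ and __getitem__ anyway.
    Dataset = object


class LineDataset(Dataset):
    """Hypothetical dataset: one text line per item, read lazily from disk."""

    def __init__(self, path):
        self.path = path
        # One sequential pass to record where each line starts.
        # This index is small even for a huge file (one int per line).
        self.offsets = [0]
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(self.offsets[-1] + len(line))
        self.offsets.pop()  # the last entry is EOF, not a line start

    def __len__(self):
        # O(1): the size was precomputed in __init__.
        return len(self.offsets)

    def __getitem__(self, idx):
        # Random access: jump straight to the line's byte offset.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8").rstrip("\n")


# With multiple workers, each process does its own disk reads, which
# overlap with training on the main process:
# loader = DataLoader(LineDataset("train.txt"), batch_size=32, num_workers=4)
```

Opening the file inside `__getitem__` (rather than keeping one shared handle) keeps the dataset safe to use with `num_workers > 0`, since each worker process gets its own file handle.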
It could be a fixed size. I'm using `linecache` for `__getitem__`, so it needs to read all the lines the first time? As you describe it, the reading process could be random access rather than sequential access, right?