Hi all,
I have a problem with reading large training data from disk.
I have a single huge file (roughly 100 GB) in which each line is a training example.
Clearly I cannot load it all into memory.
One possible solution I found while searching is to split this huge file into small files and use Dataset and DataLoader, as mentioned here: Loading huge data functionality
However, with this approach the total number of training examples (the return value of __len__ in the Dataset) becomes the number of small files, not the true number of training examples. DataLoader sees each small file as one example, which is weird.
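Roughly, what I have now looks like the sketch below (the class name, directory layout, and per-file handling are just placeholders to show the problem):

```python
from torch.utils.data import Dataset

class ChunkFileDataset(Dataset):
    """Sketch of the split-file approach: each small file is treated as one item."""

    def __init__(self, file_paths):
        # file_paths: list of paths to the small files produced by splitting
        self.file_paths = file_paths

    def __len__(self):
        # This is the problem: the length is the number of small files,
        # not the number of training examples (lines) they contain.
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Each __getitem__ returns a whole file's worth of lines,
        # so DataLoader sees one "example" per file.
        with open(self.file_paths[idx]) as f:
            return f.read().splitlines()
```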
Can anyone suggest an elegant solution for this situation? Perhaps something in a producer/consumer style?
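By producer/consumer I mean something roughly like the sketch below (the file name, queue size, and parsing are placeholders; I haven't figured out how to hook this into DataLoader cleanly):

```python
import threading
import queue

def producer(path, q, sentinel=None):
    # Read the huge file line by line and push examples into a bounded queue.
    with open(path) as f:
        for line in f:
            q.put(line.rstrip("\n"))
    q.put(sentinel)  # signal that the file is exhausted

def consumer(q, sentinel=None):
    # Pull examples off the queue and feed them to training.
    while True:
        example = q.get()
        if example is sentinel:
            break
        # ... parse the line and run a training step here ...

q = queue.Queue(maxsize=10000)  # bound memory usage
t = threading.Thread(target=producer, args=("train.txt", q), daemon=True)
t.start()
consumer(q)
```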