Hi all,
I have a problem with reading large training data from disk.
I have a single huge file (roughly 100 GB) in which each line is a training example.
Clearly I cannot load it all into memory.
One possible solution I found while searching is to split this huge file into small files and use Dataset and DataLoader, as mentioned here: Loading huge data functionality
However, with this approach the total number of training examples (the return value of __len__ in the Dataset) becomes the number of small files, not the true number of training examples. DataLoader sees each small file as one example, which is weird.
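Roughly, what I have now looks like the sketch below (the class name, directory layout, and per-file handling are just placeholders to show the problem):

```python
from torch.utils.data import Dataset

class ChunkFileDataset(Dataset):
    """Sketch of the split-file approach: each small file is treated as one item."""

    def __init__(self, file_paths):
        # file_paths: list of paths to the small files produced by splitting
        self.file_paths = file_paths

    def __len__(self):
        # This is the problem: the length is the number of small files,
        # not the number of training examples (lines) they contain.
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Each __getitem__ returns a whole file's worth of lines,
        # so DataLoader sees one "example" per file.
        with open(self.file_paths[idx]) as f:
            return f.read().splitlines()
```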
Can anyone suggest an elegant solution for this situation? Perhaps something in a producer/consumer style?
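By producer/consumer I mean something roughly like the sketch below (the file name, queue size, and parsing are placeholders; I haven't figured out how to hook this into DataLoader cleanly):

```python
import threading
import queue

def producer(path, q, sentinel=None):
    # Read the huge file line by line and push examples into a bounded queue.
    with open(path) as f:
        for line in f:
            q.put(line.rstrip("\n"))
    q.put(sentinel)  # signal that the file is exhausted

def consumer(q, sentinel=None):
    # Pull examples off the queue and feed them to training.
    while True:
        example = q.get()
        if example is sentinel:
            break
        # ... parse the line and run a training step here ...

q = queue.Queue(maxsize=10000)  # bound memory usage
t = threading.Thread(target=producer, args=("train.txt", q), daemon=True)
t.start()
consumer(q)
```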