I was looking at ConcatDataset too, but one of my questions is: does it support shuffling between datasets?
Let’s say I have two datasets, A and B.
Can data be shuffled from A to B, and from B to A? From the code (which I don’t understand 100%), it seems that data are only shuffled within their own dataset.
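For concreteness, here is a minimal sketch of the setup I’m asking about (the two TensorDatasets are just placeholders for A and B):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder datasets standing in for A and B.
dataset_a = TensorDataset(torch.arange(0, 100))
dataset_b = TensorDataset(torch.arange(100, 200))

combined = ConcatDataset([dataset_a, dataset_b])

# shuffle=True shuffles over the combined index range 0..199;
# the question is whether samples from A and B get interleaved
# in the resulting batches, or only shuffled within each dataset.
loader = DataLoader(combined, batch_size=8, shuffle=True)
first_batch, = next(iter(loader))
print(first_batch)
```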
I created one dataset for each file, and with only 3,000 files it isn’t that much to hold them all in an array (an object that keeps a reference to each).
If you wrote your Dataset with linecache, then it won’t read every file into memory.
At least, that is my observation after reading more files than my computer’s memory can hold.
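Something along these lines, as a minimal sketch (the filename and the per-line parsing are placeholders):

```python
import linecache
from torch.utils.data import Dataset

class LazyTextDataset(Dataset):
    """Fetches one CSV line per sample via linecache instead of
    reading the whole file into memory at construction time."""

    def __init__(self, filename):
        self.filename = filename
        # One initial pass to count lines; the samples themselves
        # are read lazily in __getitem__.
        with open(filename) as f:
            self.num_lines = sum(1 for _ in f)

    def __len__(self):
        return self.num_lines

    def __getitem__(self, idx):
        # linecache is 1-indexed; the read (and caching)
        # happens here, on demand.
        line = linecache.getline(self.filename, idx + 1)
        return line.rstrip("\n").split(",")
```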
I am going to be working with many files that can’t all fit into memory at the same time.
Since the data is text that I would like to load quickly, I was hoping to use either torch.load/pickle files or pandas/Parquet files.
However, it looks like ConcatDataset won’t work for my case, since with those formats there’s no way to load specific lines only, as with CSV files and linecache.getline, so I’d have to load all the files at once.
So now I’m thinking I’ll just load a fraction of the files at a time, as many as will fit in memory, and build a ConcatDataset from that chunk, perhaps using PyTorch multiprocessing to load the files in parallel.
I’ll probably also have to write additional shuffling code to shuffle among the chunks; a sketch of what I mean follows below.
Does this strategy sound like the best approach? Or are there additional PyTorch tools that could help me here?
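Roughly what I have in mind, as a sketch (the chunk size, the file list, and the load_file helper are all hypothetical, assuming each file was saved with torch.save):

```python
import random
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def load_file(path):
    # Hypothetical helper: load one preprocessed file
    # (saved with torch.save) into an in-memory dataset.
    return TensorDataset(torch.load(path))

def iterate_in_chunks(all_files, files_per_chunk, batch_size):
    files = list(all_files)
    random.shuffle(files)  # shuffle at the file level each epoch
    for start in range(0, len(files), files_per_chunk):
        # Only this chunk's files are held in memory at once.
        chunk = ConcatDataset(
            [load_file(f) for f in files[start:start + files_per_chunk]]
        )
        # shuffle=True shuffles within the current chunk only.
        loader = DataLoader(chunk, batch_size=batch_size, shuffle=True)
        yield from loader
```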
Sorry, I’m new to PyTorch. I want to check whether my guess is right:
each call to __getitem__ in LazyTextDataset performs one I/O operation.
For example, if my training dataset has 40,000 samples, it will perform 40,000 I/O operations.
Isn’t that a time-consuming way to read the training data?
I have been trying to use this method to go through my dataset of around one thousand CSV files, each containing about 300k lines (one training sample per line).
This method leads to a slow buildup of CPU memory as I iterate through the batches, until it crashes.
I have tried calling linecache.clearcache() at the end of each batch, but it isn’t clearing up the memory.
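For reference, this is roughly what I’m doing, simplified to a sketch (the file name, line count, and batch handling are placeholders):

```python
import linecache
from torch.utils.data import DataLoader, Dataset

class LineDataset(Dataset):
    # Simplified stand-in for the linecache-backed CSV dataset.
    def __init__(self, filename, num_lines):
        self.filename, self.num_lines = filename, num_lines

    def __len__(self):
        return self.num_lines

    def __getitem__(self, idx):
        return linecache.getline(self.filename, idx + 1)

# num_workers=0 here; with worker processes, each worker keeps
# its own linecache, so clearing it in the main loop would not
# reach their caches.
loader = DataLoader(LineDataset("data.csv", 300_000),
                    batch_size=64, num_workers=0)

for batch in loader:
    ...  # training step goes here
    linecache.clearcache()  # this is the call that isn't freeing memory
```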