I have 25 datasets (feature data in .mtx format, each containing a transposed scipy CSR sparse matrix, and label data in .txt format). When I convert the datasets to dense matrices, their total size exceeds my RAM (> 100 GB).
I want to use all 25 datasets as training data.
The first approach I could think of was to convert each .mtx file into a matrix file saved on disk, shuffle the order of the 25 matrix files so there is no ordering bias during training, and then load one file at a time.
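For reference, the approach described above could be sketched like this (a minimal illustration only; the file names, sizes, and number of files are hypothetical stand-ins for my real 25 datasets):

```python
import os
import tempfile
import numpy as np
import scipy.io
import scipy.sparse as sp

# Hypothetical setup: create 3 tiny .mtx files standing in for the real ones.
tmp = tempfile.mkdtemp()
rng = np.random.default_rng(0)
mtx_paths = []
for i in range(3):
    m = sp.random(4, 5, density=0.3, format="csr", random_state=i)
    path = os.path.join(tmp, f"data_{i}.mtx")
    scipy.io.mmwrite(path, m)
    mtx_paths.append(path)

# Step 1: convert each .mtx (stored transposed) to a dense .npy on disk.
npy_paths = []
for path in mtx_paths:
    dense = scipy.io.mmread(path).T.toarray()  # un-transpose and densify
    out = path.replace(".mtx", ".npy")
    np.save(out, dense)
    npy_paths.append(out)

# Step 2: shuffle the file order (e.g. once per epoch) to avoid ordering bias.
order = rng.permutation(len(npy_paths))

# Step 3: load one matrix at a time so only one dense matrix is in RAM at once.
for idx in order:
    X = np.load(npy_paths[idx])
    # ... train on X here ...
```

This works, but as noted it duplicates each dataset on disk in dense form.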
However, this method consumes additional disk space. Is there a better way to do this?