Working with large tabular data


I’m working with a large tabular dataset of 20M samples and 25K features. The data is stored in 102 .mtx files (~250GB total). I would like to train a score-based generative model on this data. I have 64GB of RAM and a 4060 GPU with 16GB of VRAM. What would be a good approach to preparing the dataset? My idea is to split the samples into batches of 1000 and save each batch to its own file. Thanks in advance for any ideas, suggestions, or discussion.
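To make the proposed approach concrete, here is a minimal sketch of the batching idea, assuming the data can be read with `scipy.io.mmread`: each .mtx file is converted to CSR (so row slicing is cheap), split into fixed-size row batches, and each batch is saved as a compressed `.npz` shard that a training loop could later load lazily. The function name `shard_mtx_file`, the shard naming scheme, and the demo file `demo.mtx` are all hypothetical, not part of the original question.

```python
import os

import scipy.io
import scipy.sparse

BATCH = 1000        # rows per shard, as proposed in the question
DST_DIR = "shards"  # hypothetical output directory


def shard_mtx_file(path, dst_dir, batch=BATCH, start_id=0):
    """Split one .mtx file into fixed-size CSR row batches saved as .npz."""
    # mmread returns a COO matrix; CSR makes row slicing cheap.
    # Note: .mtx is a text format, so this read is slow for huge files.
    mat = scipy.io.mmread(path).tocsr()
    shard_id = start_id
    for start in range(0, mat.shape[0], batch):
        scipy.sparse.save_npz(
            os.path.join(dst_dir, f"shard_{shard_id:06d}.npz"),
            mat[start:start + batch],
        )
        shard_id += 1
    return shard_id  # next free shard id, so files across .mtx parts don't clash


# Tiny self-contained demo with a synthetic sparse matrix standing in
# for one of the real .mtx parts.
os.makedirs(DST_DIR, exist_ok=True)
demo = scipy.sparse.random(2500, 50, density=0.01, format="coo", random_state=0)
scipy.io.mmwrite("demo.mtx", demo)
n_shards = shard_mtx_file("demo.mtx", DST_DIR)
print(n_shards)  # 2500 rows at 1000 per shard -> 3 shards
```

One caveat with this sketch: `mmread` materializes an entire .mtx part in RAM, so with ~2.5GB per file and 64GB of RAM the files should be processed one at a time. Once the shards exist, a training loop only ever needs one 1000-row batch in memory, which fits comfortably alongside the model on a 16GB GPU.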