I have a large DataFrame dataset with float columns and columns that hold lists of ints. To reduce RAM usage I stored it in 10 parquet files and read them one at a time during training. But each time my dataset class loads the next parquet file into my “data” variable, RAM usage increases slightly. This goes on and on: in htop, RAM usage only ever grows. Is there any way to fix it? I'm also open to advice on how to handle loading the initial large dataset into RAM.
IIUC, you can load each parquet file lazily in the __getitem__ function of your dataset class.
Perhaps I did not explain it clearly. As far as I understand, I'm already doing everything lazily: on every access I check which file (partition) the index belongs to. If that parquet file is not in memory (i.e. I'm still holding the previous one), I read it.
Solved this issue by replacing the pandas DataFrame with a pyarrow Table. It takes 60% less RAM for me, but reading row values is not as fast. My workaround is to convert the columns to NumPy arrays. Unfortunately, that takes as much RAM as it did with pandas.
But my main problem with memory accumulation is gone.