Debugging memory (RAM) buildup during training

kheyer · October 3, 2022, 9:57pm

Looking for help debugging RAM buildup during training.

Current situation:
1. My dataset is a very large CSV.
2. I load 5 million rows at a time.
3. I create a new dataset and dataloader from those rows.
4. I train on the dataloader.
5. After going through the chunk, I delete the loaded dataframe, the dataset and the dataloader, and run garbage collection.
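
A minimal sketch of the loop structure described above, using only the standard library so it runs anywhere. The names `train_over_csv` and `train_on_chunk` are hypothetical, and `train_on_chunk` stands in for the "build dataset and dataloader, then train" step; the actual code loads a dataframe rather than a list of dicts.

```python
import csv
import gc
from itertools import islice


def train_over_csv(path, chunk_rows, train_on_chunk):
    """Read a CSV in fixed-size chunks, train on each, then drop the chunk.

    train_on_chunk is a placeholder callable that receives a list of row dicts.
    Returns the total number of rows processed.
    """
    rows_seen = 0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            # Pull at most chunk_rows rows; an empty list means end of file.
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                break
            train_on_chunk(chunk)
            rows_seen += len(chunk)
            # Drop the only reference held here, then collect reference cycles.
            del chunk
            gc.collect()
    return rows_seen
```

Note that `del` plus `gc.collect()` only frees objects that nothing else still references, so anything `train_on_chunk` stores elsewhere survives the cleanup.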

System memory usage starts at ~5GB and, after a day and a half, has risen to 90GB of the 128GB available. I’d like this training loop to run for several days, so I’m concerned about the memory buildup.

I’m aware of previous posts on the topic of dataloader memory buildup. I’ve tried storing my main inputs (a list of strings) as a numpy array instead of a Python list, and I’m still seeing the buildup.

I’d also like to understand why the buildup happens at all, given that I explicitly delete the dataframe, dataset and dataloader after each 5 million row chunk. Each time I do, there’s a noticeable memory release in the system, but the “reset” memory usage is still higher than the original memory usage, and that difference accumulates into the gradual buildup.
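
One way to put numbers on that creeping "reset" level is to record memory right after each cleanup. The sketch below uses the standard library's `tracemalloc`; the function name `traced_baselines` and the `process_chunk` callable are hypothetical. An important caveat: `tracemalloc` only sees allocations made through Python's memory allocators, so growth in native memory held by extension libraries may not show up here even if system memory keeps rising.

```python
import gc
import tracemalloc


def traced_baselines(process_chunk, n_chunks):
    """Call process_chunk(i) for each chunk index and record the number of
    bytes tracemalloc still considers allocated after garbage collection."""
    tracemalloc.start()
    baselines = []
    try:
        for i in range(n_chunks):
            process_chunk(i)
            gc.collect()
            # get_traced_memory() returns (current, peak) in bytes.
            current, _peak = tracemalloc.get_traced_memory()
            baselines.append(current)
    finally:
        tracemalloc.stop()
    return baselines
```

If the returned values climb from chunk to chunk, some Python object is surviving the cleanup, and `tracemalloc.take_snapshot()` with `Snapshot.compare_to()` can point to the allocating lines. If they stay flat while system memory still grows, the retained memory is more likely outside Python's allocator or in another process.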

Is there a persistent multiprocessing Pool object that survives between dataloaders? I know those won’t release memory until the pool is explicitly closed.
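
A quick check for lingering workers, sketched under the assumption that the dataloader starts its workers through Python's `multiprocessing` machinery from the training process. The helper name `live_children` is hypothetical; `multiprocessing.active_children()` is the standard library call.

```python
import multiprocessing as mp


def live_children():
    """Return (name, pid) for each child process of this interpreter that is
    still alive. active_children() also joins children that already exited."""
    return [(p.name, p.pid) for p in mp.active_children()]
```

Calling this right after deleting the dataloader and running garbage collection should return an empty list if every worker has shut down. A non-empty list would mean some child process, and whatever memory it holds, has outlived the dataloader.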

Any thoughts on addressing this issue?