Hi, I have set the torch multiprocessing sharing strategy to file_system
for a multi-worker DataLoader. After a certain number of epochs (not the same on every run), one of the workers fails with an insufficient shared memory error. While monitoring /dev/shm
I notice an increase at the end of each epoch, so it seems some tensors are never freed. Is this a known issue? Or is there a way to identify where the memory leak is happening?
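One way to narrow down where the leak happens is to log /dev/shm usage at fixed points in the training loop (e.g. before and after each epoch, or around DataLoader creation/teardown) and see which step correlates with the growth. A minimal sketch, assuming a helper name `shm_usage_bytes` of my own invention:

```python
import os

def shm_usage_bytes(shm_dir="/dev/shm"):
    """Return the total size in bytes of regular files in shm_dir.

    Call this before and after each epoch; if the value grows
    monotonically across epochs, shared-memory segments from the
    DataLoader workers are not being released.
    """
    total = 0
    for entry in os.scandir(shm_dir):
        if entry.is_file(follow_symlinks=False):
            total += entry.stat(follow_symlinks=False).st_size
    return total
```

You could then print `shm_usage_bytes()` at the end of every epoch and diff the values, which at least tells you whether the growth comes from the epoch boundary (worker restart with `persistent_workers=False`) or from inside the epoch.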
I haven’t noticed a problem with leaks during a run, but I definitely see my machine leaking /dev/shm
between runs. Hundreds of gigabytes accumulate after a while and clog up the machine.
rm /dev/shm/*
will flush it out, but only do this when nothing is running, because it will likely corrupt jobs in progress.
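A slightly safer variant of the blanket `rm` is to delete only files matching the prefix the leaked segments actually use, rather than everything in /dev/shm. This sketch assumes the stale files share a recognizable prefix (check with `ls /dev/shm` first; the `torch_*` pattern and the `remove_stale_shm` name are illustrative, not an official API):

```python
import glob
import os

def remove_stale_shm(shm_dir="/dev/shm", pattern="torch_*"):
    """Delete leftover shared-memory files matching pattern.

    WARNING: only run this when no jobs are active; deleting a
    segment that a live worker still maps will break that job.
    """
    removed = []
    for path in glob.glob(os.path.join(shm_dir, pattern)):
        try:
            os.remove(path)
            removed.append(path)
        except OSError:
            pass  # already gone, or owned by another user
    return removed
```

This leaves shared-memory files created by unrelated programs (e.g. other users' jobs or system services) untouched, which matters on a shared machine.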