Hi, I have set the torch multiprocessing sharing strategy to file_system
for a multi-worker DataLoader. After a certain number of epochs (not the same on every run), one of the workers fails with an insufficient shared memory error. While monitoring /dev/shm
I notice an increase at the end of each epoch, so it seems some tensors are never freed. Is this a known issue? Or is there a way to identify where the memory leak is happening?
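One way to narrow down where the leak happens is to log /dev/shm usage at fixed points in the training loop (e.g. before and after each epoch, or around DataLoader creation/teardown) and see which step correlates with the growth. A minimal sketch, assuming a helper name `shm_usage_bytes` of my own invention:

```python
import os

def shm_usage_bytes(shm_dir="/dev/shm"):
    """Return the total size in bytes of regular files in shm_dir.

    Call this before and after each epoch; if the value grows
    monotonically across epochs, shared-memory segments from the
    DataLoader workers are not being released.
    """
    total = 0
    for entry in os.scandir(shm_dir):
        if entry.is_file(follow_symlinks=False):
            total += entry.stat(follow_symlinks=False).st_size
    return total
```

You could then print `shm_usage_bytes()` at the end of every epoch and diff the values, which at least tells you whether the growth comes from the epoch boundary (worker restart with `persistent_workers=False`) or from inside the epoch.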
I haven’t noticed a problem with leaks during a run, but I definitely see my machine leaking /dev/shm
between runs. Hundreds of gigabytes accumulate after a while and clog up the machine.
rm /dev/shm/*
will flush it out, but only do this when nothing is running, because it will likely corrupt jobs in progress.
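A slightly safer variant of the blanket `rm` is to delete only files matching the prefix the leaked segments actually use, rather than everything in /dev/shm. This sketch assumes the stale files share a recognizable prefix (check with `ls /dev/shm` first; the `torch_*` pattern and the `remove_stale_shm` name are illustrative, not an official API):

```python
import glob
import os

def remove_stale_shm(shm_dir="/dev/shm", pattern="torch_*"):
    """Delete leftover shared-memory files matching pattern.

    WARNING: only run this when no jobs are active; deleting a
    segment that a live worker still maps will break that job.
    """
    removed = []
    for path in glob.glob(os.path.join(shm_dir, pattern)):
        try:
            os.remove(path)
            removed.append(path)
        except OSError:
            pass  # already gone, or owned by another user
    return removed
```

This leaves shared-memory files created by unrelated programs (e.g. other users' jobs or system services) untouched, which matters on a shared machine.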