Hi, I noticed that while training a PyTorch model, the subprocesses started by the DataLoader workers accumulate memory over time as new batches are loaded, and this memory never seems to be released, ultimately resulting in a “DataLoader worker does not have sufficient shared memory” error. Could this be a memory leak, or is it a known bug?
The PyTorch version I am using is 2.0 for CUDA 11.7, in case this information helps.
Thanks a lot!
Could you check if you are running into this issue?
Yes, I already checked this issue; unfortunately, it didn’t help. What I notice is that even if I set num_workers=0, the shared memory of the process that runs the script keeps increasing until the shared-memory error appears.
Shared memory shouldn’t be used if no multiprocessing is needed in the DataLoaders. Are you manually sharing tensors somewhere in your code?
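For reference, the usual way a tensor ends up in shared memory explicitly is via Tensor.share_memory_(), which moves the tensor’s storage into shared memory (backed by /dev/shm on Linux). A quick way to check whether a tensor in your code is shared:

```python
import torch

t = torch.zeros(3)
print(t.is_shared())  # a freshly created CPU tensor is not shared

t.share_memory_()     # moves the storage into shared memory in place
print(t.is_shared())  # now True; this storage counts against /dev/shm
```

Searching your code for share_memory_() calls (or tensors passed through torch.multiprocessing, which shares them implicitly) can help rule this out.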
I am also observing a shared memory leak in the parent process. It keeps growing way beyond the sum of shm in the dataloader workers.
I also verified that it is not related to the dataloader.
OK, there seems to be a bug when running a whole script under a torch.inference_mode() context. Using inference mode only on the model calls works fine and the leak is gone. I’ll try to make a minimal reproducible example.
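To illustrate the two patterns being compared (the model and input below are toy placeholders, not the original training code): the leak was observed when the entire script, including data loading, ran inside inference_mode, while scoping the context to just the forward pass avoided it.

```python
import torch

model = torch.nn.Linear(4, 2)   # placeholder model for illustration
x = torch.randn(8, 4)           # placeholder batch

# Pattern that showed the leak (sketch): the whole loop, including
# DataLoader iteration, wrapped in one inference_mode context.
# with torch.inference_mode():
#     for batch in loader:
#         out = model(batch)

# Workaround: keep data loading outside and enter inference_mode
# only around the model call itself.
with torch.inference_mode():
    out = model(x)

print(out.requires_grad)  # inference-mode outputs carry no autograd state
```

The output tensor’s requires_grad is False either way; the difference is only in how much of the script runs under the context manager.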