PyTorch DataLoader Memory Leak

Hi, I noticed that while training a PyTorch model, the subprocesses started by the DataLoader workers accumulate memory over time while loading new batches, and this memory never seems to be released, ultimately resulting in a “dataloader worker does not have sufficient shared memory” error. Could this be a memory leak, or is it a known bug?
The PyTorch version I am using is 2.0 for CUDA 11.7, in case that helps.

Thanks a lot!

Could you check if you are running into this issue?

Yes, I already checked that issue, but unfortunately it didn’t help. What I notice is that even if I set num_workers = 0, the shared memory of the process running the script keeps increasing until the shared memory error appears.

Shared memory shouldn’t be used if no multiprocessing is needed in the DataLoader. Are you manually sharing tensors somewhere in your code?
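If it helps, you can check whether the tensors you are holding on to actually live in shared memory with Tensor.is_shared(); share_memory_() is what moves a tensor’s storage there (and is what DataLoader workers do under the hood when sending batches back). A minimal sketch:

import torch

t = torch.randn(4, 4)
print(t.is_shared())   # False: ordinary process memory

t.share_memory_()      # moves the storage into a shared-memory segment
print(t.is_shared())   # True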


I am also observing a shared memory leak in the parent process. It keeps growing way beyond the sum of the shared memory used by the DataLoader workers.
[Screenshot: shared memory usage, 2023-11-07]

I also verified that it is not related to the dataloader.
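For reference, this is roughly how I was tracking the shared memory of each process (a sketch using psutil, whose memory_info() reports a shared field on Linux):

import os
import psutil

proc = psutil.Process(os.getpid())
shared_mb = proc.memory_info().shared / 1e6  # Linux-only field
print(f"shared memory of PID {proc.pid}: {shared_mb:.1f} MB")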

OK, there seems to be a bug when running a whole script under a torch.inference_mode() context. Using inference mode only around the model calls works fine and the leak is gone. I’ll try to make a minimal reproducible example.
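Just to illustrate the difference in scoping I mean (sketch with a dummy model and loader, not the actual repro):

import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(8, 2)
val_loader = DataLoader(TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,))), batch_size=8)

# Pattern where I saw shared memory keep growing: the whole loop under inference mode.
# with torch.inference_mode():
#     for batch_in, batch_tgt in val_loader:
#         out = model(batch_in)

# Pattern that works for me: enter inference mode only around the forward pass.
for batch_in, batch_tgt in val_loader:
    with torch.inference_mode():
        out = model(batch_in)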


I think I am encountering the same problem: my eval() function leaks memory.

The eval() function is wrapped with torch.no_grad() and the model is set to model.eval().
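Roughly this structure (a sketch with dummy names, my real code is larger):

import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(8, 2)
loader = DataLoader(TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,))), batch_size=8)

@torch.no_grad()
def evaluate():
    model.eval()
    outputs = []
    for batch_in, batch_tgt in loader:
        outputs.append(model(batch_in))
    return outputs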

Is this the same issue? Could you point me to what might be going wrong?

Hi! In my case, the problem was that I was saving my targets straight from the DataLoader, i.e. tensors that are still stored in shared memory.

Moving that data to the GPU, or cloning it so it no longer lives in shared memory, should solve the issue:

all_predictions = []
all_targets = []
for batch_in, batch_tgt in val_loader:
    batch_in = batch_in.to(device, non_blocking=True)
    # clone() copies the targets out of the DataLoader's shared memory
    batch_tgt = batch_tgt.clone()  # <<<--- do this

    with torch.inference_mode():
        batch_prediction = model(batch_in)

    all_predictions.extend(batch_prediction)
    all_targets.extend(batch_tgt)
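The reason the clone() helps: when the DataLoader uses worker processes, the batches it returns are backed by shared-memory segments, so holding references to those tensors keeps the segments alive. clone() copies the data into regular process memory, and the shared block can then be released once the batch goes out of scope.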