Massive memory leak when using distributed

I have a distributed code base I am trying to work with, but with every epoch I see that my CPU memory increases almost linearly, eventually running into OOM on a large 128GB machine :((

Without distributed, the code runs fine with no such issues. The issue is exactly described here: CPU memory gradually leaks when num_workers > 0 in the DataLoader · Issue #13246 · pytorch/pytorch · GitHub

I do use num_workers=16

but the solution posted there, using pyarrow, does not solve my issue - I still have the memory leak. I am on Python 3.7 with PyTorch 1.9.0.

I do have a custom dataset, and it looks like this:

from pathlib import Path

import pyarrow as pa
from torch.utils.data import Dataset
from torchvision import transforms

class TrainDataSet(Dataset):
    def __init__(self, data_root, mode, label_class=None):
        labels0 = []
        file_paths0 = []
        self.mode = mode
        data_path = Path(data_root)
        if self.mode == "train":
            data_path = data_path / self.mode
        else:
            raise ValueError("Mode not recognised")
        datasets = ImageFolderWithPaths(root=data_path)  # my ImageFolder subclass that also returns the file path

        print(datasets.classes)
        print(datasets.class_to_idx)
        # sample is the image here and is not used!
        for sample, target, path in datasets:
            if target == label_class:
                labels0.append(target)
                file_paths0.append(path)
        # keep metadata in pyarrow arrays instead of Python lists (the workaround from the GitHub issue)
        self.labels = pa.array(labels0)
        self.file_paths = pa.array(file_paths0)
        del labels0  # try to avoid leak?
        del file_paths0  # try to avoid leak?

        self.transform_color = transforms.Compose([
            transforms.ToTensor(),
            transforms.Resize(224),
            # transforms.CenterCrop(224),
            # transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        label = self.labels[idx].as_py()
        file_path = self.file_paths[idx].as_py()
        img_rgb = read_img(file_path)  # read with OpenCV to check whether PIL causes the leak
        return self.transform_color(img_rgb), label  # , file_path
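
For context, the dataset is plugged into the distributed run roughly as in the sketch below. This is a minimal reconstruction, not my actual training script: the backend, batch size, paths, and loop body are placeholders, and it assumes a standard DistributedSampler + DataLoader setup started with the usual torch.distributed launcher.

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Minimal sketch (assumed wiring, not the real script): the dataset goes
# through a DistributedSampler into a DataLoader with num_workers=16, and
# CPU RSS grows a little more with every epoch of this loop.
dist.init_process_group(backend="nccl")  # placeholder backend

dataset = TrainDataSet(data_root="./data", mode="train", label_class=0)
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                    num_workers=16, pin_memory=True)

for epoch in range(100):          # placeholder epoch count
    sampler.set_epoch(epoch)      # reshuffle differently each epoch
    for imgs, labels in loader:
        ...                       # forward/backward/step elided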

I have totally run out of ideas now :(( and would love to hear from anyone who has suggestions.

UPDATE: I can also confirm that the model + the rest of the code works totally fine in distributed mode when I swap the dataset to CIFAR from torch datasets and simply use ImageFolder on it, i.e. the CPU memory consumption stays constant. So, yeah, this seems like a dataloader bug :frowning:
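
For reference, the control swap looks roughly like this (a sketch using torchvision's built-in CIFAR10; I actually pointed ImageFolder at the CIFAR images, and everything else, including the sampler and num_workers=16, stayed identical):

from torchvision import datasets, transforms

# Control experiment sketch: only the dataset is swapped, the DataLoader /
# DistributedSampler settings stay the same. With this dataset the CPU
# memory stays flat across epochs; with TrainDataSet above it keeps growing.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize(224),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
cifar = datasets.CIFAR10(root="./cifar", train=True,
                         transform=transform, download=True)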

  1. Looks like it’s a known DataLoader bug, so all possible ideas are already mentioned in CPU memory gradually leaks when num_workers > 0 in the DataLoader · Issue #13246 · pytorch/pytorch · GitHub by the authors/people who are responsible for this code
  2. The DataLoader is not part of the distributed module, and, as you mentioned, you don’t have any issues with the distributed mode itself

The problem, @pbelevich, is that none of the solutions mentioned in that thread work! :cry: I was hoping the community here might have a solution.

The issue is DistributedSampler creating these Python lists of indices.

There was a fix, merged two months before this question was posted, but it somehow got reverted. Two+ years later it’s still the same code and the same issue.
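
For anyone who lands here later, this is the relevant spot, paraphrased from the PyTorch 1.9-era DistributedSampler.__iter__ (approximate, not a verbatim copy; check the current source for details). These are the Python lists referred to above: a fresh list of len(dataset) indices is built on every single epoch.

import torch

# Approximate paraphrase of DistributedSampler.__iter__ (PyTorch 1.9 era).
# Each epoch it materializes a plain Python list of indices for the whole
# dataset before slicing out this rank's share.
def __iter__(self):
    if self.shuffle:
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        indices = torch.randperm(len(self.dataset), generator=g).tolist()  # Python list
    else:
        indices = list(range(len(self.dataset)))                           # Python list
    # ... padding to self.total_size elided ...
    indices = indices[self.rank:self.total_size:self.num_replicas]         # this rank's share
    return iter(indices)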