I have a distributed code base I am trying to work with, but with every epoch my CPU memory increases almost linearly, eventually running into OOM even on a large 128 GB machine :((
Without distributed, the code runs fine with no such issues. The issue is exactly the one described here: CPU memory gradually leaks when num_workers > 0 in the DataLoader · Issue #13246 · pytorch/pytorch · GitHub
I do use num_workers=16, but the solution posted there (using pyarrow) does not solve my issue - I still have the memory leak. I am on Python 3.7 and PyTorch 1.9.0.
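For reference, this is roughly how I'm tracking the growth each epoch (psutil is only used here for measurement; `train_one_epoch`, `model`, `train_loader` and `num_epochs` are placeholders for my actual training code):

```python
import os
import psutil  # only used to read the process RSS, not part of training

proc = psutil.Process(os.getpid())
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)  # placeholder for my actual training step
    rss_gb = proc.memory_info().rss / 1024 ** 3
    print(f"epoch {epoch}: CPU RSS = {rss_gb:.2f} GB")  # climbs every epoch until OOM
```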
I have a custom Dataset, and it looks like this:
```python
from pathlib import Path

import pyarrow as pa
from torch.utils.data import Dataset
from torchvision import transforms


class TrainDataSet(Dataset):
    def __init__(self, data_root, mode, label_class=None):
        labels0 = []
        file_paths0 = []
        self.mode = mode
        data_path = Path(data_root)
        if self.mode == "train":
            data_path = data_path / self.mode
        else:
            raise ValueError("Mode not recognised")
        # ImageFolderWithPaths is my ImageFolder subclass that also returns the file path
        datasets = ImageFolderWithPaths(root=data_path)
        print(datasets.classes)
        print(datasets.class_to_idx)
        # sample is the image here and is not used!
        for sample, target, path in datasets:
            if target == label_class:
                labels0.append(target)
                file_paths0.append(path)
        # store as pyarrow arrays instead of Python lists, as suggested in the GitHub issue
        self.labels = pa.array(labels0)
        self.file_paths = pa.array(file_paths0)
        del labels0      # try to avoid leak?
        del file_paths0  # try to avoid leak?
        self.transform_color = transforms.Compose([
            transforms.ToTensor(),
            transforms.Resize(224),
            # transforms.CenterCrop(224),
            # transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        label = self.labels[idx].as_py()
        file_path = self.file_paths[idx].as_py()
        # read_img reads with OpenCV, to rule out PIL as the source of the leak
        img_rgb = read_img(file_path)
        return self.transform_color(img_rgb), label  # , file_path
```
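For completeness, the dataset is consumed roughly like this (the exact batch size and paths are from memory, so treat them as approximate; only num_workers=16 is the real value):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

train_dataset = TrainDataSet(data_root="/path/to/data", mode="train", label_class=0)
sampler = DistributedSampler(train_dataset)  # one process per GPU via torch.distributed
train_loader = DataLoader(
    train_dataset,
    batch_size=64,      # approximate; the exact value doesn't change the leak
    sampler=sampler,
    num_workers=16,
    pin_memory=True,
)
```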
I have totally run out of ideas now :(( and would love to hear from anyone who has suggestions.
UPDATE: I can also confirm that the model + the rest of the code work totally fine in distributed mode when I swap the dataset to CIFAR from the torchvision datasets and simply use ImageFolder on it, i.e. the CPU memory consumption stays constant. So, yeah, this seems like a dataloader bug.
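In case it's useful, the sanity check is essentially just this swap, with the training loop and DataLoader settings unchanged (paths and transform details are from memory, so approximate):

```python
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# only the dataset changes; everything else (sampler, loader, model) stays the same
train_dataset = datasets.CIFAR10(root="/path/to/cifar", train=True,
                                 download=True, transform=transform)
```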