I wrap my data in a Dataset and then iterate over it with a DataLoader. But because of the copy-on-write mechanism in the worker processes, memory usage grows far beyond what I expect.
My problem can be simplified as follows:
    import torch
    from torch.utils.data import Dataset, DataLoader

    class DataIter(Dataset):
        def __init__(self):
            self.data = range(90317731)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            # wrap the value in a list: torch.Tensor(n) would allocate an
            # uninitialized tensor of length n instead of a tensor holding n
            return torch.Tensor([self.data[idx]])
Then I use a DataLoader and a for-loop to fetch the data:
    train_data = DataIter()
    train_loader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=8)

    for i, item in enumerate(train_loader):
        print(i)
While it is running, I watch my memory usage (RAM, RSS). It climbs to about 20 GB of RSS because of copy-on-write in the worker subprocesses. How can I deal with this?
self.data = range(90317731) should only cost about 2-3 GB as a Python list. I know that using NumPy can reduce the symptom: it shrinks the size of train_data, so each subprocess copies less.
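For reference, this is roughly what I mean by using NumPy (a sketch only; NumpyDataIter is just a name I made up here): the 90317731 values are stored in one contiguous int64 array of roughly 700 MB instead of tens of millions of Python int objects, so the forked workers have far less to copy.

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class NumpyDataIter(Dataset):
        def __init__(self):
            # one contiguous buffer (~90M * 8 bytes ~ 700 MB) instead of a Python list
            self.data = np.arange(90317731, dtype=np.int64)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            return torch.Tensor([self.data[idx]])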
To summarize my problems:
How can I reduce the memory consumed by the subprocesses due to copy-on-write? Should I use a Manager or something else? (A rough sketch of what I mean by Manager is at the end of this post.)
PyTorch only seems to handle shared memory for Tensors, not for the data held by a Dataset. Am I right?
I’m using Python 2.7.
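To make the first question concrete, this is the kind of Manager-based approach I have in mind (only a sketch; ManagedDataIter is a hypothetical name, and I have not measured whether the proxy's per-item IPC overhead is acceptable at this scale): the list lives in the Manager's server process, and the DataLoader workers read it through a proxy instead of each holding a forked copy.

    import multiprocessing
    import torch
    from torch.utils.data import Dataset

    class ManagedDataIter(Dataset):
        def __init__(self):
            manager = multiprocessing.Manager()
            # the list is stored in the manager's server process;
            # worker processes access it through a proxy object
            self.data = manager.list(range(90317731))

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            # each access is an IPC round trip to the manager process
            return torch.Tensor([self.data[idx]])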