How to share memory for DataLoader when using multiprocessing?

I wrap my data with Dataset, then use DataLoader to enumerate over it. But because of the copy-on-write mechanism, my memory usage grows far higher than expected.
My problem can be simplified as follows:

import torch
from torch.utils.data import Dataset, DataLoader

class DataIter(Dataset):
    def __init__(self):
        # a large plain Python list (Python 2 range returns a list)
        self.data = range(90317731)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.Tensor([idx])

Then I use DataLoader with a for-loop to fetch the data.

train_data = DataIter()
train_loader = DataLoader(train_data, batch_size=64,
                          shuffle=True, num_workers=8)

for i, item in enumerate(train_loader):
    pass  # training step goes here

While it is running, I watch my memory usage (RAM, RSS). It costs about 20 GB of RSS because of copy-on-write in the worker subprocesses. How do I deal with this? self.data = range(90317731) should only cost about 2~3 GB as a Python list. I know that using NumPy reduces the symptom: it shrinks the size of train_data, so each subprocess copies less.
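For reference, this is the NumPy-backed variant I mean (a minimal sketch; the class name NumpyDataIter and the size parameter n are mine). A single contiguous array replaces ~90M individual Python int objects, and reading it in a worker does not touch per-element reference counts, so the copy-on-write pages stay shared with the parent process:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class NumpyDataIter(Dataset):
    def __init__(self, n=90317731):
        # one contiguous int64 buffer instead of a Python list of ints;
        # iterating it in a worker process never writes to these pages,
        # so copy-on-write leaves them shared with the parent
        self.data = np.arange(n, dtype=np.int64)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.Tensor([self.data[idx]])
```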

To summarize my problems:

  • How can I reduce the memory cost incurred by the subprocesses due to copy-on-write? Should I use a Manager or something else?

  • PyTorch has only considered shared memory for Tensors, not for the Dataset class. Am I right?
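By "Tensor shared memory" I mean something like this minimal sketch: share_memory_() moves a tensor's backing storage into shared memory, so worker processes map the same pages instead of copying them.

```python
import torch

# move the tensor's backing storage into shared memory; DataLoader
# worker processes then map the same pages rather than copying them
data = torch.arange(0, 8)
data.share_memory_()

print(data.is_shared())  # True
```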

I'm using Python 2.7.