I am using DataLoader to load my training data.
According to the documentation, we can set num_workers to choose the number of subprocesses and speed up loading.
However, when I use it that way, I find that the memory usage keeps growing, which does not happen when I set num_workers=0.
Is this the expected behavior? I have limited memory resources, so I don't want the memory usage to keep growing.
P.S. My code runs on the GPU; on every iteration I move a batch of data from CPU to GPU.
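Roughly, my usage looks like the sketch below; the dataset, batch size, and worker count here are just placeholders for my real setup.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; my real Dataset is much larger.
train_data = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))

# num_workers > 0 starts subprocesses that load batches in the background.
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=4)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for inputs, targets in train_loader:
    # Each batch is assembled on the CPU and then copied to the GPU.
    inputs, targets = inputs.to(device), targets.to(device)
    # ... training step goes here ...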
I think that when using multiple workers, each of them loads its data on the CPU before it is sent to the GPU, which increases the amount of data held in memory. It also creates a new process per worker; a quick way to see that is the sketch below.
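A minimal sketch (the class name is made up) that prints the worker id and process id inside __getitem__, showing that each worker is its own process:

import os
from torch.utils.data import Dataset, DataLoader, get_worker_info

class WhoLoadsMe(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        info = get_worker_info()  # None in the main process
        worker_id = info.id if info is not None else "main"
        print(f"sample {idx} loaded by worker {worker_id} (pid {os.getpid()})")
        return idx

loader = DataLoader(WhoLoadsMe(), batch_size=2, num_workers=2)
for batch in loader:
    pass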
I am facing this issue even with the latest PyTorch nightly build. The DataLoader memory usage continuously increases until it runs out of memory. My dataset is 26 GB when initialized; it contains an ndarray from which I return an element based on the index value. After running on 10% of the data, it ends up using another 30+ GB of RAM and 40+ GB of swap space. I tried upgrading PyTorch to the latest nightly version and tried both Python 3.6 and Python 3.7, but the issue persists.
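For anyone trying to reproduce or track this, here is a small sketch (assuming psutil is installed; the dataset is just a stand-in for the real 26 GB ndarray) that watches resident memory of the main process and its workers during iteration:

import psutil
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100_000, 64))  # stand-in for the real data
loader = DataLoader(dataset, batch_size=256, num_workers=4)

main = psutil.Process()
for step, (batch,) in enumerate(loader):
    if step % 100 == 0:
        # Sum resident memory of the main process and its worker children.
        # Shared pages are counted more than once, so treat this as an upper bound.
        rss = main.memory_info().rss + sum(
            child.memory_info().rss for child in main.children(recursive=True)
        )
        print(f"step {step}: ~{rss / 1024 ** 3:.2f} GB resident")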
As of PyTorch 1.2.0, multiple workers no longer make multiple in-memory copies of the DataLoader object, but such problems still show up sometimes. I find that setting num_workers below the number of physical CPU cores works fine.
I just load the whole dataset in __init__(); everything else is as usual. Be sure to keep num_workers below the number of physical CPU cores. I don't know why, but it works well when your memory is limited. E.g., my dataset is about 30 GB, and I can run on a two-GPU machine with 80 GB of memory, with num_workers set to 10 and 12 physical CPU cores.
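If you want to follow that rule programmatically, a small sketch could look like this (psutil is optional and only used to count physical rather than logical cores; the margin of 2 is just my guess at a reasonable buffer for the main process):

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

try:
    import psutil
    physical_cores = psutil.cpu_count(logical=False)  # physical cores, not hyperthreads
except ImportError:
    physical_cores = None
physical_cores = physical_cores or os.cpu_count() or 2

# Stay strictly below the physical core count, as suggested above.
num_workers = max(1, physical_cores - 2)

dataset = TensorDataset(torch.arange(1_000).unsqueeze(1))
loader = DataLoader(dataset, batch_size=32, num_workers=num_workers)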
It is a consequence of how Python itself works: with native Python structures like dict or list, CPython's reference counting writes to every object the workers touch, so memory pages shared copy-on-write with forked workers get duplicated and memory appears to grow. I have developed a NEW TOOL called cstl (GitHub - fuzihaofzh/cstl: The C++ Standard Template Library (STL) for Python.) by wrapping C++ STL containers to solve this issue. It supports multiple types, including nested maps and nested lists, which NumPy and PyTorch do not support.
Here is a simple example showing how it solves the problem:
from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch
import copy
import sys
import cstl
from tqdm.auto import tqdm

class DataIter(Dataset):
    def __init__(self):
        cnt = 24000000
        self.cnt = cnt
        #self.data = np.array([x for x in range(cnt)]) # Good
        #self.data = [x for x in range(cnt)] # Leaky
        #self.data = cstl.MapIntInt({i: i for i in range(24000000)}) # Good
        self.data = cstl.VecInt(range(24000000)) # Good

    def __len__(self):
        return self.cnt

    def __getitem__(self, idx):
        data = self.data[idx]
        data = np.array([int(data)], dtype=np.int64)
        return torch.tensor(data)

train_data = DataIter()
train_loader = DataLoader(train_data, batch_size=300,
                          shuffle=True,
                          drop_last=True,
                          pin_memory=False,
                          num_workers=18)

for i, item in tqdm(enumerate(train_loader)):
    torch.cuda.empty_cache()
    if i % 1000 == 0:
        print(i)
Hey guys, has anyone found the root cause of why memory usage rises so much with increasing num_workers? I did some research and found that memory gets high because of the tensors we keep in CPU memory and transfer to the GPU during training, but it somehow releases them after some time; I'm not sure how it does that.
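One likely contributor (not necessarily the whole story) is prefetching: in recent PyTorch versions each worker keeps up to prefetch_factor batches ready (default 2), so roughly num_workers * prefetch_factor batches can sit in CPU RAM at the same time, and that footprint grows with num_workers. A minimal sketch of shrinking that buffer (the numbers here are illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(50_000, 128))

# With num_workers=8 and the default prefetch_factor=2, up to ~16 batches
# can be queued in CPU memory at once. Reducing either value lowers the
# steady-state footprint at the cost of loading throughput.
loader = DataLoader(
    dataset,
    batch_size=512,
    num_workers=2,       # fewer workers -> fewer in-flight batches
    prefetch_factor=1,   # only valid when num_workers > 0; default is 2
    pin_memory=False,
)

for (batch,) in loader:
    pass  # batches are released once nothing references them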