Does num_workers in DataLoader increase memory usage?

I am using DataLoader to load my training data.
According to the documentation, we can set num_workers to specify the number of subprocesses and speed up loading.
However, when I use it like this:

dataloader = DataLoader(dataset, batch_size=args.batch_size,
                        shuffle=True, drop_last=False, num_workers=10,
                        collate_fn=dataset.collate_fn)

I found that memory usage keeps growing, which does not happen when I set num_workers=0.
Is this expected behavior? I have limited memory resources, so I don't want memory usage to keep growing.

P.S. My code runs on the GPU; every iteration I move a batch of data from the CPU to the GPU.
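
The loop looks roughly like this (a simplified sketch; the exact batch structure is assumed and the model/optimizer parts are omitted):

import torch

device = torch.device("cuda")

for batch in dataloader:
    inputs, targets = batch        # batch is assembled on the CPU by the workers (structure assumed)
    inputs = inputs.to(device)     # copy to GPU
    targets = targets.to(device)
    # ... forward / backward / optimizer step ...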

What does your dataset code look like?

I have the same problem as the original poster. Here is my topic:

Please help us.

I think that when using multiple workers, each of them loads its data on the CPU before sending it to the GPU, which increases the amount of data held in memory. It also creates a new process per worker.
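
You can see the extra processes directly. Here is a small standalone sketch (not the original poster's dataset) that prints which worker process produced each item, using torch.utils.data.get_worker_info():

import os
import torch
from torch.utils.data import Dataset, DataLoader

class ProbeDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        info = torch.utils.data.get_worker_info()
        worker_id = info.id if info is not None else "main"
        # Each worker runs in its own process, so the PIDs differ.
        print(f"item {idx} loaded by worker {worker_id} (pid {os.getpid()})")
        return torch.tensor(idx)

loader = DataLoader(ProbeDataset(), batch_size=2, num_workers=4)
for batch in loader:
    pass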

I found other similar questions like this. Here is one example:


In that thread, they said it was fixed recently at:

I am patching my local torch code accordingly. Hopefully this helps.

I am facing this issue even with the updated PyTorch nightly version. The DataLoader memory usage continuously increases until it runs out of memory. My Dataset is 26 GB when initialized; it contains an ndarray from which I return an element based on the index value. After running on 10% of the data it ends up using another 30+ GB of RAM and 40+ GB of swap space. I tried upgrading PyTorch to the latest nightly version and tried both Python 3.6 and Python 3.7, but the issue persists.
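
For reference, this is roughly how I track the usage (a rough sketch; psutil is assumed to be available and train_loader is my loader):

import os
import psutil  # assumed to be installed; used only for monitoring

proc = psutil.Process(os.getpid())

for i, batch in enumerate(train_loader):
    if i % 1000 == 0:
        rss_gb = proc.memory_info().rss / 1024 ** 3                               # main process
        workers_gb = sum(c.memory_info().rss for c in proc.children()) / 1024 ** 3  # worker processes
        print(f"step {i}: main RSS = {rss_gb:.2f} GB, workers RSS = {workers_gb:.2f} GB")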

Thanks,
Vishnu

As of PyTorch 1.2.0, multiple workers no longer make multiple in-memory copies of the dataloader object, but such problems still occur sometimes. I find that setting num_workers lower than the number of physical CPU cores works fine.
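
For example, one way to cap it (os.cpu_count() reports logical cores, so subtracting a couple is just a heuristic; dataset stands for whatever Dataset you use):

import os
from torch.utils.data import DataLoader

# os.cpu_count() returns the number of logical cores; keep num_workers
# below it (and below the physical core count) as a rule of thumb.
num_workers = max(1, os.cpu_count() - 2)

loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)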

How do you implement your dataset?
I am using PyTorch 1.5.0, and based on my observation it still makes memory copies.

I just load the whole dataset in the __init__() of my Dataset; everything else is as normal. Be sure to keep num_workers lower than the number of physical CPU cores. I don't know why, but it works well when memory is limited. E.g., my dataset is about 30 GB and I can run on a two-GPU machine with 80 GB of memory, with num_workers=10 and 12 physical cores.
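
A minimal sketch of that pattern, assuming the data fits into a single numpy array on disk (the path and array layout here are placeholders):

import numpy as np
import torch
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    def __init__(self, path="data.npy"):  # placeholder path
        # Load everything once up front; a single numpy array avoids the
        # per-element Python objects that get copied into each worker.
        self.data = np.load(path)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.from_numpy(np.asarray(self.data[idx]))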

It is a quirk of Python: reference counting in the forked worker processes triggers copy-on-write, so you may encounter this problem when using native Python structures like dict or list. I have developed a new tool called cstl (GitHub - fuzihaofzh/cstl: The C++ Standard Template Library (STL) for Python) that wraps C++ STL containers to solve this issue. It supports multiple types, including nested maps and nested lists, which numpy and PyTorch do not support.
Here is a simple example showing how it solves the problem:

from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch
import copy
import sys
import cstl
from tqdm.auto import tqdm


class DataIter(Dataset):
    def __init__(self):
        cnt = 24000000
        self.cnt = cnt
        # self.data = np.array([x for x in range(cnt)])           # Good
        # self.data = [x for x in range(cnt)]                     # Leaky
        # self.data = cstl.MapIntInt({i: i for i in range(cnt)})  # Good
        self.data = cstl.VecInt(range(cnt))                       # Good

    def __len__(self):
        return self.cnt

    def __getitem__(self, idx):
        data = self.data[idx]
        data = np.array([int(data)], dtype=np.int64)
        return torch.tensor(data)

train_data = DataIter()
train_loader = DataLoader(train_data, batch_size=300,
                          shuffle=True,
                          drop_last=True,
                          pin_memory=False,
                          num_workers=18)

for i, item in tqdm(enumerate(train_loader)):
    torch.cuda.empty_cache()
    if i % 1000 == 0:
        print(i)