Loading huge data functionality

@apaszke Hi, I am a newcomer to this topic, and thank you for all the helpful replies.

However, my situation is not covered in the discussion. I have serialized video label data (frame ID + bbox + keypoints). No actual image data is included, so each frame is very light, but I have millions of frames and they add up to about 60G.

My server has about 250G of CPU memory in total, which seems like plenty. But when I use DDP in PyTorch, the data is loaded independently by 8 processes, which gives an OOM error (since 60 * 8 > 250).

The data is stored as protobuf, and it is not practical to split it, since the algorithm needs to truncate parts of the data frame. The whole dataset should therefore be accessible to each of the 8 DDP processes (started by torch.distributed.launch).

In short, I only need 60G of memory, but DDP starts eight independent processes and each keeps its own copy, which causes the out-of-memory error. Is there any way to share the same CPU memory across all processes started by the torch launcher? Thanks a lot~
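One common workaround is to decode the protobuf once, dump the labels into a flat numpy file, and memory-map that file in every rank: the OS page cache then backs all 8 processes with a single physical copy instead of 8 private ones. A minimal sketch, assuming the records can be flattened into a fixed-width float array and using a hypothetical labels.npy path:

import numpy as np
import torch
from torch.utils.data import Dataset


class MMapLabelDataset(Dataset):
    """Serve label rows from a memory-mapped .npy file; the mapped pages are
    shared between all DDP ranks, so the 60G is paid only once per machine."""

    def __init__(self, path='labels.npy'):          # hypothetical file
        self.labels = np.load(path, mmap_mode='r')  # no copy into RAM here

    def __getitem__(self, idx):
        row = np.array(self.labels[idx])            # copy only this row
        return torch.from_numpy(row)

    def __len__(self):
        return len(self.labels)

Each rank can then build its usual DataLoader/DistributedSampler on top of this dataset; only the rows that are actually read get paged in, and those pages are shared across the ranks.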

When I used a Dataset and DataLoader to load the data, I found that the data read with the code below does not release its CPU memory at the start of the next iteration, and the memory usage peaks at the end of the first epoch. When my dataset exceeds the machine memory, this causes an OOM error. How can I avoid it?

# -*- coding: utf-8 -*-
# @Time    : 2019/8/23 21:54
# @Author  : zhoujun
import numpy as np
from torch.utils.data import DataLoader, Dataset


class DemoDataset(Dataset):
    def __init__(self, transform=None, **kwargs):
        # the whole dataset is allocated up front in __init__
        self.data_list = np.zeros((100, 640, 640, 3))
        self.transform = transform

    def __getitem__(self, idx):
        img = self.data_list[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img

    def __len__(self):
        return len(self.data_list)


if __name__ == '__main__':
    import time
    from torchvision import transforms

    train_data = DemoDataset(transform=transforms.ToTensor())
    train_loader = DataLoader(dataset=train_data, batch_size=1, shuffle=True, num_workers=0)
    for e in range(2):
        start = time.time()
        for i, data in enumerate(train_loader):
            if e == 0 and i == len(train_loader) - 1:
                print(time.time() - start)

[memory usage plot: memory.jpg]

Run the script, and the figure will show:

 mprof run demo.py
 mprof plot
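If mprof is not available, the same trend can also be tracked with psutil (a rough sketch, assuming psutil is installed), printing the resident set size inside the training loop:

import os
import psutil

_proc = psutil.Process(os.getpid())


def print_rss(tag):
    # resident set size (CPU RAM) of the current process, in MiB
    print(tag, _proc.memory_info().rss / 1024 ** 2, 'MiB')

Calling print_rss('epoch {}'.format(e)) at the end of each epoch shows whether the RSS keeps growing or levels off after the first epoch.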

Is the plot showing the GPU memory or the CPU RAM?
I'm not familiar with mprof, but does it show the complete memory usage of the script, i.e. including loaded libraries?
You are initializing the data_list in your __init__ method, which alone takes roughly 940MB (np.zeros defaults to float64).
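That figure can be checked directly (quick sketch):

import numpy as np

a = np.zeros((100, 640, 640, 3))             # np.zeros defaults to float64
print(a.dtype, a.nbytes / 1024 ** 2, 'MiB')  # float64 -> about 937.5 MiB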

The plot shows the CPU RAM.

What does the graph show if you reduce the data_list to a tensor containing a single float value?

Here is the graph with the following code:

# -*- coding: utf-8 -*-
# @Time    : 2019/8/23 21:54
# @Author  : zhoujun
import numpy as np
from torch.utils.data import DataLoader, Dataset


class DemoDataset(Dataset):
    def __init__(self):
        pass

    def __getitem__(self, idx):
        # allocate a fresh array per sample instead of holding it in __init__
        img = np.zeros((1, 3, 640, 640))
        return img

    def __len__(self):
        return 10000


if __name__ == '__main__':
    import time
    train_data = DemoDataset()
    train_loader = DataLoader(dataset=train_data, batch_size=1, num_workers=0)
    start_run = time.time()
    for e in range(10):
        for i, img in enumerate(train_loader):
            if i == 0:
                print('epoch: {} start time: {}'.format(e, time.time() - start_run))

# -*- coding: utf-8 -*-
# @Time    : 2019/8/23 21:54
# @Author  : zhoujun
import numpy as np
from torch.utils.data import DataLoader, Dataset


class DemoDataset(Dataset):
    def __init__(self):
        pass

    def __getitem__(self, idx):
        # return a single scalar instead of the (1, 3, 640, 640) array above
        img = 1  # np.zeros((1, 3, 640, 640))
        return img

    def __len__(self):
        return 10000


if __name__ == '__main__':
    import time
    train_data = DemoDataset()
    train_loader = DataLoader(dataset=train_data, batch_size=1, num_workers=0)
    start_run = time.time()
    for e in range(10):
        for i, img in enumerate(train_loader):
            if i == 0:
                print('epoch: {} start time: {}'.format(e, time.time() - start_run))

Hi @apaszke, I load 1K+ files the same way via the DataLoader, so each __getitem__ call returns one file; however, I always get an error like:

File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/miniconda/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 90300) is killed by signal: Killed.

It seems each DataLoader worker has either a memory or a time limit? If we put some parsing logic there (in our case we read a parquet file), will it break?
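A worker killed by "signal: Killed" usually points to the kernel OOM killer rather than any DataLoader limit, so one mitigation is to avoid materializing a whole parquet file inside __getitem__. A rough sketch, assuming pyarrow and a hypothetical list of file paths, that indexes individual row groups instead of files:

import pyarrow.parquet as pq
from torch.utils.data import Dataset


class RowGroupDataset(Dataset):
    """Index (file, row group) pairs so each worker only holds one row
    group in memory at a time instead of a full parquet file."""

    def __init__(self, paths):                      # hypothetical paths
        self.index = []
        for p in paths:
            n_groups = pq.ParquetFile(p).num_row_groups
            self.index += [(p, g) for g in range(n_groups)]

    def __getitem__(self, idx):
        path, group = self.index[idx]
        table = pq.ParquetFile(path).read_row_group(group)
        # convert to tensors here, or use a custom collate_fn downstream
        return table.to_pandas()

    def __len__(self):
        return len(self.index)

This keeps the per-worker peak memory close to the size of one row group, which is usually what decides whether the OOM killer fires.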

I have checked the source code for the DataLoader; basically we could set a timeout:

    def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
        # Tries to fetch data from `self._data_queue` once for a given timeout.
        # This can also be used as inner loop of fetching without timeout, with
        # the sender status as the loop condition.
        #
        # This raises a `RuntimeError` if any worker died unexpectedly. This error
        # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
        # (only for non-Windows platforms), or the manual check below on errors
        # and timeouts.
        #
        # Returns a 2-tuple:
        #   (bool: whether successfully get data, any: data if successful else None)
        try:
            data = self._data_queue.get(timeout=timeout)

The default is 5 seconds:

MP_STATUS_CHECK_INTERVAL = 5.0
r"""Interval (in seconds) to check status of processes to avoid hanging in
    multiprocessing data loading. This is mainly used in getting data from
    another process, in which case we need to periodically check whether the
    sender is alive to prevent hanging."""

What does this interval mean? I tried to increase the timeout in the multi-worker definition but got the same error; the only difference is that the final error message becomes:

RuntimeError: DataLoader worker (pid 320) is killed by signal: Segmentation fault.
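For reference, MP_STATUS_CHECK_INTERVAL is only the polling interval the loader uses to check whether its workers are still alive; the user-facing knob is the timeout argument of the DataLoader itself. A small sketch (the 120 s value is just an example, reusing the DemoDataset defined earlier in this thread):

from torch.utils.data import DataLoader

train_data = DemoDataset()
# `timeout` is the per-batch limit for collecting data from the workers;
# 0 (the default) means wait indefinitely.
train_loader = DataLoader(train_data, batch_size=1, num_workers=4, timeout=120)

Note that a longer timeout only helps when workers are slow; if a worker process is killed (as in the tracebacks above), the RuntimeError is raised regardless of the timeout value.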
