Is there a way for DataLoader to re-use the previous worker processes across all epochs for reading?
I have a customized Dataset
class that initializes a custom db each time and keeps the db instance in a global variable so it can be re-used between calls (within the same process the db object is cached). But for each epoch it seems a new worker is spawned that then needs a new db instance.
My issue is similar to this but different enough.
import multiprocessing as mp

from torch.utils import data

_cur_data = None  # [pid, db] cached per process


class MyDataSet(data.Dataset):
    def __init__(self, path):
        self._path = path

    def __getitem__(self, index):
        return self.db[index]

    def __len__(self):
        return len(self.db)

    @property
    def db(self):
        # Re-use the cached db only if it was built in this process;
        # a different pid means we are in a fresh worker.
        proc = mp.current_process()
        pid = proc.pid
        opid = None
        global _cur_data
        if _cur_data:
            opid = _cur_data[0]
            if opid == pid:
                return _cur_data[1]
            _cur_data = None
        db = MyDB(self._path)
        _cur_data = [pid, db]
        return db
One variant of MyDB
is a plain ZipFile, but I have others that need indexing at start-up, so instantiating them carries real overhead.
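The pid-keyed caching that the db property implements can be exercised on its own; a minimal sketch, with a hypothetical SlowDB standing in for MyDB (names here are illustrative, not from the original):

```python
import os

_cur_data = None  # (pid, db) cached per process


class SlowDB:
    """Hypothetical stand-in for an expensive-to-build MyDB."""
    instances = 0

    def __init__(self, path):
        SlowDB.instances += 1  # count how many times we paid the cost
        self.path = path

    def __getitem__(self, i):
        return (self.path, i)


def get_db(path):
    # Same pattern as the property above: re-use the cached db only
    # if it was created in this process.
    global _cur_data
    pid = os.getpid()
    if _cur_data is not None and _cur_data[0] == pid:
        return _cur_data[1]
    db = SlowDB(path)
    _cur_data = (pid, db)
    return db


# Within one process, repeated calls hit the cache:
a = get_db("data.zip")
b = get_db("data.zip")
print(a is b)            # same object both times
print(SlowDB.instances)  # the constructor ran only once
```

The cache amortizes the construction cost within one process, but it is defeated as soon as each epoch runs in a worker with a new pid.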