Is there a way for the DataLoader to re-use the previous worker processes across all epochs for reading?
I have a customized Dataset class that initializes a custom db each time and keeps the db instance in a global variable so it can be re-used between calls (the db object is cached per process). But for each epoch it seems a new worker is spawned that then needs a new db instance.
My issue is similar to this but different enough.
```python
import multiprocessing as mp
import torch.utils.data as data

_cur_data = None  # [pid, db] cached per worker process


class MyDataSet(data.Dataset):
    def __init__(self, path):
        self._path = path

    def __getitem__(self, index):
        return self.db[index]

    def __len__(self):
        return len(self.db)

    @property
    def db(self):
        pid = mp.current_process().pid
        global _cur_data
        # Re-use the cached db if it was created in this process
        if _cur_data is not None and _cur_data[0] == pid:
            return _cur_data[1]
        db = MyDB(self._path)  # MyDB is my custom db class
        _cur_data = [pid, db]
        return db
```
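For clarity, the caching logic in that `db` property can be demonstrated standalone; `FakeDB` and `get_db` below are hypothetical stand-ins for my `MyDB` class and the property, just to show that the instance is built only once per process:

```python
import multiprocessing as mp

_cur_db = None  # (pid, db) cached per process


class FakeDB:
    """Hypothetical stand-in for MyDB; counts how often it is built."""
    instances = 0

    def __init__(self, path):
        FakeDB.instances += 1
        self.path = path

    def __getitem__(self, index):
        return index * 2


def get_db(path):
    """Return the process-local cached db, creating it once per pid."""
    global _cur_db
    pid = mp.current_process().pid
    if _cur_db is not None and _cur_db[0] == pid:
        return _cur_db[1]
    db = FakeDB(path)
    _cur_db = (pid, db)
    return db


db1 = get_db("/tmp/data")
db2 = get_db("/tmp/data")
print(db1 is db2, FakeDB.instances)  # → True 1 (same process, one build)
```

Within one worker process this works fine; the problem is that when workers are torn down and re-spawned each epoch, the cache starts empty again.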
One MyDB class is a plain ZipFile, but I have others that need indexing at start-up, so instantiating them has noticeable overhead.