Is there a way for DataLoader to re-use the previous worker processes across all epochs for reading?
I have a customized Dataset
class that initializes a custom db each time and keeps the db instance in a global variable so it can be re-used between calls (within the same process the db object is cached). But for each epoch it seems a new worker is spawned that then needs a new db instance.
My issue is similar to this but different enough.
import multiprocessing as mp

from torch.utils import data

_cur_data = None  # [pid, db] cached per process


class MyDataSet(data.Dataset):
    def __init__(self, path):
        self._path = path

    def __getitem__(self, index):
        return self.db[index]

    def __len__(self):
        return len(self.db)

    @property
    def db(self):
        # Re-use the cached db only if it was built in this process;
        # a different pid means we are in a fresh worker.
        proc = mp.current_process()
        pid = proc.pid
        opid = None
        global _cur_data
        if _cur_data:
            opid = _cur_data[0]
            if opid == pid:
                return _cur_data[1]
            _cur_data = None
        db = MyDB(self._path)
        _cur_data = [pid, db]
        return db
One variant of MyDB
is a plain ZipFile, but I have others that need indexing at start-up, so instantiating them carries real overhead.
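The pid-keyed caching that the db property implements can be exercised on its own; a minimal sketch, with a hypothetical SlowDB standing in for MyDB (names here are illustrative, not from the original):

```python
import os

_cur_data = None  # (pid, db) cached per process


class SlowDB:
    """Hypothetical stand-in for an expensive-to-build MyDB."""
    instances = 0

    def __init__(self, path):
        SlowDB.instances += 1  # count how many times we paid the cost
        self.path = path

    def __getitem__(self, i):
        return (self.path, i)


def get_db(path):
    # Same pattern as the property above: re-use the cached db only
    # if it was created in this process.
    global _cur_data
    pid = os.getpid()
    if _cur_data is not None and _cur_data[0] == pid:
        return _cur_data[1]
    db = SlowDB(path)
    _cur_data = (pid, db)
    return db


# Within one process, repeated calls hit the cache:
a = get_db("data.zip")
b = get_db("data.zip")
print(a is b)            # same object both times
print(SlowDB.instances)  # the constructor ran only once
```

The cache amortizes the construction cost within one process, but it is defeated as soon as each epoch runs in a worker with a new pid.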