I am working on a PyTorch project built on mmdetection. In this project, the ground truths are generated from a huge annotation file that has to be loaded into memory before training starts, as illustrated in the code below:
import os
from annotation_handler import preload_annotation
# ...
assert os.path.exists(cfg.data.annotation_file)
preload_annotation(cfg.data.annotation_file)
runner.train()
In annotation_handler.py:

# module-level cache: annotation path -> parsed annotations
ANNOS = dict()

def preload_annotation(path):
    ANNOS[path] = load(path)

def get_annotation(query, anno_path):
    if anno_path not in ANNOS:
        # NOTE: load() is costly in time
        ANNOS[anno_path] = load(anno_path)
    gts = generate_proper_groundtruth(query, ANNOS[anno_path])
    return gts
In the dataset:

from torch.utils.data import Dataset
from annotation_handler import get_annotation

class MyDataset(Dataset):
    def __getitem__(self, idx):
        source = load_source(idx)
        query = self.querys[idx]
        target = get_annotation(query, self.anno_path)
        return source, target
My implementation is meant to load the large file only once. However, with distributed training I found that the annotation file is reloaded at the beginning of every epoch, and the number of reloads equals the number of DataLoader workers * the number of GPUs. As a result, the data loading time increases dramatically at the start of each epoch.
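To check which process actually performs each reload, I imagine a small diagnostic like the sketch below. This is illustrative only: load_with_logging is a hypothetical wrapper, and load() is my project's costly loader; get_worker_info() and os.getpid() are the standard PyTorch/stdlib calls.

import os
from torch.utils.data import get_worker_info

def load_with_logging(path):
    # Hypothetical wrapper around my project's load(): prints which process
    # does the costly work. get_worker_info() returns None in the main
    # process and a WorkerInfo object inside a DataLoader worker.
    info = get_worker_info()
    worker_id = "main" if info is None else info.id
    print(f"[pid={os.getpid()} worker={worker_id}] loading {path}")
    return load(path)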
- How can I ensure the large annotation file is loaded only once, at the beginning of training?
- How can I share the large annotation across the multiple subprocesses started via torch.multiprocessing?
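For reference, here is a minimal sketch of the direction I am considering; I am not sure it is correct. It assumes Linux's default fork start method (so worker processes would inherit the already-populated ANNOS via copy-on-write) and PyTorch >= 1.7 for persistent_workers; the DistributedSampler usage and DataLoader arguments are illustrative, not my exact config.

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from annotation_handler import preload_annotation

# Populate the module-level ANNOS cache in the main process of each rank,
# BEFORE the DataLoader forks its worker processes.
preload_annotation(cfg.data.annotation_file)

dataset = MyDataset(...)
sampler = DistributedSampler(dataset)
loader = DataLoader(
    dataset,
    batch_size=2,
    sampler=sampler,
    num_workers=4,
    # keep workers alive across epochs instead of recreating them each epoch
    persistent_workers=True,
)

My understanding is that with fork each worker would at least start from the parent's populated cache, whereas with spawn the module is re-imported and ANNOS starts empty in every worker, which is why I am asking how to share it properly through torch.multiprocessing.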