I want to share a cache among multiple processes on the same node during DDP training. Suppose there are N nodes: we would create N caches, one per node, and the subprocesses on a given node would share that node's cache.
If necessary, the processes on one node should also be able to access data in the caches of other nodes.
One potential solution I found is to use shared memory with torch.multiprocessing, as described in How to share data among DataLoader processes to save memory.
However, that approach is inconvenient for multi-node training, so I initialize DDP with torch.distributed.launch rather than mp.spawn.
The question is: how can I share a cache among multiple subprocesses on the same node when the processes are set up with torch.distributed.init_process_group rather than through the multiprocessing module?
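One direction I am considering, sketched below under some assumptions: since torch.distributed.launch starts independent processes (not forked workers), the processes could attach to a named shared-memory segment instead of inheriting one. This is only a minimal sketch using Python 3.8+ multiprocessing.shared_memory; the segment name node_cache_demo and the cache shape are made up, and it assumes LOCAL_RANK is available in the environment (torch.distributed.launch provides it with --use_env).

```python
import os
from multiprocessing import shared_memory

import numpy as np

# All of these names and sizes are illustrative, not part of any PyTorch API.
SHM_NAME = "node_cache_demo"          # must be unique per node
CACHE_SHAPE = (4, 4)
CACHE_DTYPE = np.float32
NBYTES = int(np.prod(CACHE_SHAPE)) * np.dtype(CACHE_DTYPE).itemsize

# Per-node rank of this process; default to 0 so the sketch also runs standalone.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

if local_rank == 0:
    # One process per node creates the segment ...
    shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=NBYTES)
else:
    # ... the others attach by name. In real training you would need a
    # synchronization point (e.g. torch.distributed.barrier()) so that
    # attaching only happens after the segment exists.
    shm = shared_memory.SharedMemory(name=SHM_NAME)

# View the shared bytes as a numpy array; every process on the node sees
# the same underlying memory.
cache = np.ndarray(CACHE_SHAPE, dtype=CACHE_DTYPE, buffer=shm.buf)
if local_rank == 0:
    cache[:] = 0.0                    # initialize exactly once
cache[0, 0] = 1.0                     # write visible to all node-local ranks

snapshot = cache.copy()               # copy out before detaching
del cache                             # release the exported buffer view
shm.close()
if local_rank == 0:
    shm.unlink()                      # only the creator removes the segment
```

Concurrent writers would still need their own locking, and cross-node access (the second requirement above) would need a separate mechanism such as collectives or an RPC layer, since shared memory is node-local.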
Has anyone encountered the same problem?