I want to share a cache among multiple processes on the same node when using DDP training. Suppose there are N nodes: we need to create N caches, one per node, and the subprocesses on a given node share that node's cache.
If necessary, processes on one node should also be able to access the data in the caches of other nodes.
One potential solution I found is to use shared memory with torch.multiprocessing, as described in How to share data among DataLoader processes to save memory.
However, that approach is inconvenient for multi-node training, so I use
torch.distributed.launch rather than
mp.spawn to initialize DDP training.
The question is: how can I share a cache among the subprocesses on the same node when using
torch.distributed.init_process_group rather than the multiprocessing approach?
Has anyone encountered the same problem?
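For what it's worth, one direction I have been considering is Python's standard multiprocessing.shared_memory module (Python 3.8+): since torch.distributed.launch / torchrun spawn independent processes rather than forked children, the processes cannot inherit a shared object, but they can attach to a named shared-memory block created by local rank 0. Below is a minimal sketch of that idea; the cache name, shape, and dtype are my own placeholders, and the two "ranks" are simulated sequentially in one process purely for illustration (in real DDP code each rank would call open_node_cache once, with a dist.barrier() between creation and attachment):

```python
import numpy as np
from multiprocessing import shared_memory

CACHE_NAME = "node_cache"        # hypothetical name; should be unique per node
CACHE_SHAPE = (4, 4)             # placeholder cache layout
CACHE_DTYPE = np.float32

def open_node_cache(local_rank: int) -> shared_memory.SharedMemory:
    """Local rank 0 creates the block; other ranks on the node attach by name.
    With torch.distributed.launch/torchrun, local_rank comes from the
    LOCAL_RANK environment variable. In real DDP code, call dist.barrier()
    after rank 0 creates the block, so other ranks don't attach too early."""
    nbytes = int(np.prod(CACHE_SHAPE)) * np.dtype(CACHE_DTYPE).itemsize
    if local_rank == 0:
        return shared_memory.SharedMemory(name=CACHE_NAME, create=True, size=nbytes)
    return shared_memory.SharedMemory(name=CACHE_NAME)  # attach to existing block

# Simulated: "rank 0" creates and fills the cache.
shm0 = open_node_cache(local_rank=0)
arr0 = np.ndarray(CACHE_SHAPE, dtype=CACHE_DTYPE, buffer=shm0.buf)
arr0[:] = 1.0

# Simulated: "rank 1" attaches by name and sees rank 0's data.
shm1 = open_node_cache(local_rank=1)
arr1 = np.ndarray(CACHE_SHAPE, dtype=CACHE_DTYPE, buffer=shm1.buf)
total = float(arr1.sum())
print(total)  # → 16.0

del arr0, arr1
shm1.close()
shm0.close()
shm0.unlink()  # only the creating rank should unlink, at shutdown
```

Note this only covers intra-node sharing; for the cross-node access mentioned above, you would still need explicit communication (e.g. torch.distributed collectives or RPC), since shared memory does not cross machine boundaries.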