How to share a cache among multiple subprocesses when using PyTorch DDP training

I want to share a cache among multiple processes on the same node when using DDP training. Suppose there are N nodes: we need to create N caches, one per node, and the subprocesses on a given node share that node's cache.
If necessary, the processes on one node should also be able to access the data in the caches of other nodes.

One potential solution I found is to use shared memory with torch.multiprocessing, as described in How to share data among DataLoader processes to save memory.
However, that approach is not convenient when training on multiple nodes, so I launch DDP training with torch.distributed.launch rather than mp.spawn.
The question is: how can I share a cache among multiple subprocesses on the same node when the processes are set up with torch.distributed.init_process_group rather than with multiprocessing?
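For concreteness, here is a rough sketch of the kind of thing I have in mind. It assumes LOCAL_RANK is set by torch.distributed.launch/torchrun and uses Python's multiprocessing.shared_memory with a made-up block name ("ddp_node_cache"); it is only an illustration, not a vetted solution:

```python
# Minimal sketch: local rank 0 on each node creates a named shared-memory block,
# and the other local ranks on that node attach to it by name.
import os

import torch
import torch.distributed as dist
from multiprocessing import shared_memory


def attach_node_cache(num_bytes, name="ddp_node_cache"):
    local_rank = int(os.environ["LOCAL_RANK"])
    if local_rank == 0:
        # Created once per node; /dev/shm is node-local, so each node gets its own block.
        shm = shared_memory.SharedMemory(name=name, create=True, size=num_bytes)
    dist.barrier()  # make sure the block exists before the other local ranks attach
    if local_rank != 0:
        shm = shared_memory.SharedMemory(name=name)
    # Expose the raw buffer as a tensor; all local ranks now see the same memory.
    cache = torch.frombuffer(shm.buf, dtype=torch.uint8)
    return shm, cache


if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    shm, cache = attach_node_cache(num_bytes=1 << 20)  # 1 MiB, just for illustration
    # ... training loop: read/write `cache` as a per-node shared buffer ...

    dist.barrier()
    del cache         # release the exported buffer before closing the block
    shm.close()
    if local_rank == 0:
        shm.unlink()  # remove the block once every local rank has detached
```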

Has anyone encountered the same problem?

Hey @cindybrain, there is an ongoing discussion on this topic. See: [RFC] TorchStore - A Shared-Memory Tensor Store · Issue #64932 · pytorch/pytorch · GitHub

Does the proposed solution address your use case?

cc @cbalioglu

Thanks for your reply!
I think it is a great proposal, and it would solve my problem! I also wonder why it only supports the tensor type in shared memory. Could it be extended to share other data types?

Hi @cindybrain, happy to hear that!

In fact, the underlying API will allow you to store arbitrary blobs in shared memory. We deliberately limited the scope of the RFC to gather feedback on the proposal’s core functionality.
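As a rough illustration of the “arbitrary blob” idea (this is not the TorchStore API, which is still at the RFC stage), any picklable Python object can already be serialized into a shared-memory byte buffer and read back by another process on the same node; the block size and helper names below are made up:

```python
# Sketch only: store an arbitrary picklable object as a blob in shared memory.
import pickle
from multiprocessing import shared_memory


def write_blob(shm, obj):
    """Pickle `obj` into the shared block; return the number of bytes written."""
    payload = pickle.dumps(obj)
    if len(payload) > shm.size:
        raise ValueError("object does not fit in the shared-memory block")
    shm.buf[: len(payload)] = payload
    return len(payload)


def read_blob(shm, nbytes):
    """Unpickle an object previously written with write_blob."""
    return pickle.loads(bytes(shm.buf[:nbytes]))


# Single-process illustration; in DDP one local rank would write and the others read.
shm = shared_memory.SharedMemory(create=True, size=1 << 16)
n = write_blob(shm, {"labels": ["cat", "dog"], "epoch": 3})
print(read_blob(shm, n))  # -> {'labels': ['cat', 'dog'], 'epoch': 3}
shm.close()
shm.unlink()
```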