Shared data pool with DDP

CDhere · January 25, 2021, 9:04pm

Hi, I’m writing an algorithm where the data pool is affected by the outcome of the trained model at each epoch. This means that while I can use DDP to create copies of the model on GPUs, the data pool where the training samples are drawn from should be shared among all processes. Is there a good way to achieve this? How do I collect results from multiple processes to update the shared pool, before continuing to the next training epoch?

Many thanks!

rvarm1 · January 26, 2021, 12:30am

Hi, if you’re within a single node, you can probably coordinate these updates with something like a multiprocessing.Manager; see the multiprocessing (Process-based parallelism) page in the Python 3.9.1 documentation.

Alternatively, if your training is across different nodes entirely, you have a few options:

Have a per-node data pool and proceed with the approach above.
Have a global data pool that is replicated on each node with the multiprocessing Manager. You can then probably use PyTorch APIs such as dist.broadcast_object_list and dist.scatter_object_list to share the required data.
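
A rough sketch of the replicated-pool idea, not a definitive implementation. It assumes the pool is a plain dict of picklable values and that the default process group has already been initialized (as it is for DDP). The helper names sync_pool and broadcast_pool are made up for illustration. Note that sync_pool uses dist.all_gather_object rather than the scatter call mentioned above, since gathering matches the "collect results from every process" step more directly.

```python
import torch.distributed as dist


def sync_pool(pool, local_updates):
    """Merge every rank's updates into this rank's replica of the pool.

    Hypothetical helper: `pool` and `local_updates` are assumed to be dicts.
    """
    if dist.is_available() and dist.is_initialized():
        gathered = [None] * dist.get_world_size()
        # Each rank contributes one picklable object and receives all of them.
        # With the NCCL backend, each rank must have set its CUDA device first.
        dist.all_gather_object(gathered, local_updates)
    else:
        # Single-process fallback so the function can be tried without a launcher.
        gathered = [local_updates]
    # Updates are applied in rank order, so all replicas end up identical.
    for updates in gathered:
        pool.update(updates)
    return pool


def broadcast_pool(pool, src=0):
    """Alternative: let rank `src` own the pool and push it to everyone."""
    if dist.is_available() and dist.is_initialized():
        # Only the source rank's entry matters; the others are overwritten.
        holder = [pool if dist.get_rank() == src else None]
        dist.broadcast_object_list(holder, src=src)
        return holder[0]
    return pool
```

Every rank has to call these at the same point (for example at the end of each epoch), since both are collective operations. The objects are pickled on each call, so this probably suits a pool of indices or metadata better than a pool of large tensors.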

CDhere · January 29, 2021, 12:34am

Thank you for the suggestion! So far I’m satisfied with the single-machine, multi-GPU setting. I did a bit of research, and here’s a good post on how to use mp.Manager, in case anyone else is interested: https://stackoverflow.com/questions/10415028/how-can-i-recover-the-return-value-of-a-function-passed-to-multiprocessing-proce. BTW, I should point out that torch.multiprocessing is just a thin wrapper around Python’s native multiprocessing, so most things are similar and can be used in the same style.

Basically, a dict can be shared among the spawned processes, which suits my case. I guess other picklable types, like a list, will also do, though.
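
A minimal sketch of the shared-dict pattern, using only the standard library so it runs without GPUs. The function names, the keying of the pool by rank, and the placeholder "training" step are all assumptions for illustration; in a real DDP script the worker would be the per-GPU training function, and the Manager proxies would be passed to it as arguments in the same way.

```python
import multiprocessing as mp


def update_pool(rank, pool, lock, new_samples):
    """Merge one worker's results into the shared pool.

    `pool` is any dict-like object. In the multi-process case it is a
    Manager dict proxy, so the write is visible to every other process.
    """
    with lock:
        # Reassign the whole value: mutating a nested list in place would
        # not be propagated through the proxy.
        pool[rank] = list(pool.get(rank, [])) + list(new_samples)


def worker(rank, pool, lock):
    # Placeholder for "train one epoch, then decide which samples to add".
    new_samples = [f"sample-from-rank-{rank}"]
    update_pool(rank, pool, lock, new_samples)


def main(world_size=2):
    with mp.Manager() as manager:
        pool = manager.dict()   # shared across processes via the manager
        lock = manager.Lock()   # guards read-modify-write updates
        procs = [
            mp.Process(target=worker, args=(rank, pool, lock))
            for rank in range(world_size)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()            # wait for every worker before reading the pool
        return dict(pool)       # copy out before the manager shuts down


if __name__ == "__main__":
    print(main())
```

Joining all workers before reading the pool plays the role of the per-epoch synchronization point. Every access goes through the manager process, so this is convenient for small bookkeeping data but likely slow for large tensors.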

Will be back and update this thread once I’ve extended to a multi-machine setting.

mrshenli · January 29, 2021, 3:37pm

For multi-machine data sharing, one option would be to let one process serve as the data store and use torch.distributed.rpc to push/pull data across processes.
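
A hedged sketch of what the data-store idea could look like. The worker names ("store", "trainer1", ...), the push/pull helpers, and the list-based pool are illustrative assumptions, not an established recipe. It also assumes MASTER_ADDR and MASTER_PORT are set in the environment of every process.

```python
import threading

import torch.distributed.rpc as rpc

# Module-level state: only the copy in the "store" process is ever used.
_POOL = []
_LOCK = threading.Lock()  # RPC requests may be served from several threads


def push(samples):
    """Runs on the store: append samples and return the new pool size."""
    with _LOCK:
        _POOL.extend(samples)
        return len(_POOL)


def pull():
    """Runs on the store: return a snapshot of the pool."""
    with _LOCK:
        return list(_POOL)


def run(rank, world_size):
    # Rank 0 acts as the data store; every other rank is a trainer.
    name = "store" if rank == 0 else f"trainer{rank}"
    rpc.init_rpc(name, rank=rank, world_size=world_size)
    if rank != 0:
        # Placeholder for a training epoch that produces new samples.
        rpc.rpc_sync("store", push, args=([f"sample-from-{name}"],))
        current_pool = rpc.rpc_sync("store", pull)
        print(name, "sees", len(current_pool), "samples")
    # The default graceful shutdown waits until all workers have finished,
    # so the store stays alive while trainers are still calling it.
    rpc.shutdown()
```

On one machine this could be launched with torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size). Across machines, each process would call run with its own global rank.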

External_happy · February 12, 2024, 4:06pm

Can you give more details on how you did that?