Hi, I’m writing an algorithm where the data pool is affected by the outcome of the trained model at each epoch. This means that while I can use DDP to create copies of the model on GPUs, the data pool that training samples are drawn from needs to be shared among all processes. Is there a good way to achieve this? How do I collect results from multiple processes to update the shared pool before continuing to the next training epoch?
Hi, if you’re within a single node you can probably coordinate these updates with something like a multiprocessing.Manager; see the multiprocessing (Process-based parallelism) documentation.
Alternatively, if your training is across different nodes entirely, you have a few options:
- Have a per-node data pool, and just proceed with the above approach
- Have a global data pool that is replicated across each node with the MP manager. You can then use PyTorch APIs such as dist.scatter_object_list to share the required data.
Thank you for the suggestion! So far I’m satisfied with the single-machine, multi-GPU setting. I did a bit of research, and here’s a good post on how to use mp.Manager, in case anyone else is interested: https://stackoverflow.com/questions/10415028/how-can-i-recover-the-return-value-of-a-function-passed-to-multiprocessing-proce. BTW I should point out that torch.multiprocessing is just a thin wrapper around Python’s native multiprocessing, so most things are similar and can be used directly in the same style.
Basically a dict can be shared among the spawned processes, which suits my case. I guess any picklable container like a list would also work, though.
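For anyone landing here later, the collect-then-update loop from the original question can be sketched like this with the stdlib `multiprocessing`; `train_epoch`, the toy pool, and the merge rule are hypothetical placeholders, not the actual training code:

```python
import multiprocessing as mp

def train_epoch(rank, pool_snapshot, results):
    # Hypothetical per-rank work: "train" on the current pool and
    # propose new samples, keyed by rank so the parent can collect them.
    results[rank] = [x + rank for x in pool_snapshot[:2]]

def run_training(num_epochs=2, world_size=2):
    pool = list(range(4))  # toy stand-in for the shared data pool
    with mp.Manager() as manager:
        for epoch in range(num_epochs):
            results = manager.dict()
            procs = [mp.Process(target=train_epoch, args=(r, pool, results))
                     for r in range(world_size)]
            for p in procs:
                p.start()
            for p in procs:
                p.join()
            # Merge every rank's proposals into the pool before the
            # next epoch; iterate ranks for a deterministic order.
            for r in range(world_size):
                pool.extend(results[r])
    return pool

if __name__ == "__main__":
    print(run_training())
```

The key point is that the pool update happens in the parent between epochs, after all workers have joined, so the next epoch starts from a consistent pool.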
Will be back and update this thread once I’ve extended to a multi-machine setting.
For multi-machine data sharing, one option would be to let one process serve as the data store, and use torch.distributed.rpc to push/pull data across processes.