What would be the right way to do the following, probably using torch.multiprocessing:
Have one torch.nn.Module. Have it train in parallel on multiple CPU cores. Each thread or process computes policy gradients online on a couple of episodes in some RL environment. Gradients are summed or averaged and the optimizer update step is done synchronously.
I could copy the module object to subprocesses and manually collect and combine the gradients, but I feel like there should be a clean and simple solution. I’m looking for something like the setup in the torch.distributed tutorial, just on a single machine and without a distributed communication framework.