Parallel online policy gradient at the Module level with torch.multiprocessing

What would be the right way to do the following, probably using torch.multiprocessing:

Take one torch.nn.Module and train it in parallel on multiple CPU cores. Each worker process computes policy gradients online on a couple of episodes in some RL environment; the gradients are then summed or averaged, and the optimizer update step is applied synchronously.

I could copy the module object to subprocesses and manually collect and combine the gradients, but I feel like there should be a cleaner, simpler solution. I'm looking for something like the torch.distributed tutorial, just on a single machine and without a distributed communication framework.
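For concreteness, here is a minimal sketch of the "copy the module and combine gradients manually" approach I have in mind, using torch.multiprocessing. `MyPolicy` and `run_episodes` are hypothetical placeholders for the policy network and for rolling out a few episodes and building the policy-gradient loss; the rest is standard queue/barrier plumbing, not a polished implementation:

```python
import torch
import torch.multiprocessing as mp


def worker(policy, grad_queue, barrier, num_rounds):
    for _ in range(num_rounds):
        policy.zero_grad()
        loss = run_episodes(policy)   # hypothetical: roll out a few episodes, build the PG loss
        loss.backward()
        # send a detached copy of this worker's gradients to the parent process
        grad_queue.put([p.grad.detach().clone() for p in policy.parameters()])
        barrier.wait()                # wait until the parent has applied the update


if __name__ == "__main__":
    num_workers, num_rounds = 4, 1000
    policy = MyPolicy()               # hypothetical nn.Module
    policy.share_memory()             # so workers see the parent's parameter updates
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    grad_queue = mp.SimpleQueue()
    barrier = mp.Barrier(num_workers + 1)
    procs = [mp.Process(target=worker, args=(policy, grad_queue, barrier, num_rounds))
             for _ in range(num_workers)]
    for p in procs:
        p.start()

    for _ in range(num_rounds):
        # gather one gradient set per worker, average them, then step synchronously
        grads = [grad_queue.get() for _ in range(num_workers)]
        for param, *worker_grads in zip(policy.parameters(), *grads):
            param.grad = torch.stack(worker_grads).mean(dim=0)
        optimizer.step()
        barrier.wait()                # release the workers for the next round

    for p in procs:
        p.join()
```

This works, but the gradient averaging, the queue, and the barrier are all hand-rolled, which is exactly what I was hoping to avoid.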

Does this example using model.share_memory() help?
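As far as I understand, the share_memory() example is Hogwild-style: parameters live in shared memory and every worker steps its own optimizer asynchronously, so there is no single synchronous averaged update. Roughly (again with `MyPolicy` and `run_episodes` as hypothetical placeholders):

```python
import torch
import torch.multiprocessing as mp


def train_on_episodes(policy):
    optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)
    for _ in range(100):
        optimizer.zero_grad()
        loss = run_episodes(policy)   # hypothetical episode rollout + loss
        loss.backward()
        optimizer.step()              # updates the shared parameters in place


if __name__ == "__main__":
    policy = MyPolicy()               # hypothetical nn.Module
    policy.share_memory()             # parameters are shared; gradients stay local to each worker
    procs = [mp.Process(target=train_on_episodes, args=(policy,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

That is asynchronous rather than the synchronous averaged step I described above, so I'm not sure it is what I want.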