Multiprocessing: `optim.step()` in each subprocess vs. once in main process

I am doing RL where episodes are run in parallel using multiprocessing. Here is some minimal pseudo-code that reflects the workflow, where optim.step() is called in each subprocess:

from torch.multiprocessing import Process, SimpleQueue  # plain multiprocessing exposes the same API


def episode_runner(in_q, out_q, optim, model):
    env = MyEnv()
    while episode_data := in_q.get():  # a None sentinel ends the loop
        trajectory = run_episode(env, episode_data, model)

        optim.zero_grad()
        loss = compute_loss(trajectory)
        loss.backward()
        optim.step()
        
        out_q.put(loss.item())


def launch_subprocesses(n_procs, optim, model):
    procs = []
    in_qs = []
    out_qs = []
    for i in range(n_procs):
        in_q = SimpleQueue()
        in_qs += [in_q]
        out_q = SimpleQueue()
        out_qs += [out_q]
        proc = Process(
            target=episode_runner, 
            args=(in_q, out_q, optim, model))
        proc.start()
        procs += [proc]
    return procs, in_qs, out_qs


def terminate_subprocesses(in_qs, procs):
    for in_q in in_qs:
        in_q.put(None)

    for proc in procs:
        proc.join()


def training_loop(n_procs, optim, model, n_epochs):
    procs, in_qs, out_qs = \
        launch_subprocesses(n_procs, optim, model)

    for epoch in range(n_epochs):
        episode_data = generate_data()

        for in_q in in_qs:
            in_q.put(episode_data)

        losses = [out_q.get() for out_q in out_qs]

    terminate_subprocesses(in_qs, procs)

Alternatively, rather than calling optim.step() within each subprocess, I can call it once in the main process after letting the gradients accumulate:

def episode_runner(in_q, out_q, model):
    env = MyEnv()
    while episode_data := in_q.get():
        trajectory = run_episode(env, episode_data, model)

        loss = compute_loss(trajectory)
        loss.backward()  # grads must live in shared memory for the main process to see them

        out_q.put(loss.item())


def training_loop(n_procs, optim, model, n_epochs):
    # a launch_subprocesses variant that does not hand optim to the workers
    procs, in_qs, out_qs = \
        launch_subprocesses(n_procs, model)

    for epoch in range(n_epochs):
        episode_data = generate_data()

        optim.zero_grad()  # clear the shared grads before the workers start accumulating
        for in_q in in_qs:
            in_q.put(episode_data)

        # out_q.get() blocks until every worker has finished its backward()
        losses = [out_q.get() for out_q in out_qs]
        optim.step()

    terminate_subprocesses(in_qs, procs)

My model is learning properly with either approach, and so I wonder: which one should I prefer? Does it make any difference?

Hi @ArchieGertsman
I guess both are fine, but there are a few caveats with these approaches.
One is that you are likely to update the policy while it is still being run in the environment by another process.
If you’re in an on-policy setting this may be an issue (although I assume you’re only doing small update steps).
Another issue with approach 1 is that optimizers are not really meant to be dispatched across processes like this, IIRC. With Adam, for instance, each process ends up keeping its own estimate of the running average and second moment of your variables, which is suboptimal. Approach 2 should not have this problem, but it is more synchronous than approach 1 (and hence perhaps a bit slower to run).
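
To make that concrete, here is a tiny, self-contained illustration (a toy Linear model, not your code): each worker ends up with its own copy of the optimizer, so even though the parameters live in shared memory, the exp_avg / exp_avg_sq buffers that Adam creates in optim.state are local to each process.

import os
import torch
import torch.multiprocessing as mp


def worker(model):
    # each process builds (or receives a copy of) its own Adam,
    # so its moment estimates are local to that process
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = model(torch.randn(4, 8)).sum()
    loss.backward()
    optim.step()
    state = optim.state[next(model.parameters())]
    print(os.getpid(), state["exp_avg"].mean().item(), state["exp_avg_sq"].mean().item())


if __name__ == "__main__":
    model = torch.nn.Linear(8, 2)
    model.share_memory()  # the parameters are shared, the optimizer state is not
    procs = [mp.Process(target=worker, args=(model,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()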

I would suggest having a look at the torch.distributed docs to see what the best practice is for sharing a model across workers.
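
For reference, here is a rough sketch of what that could look like with DistributedDataParallel. It is only one possible setup (gloo backend on CPU, a hypothetical MyModel, and it assumes run_episode calls the wrapped model's forward so that DDP can hook the gradient all-reduce):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size, n_epochs):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = MyModel()  # hypothetical; DDP broadcasts rank 0's weights to all ranks
    ddp_model = DDP(model)
    optim = torch.optim.Adam(ddp_model.parameters())
    env = MyEnv()

    for epoch in range(n_epochs):
        episode_data = generate_data()  # or broadcast it from rank 0
        trajectory = run_episode(env, episode_data, ddp_model)

        optim.zero_grad()
        loss = compute_loss(trajectory)
        loss.backward()  # DDP averages the gradients across ranks here
        optim.step()     # every rank applies the same update

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size, 10), nprocs=world_size)

Because each rank sees the same averaged gradients and starts from the same weights, every rank's Adam keeps identical moment estimates, which sidesteps the issue with approach 1.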

Hope that helps!

Thanks @vmoens! You bring up some good points, and reading the distributed doc was helpful.