Adam with multiprocessing

I am following the Hogwild example from multiprocessing best practices:

import torch.multiprocessing as mp
from model import MyModel

def train(model):
    # Construct data_loader, optimizer, etc.
    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters

if __name__ == '__main__':
    num_processes = 4
    model = MyModel()
    # NOTE: this is required for the ``fork`` method to work
    model.share_memory()
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

and I’ve adapted it for my reinforcement learning code. I’ve found that the way I construct my optimizer determines whether or not the model learns correctly.

What works: I define optim = Adam(model.parameters(), lr=.005) in the main process and pass it into train when creating the processes, i.e. p = mp.Process(target=train, args=(model, optim)). Then, at the start of train, I make a copy of optim local to each subprocess, i.e. optim = deepcopy(optim).
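Here is a minimal sketch of this working arrangement (data_loader and loss_fn stand in for my actual RL data pipeline and loss, and the hyperparameters are just placeholders):

from copy import deepcopy
import torch.multiprocessing as mp
from torch.optim import Adam
from model import MyModel

def train(model, optim):
    # Give each worker its own local copy of the optimizer (and thus its own
    # Adam state), while the model parameters themselves stay in shared memory.
    optim = deepcopy(optim)
    # Construct data_loader, loss_fn, etc. as in the example above.
    for data, labels in data_loader:
        optim.zero_grad()
        loss_fn(model(data), labels).backward()
        optim.step()

if __name__ == '__main__':
    num_processes = 4
    model = MyModel()
    model.share_memory()
    # Optimizer is constructed once in the main process and passed to the workers.
    optim = Adam(model.parameters(), lr=.005)
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(model, optim))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()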

What doesn’t work: I define optim = Adam(model.parameters(), lr=.005) directly inside train, as sketched below. Note: when I do this with SGD, the model does learn!
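For comparison, the failing variant only changes where the optimizer is built; everything else stays as in the sketch above, except that optim is no longer passed to train:

def train(model):
    # Constructing Adam here, inside each worker, is the variant that does not learn;
    # replacing Adam with SGD in this same spot does learn.
    optim = Adam(model.parameters(), lr=.005)
    for data, labels in data_loader:
        optim.zero_grad()
        loss_fn(model(data), labels).backward()
        optim.step()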

Is there an obvious explanation as to why Adam doesn’t work in the second approach?

Cc @VitalyFedyunin, wondering if you have any idea, thanks!