How to properly use PyTorch multiprocessing to accelerate model training?

I am trying to accelerate my model training by using multiple CPU cores. My code looks like this:

import torch.multiprocessing as mp

def train(model, rank):
    for it, batch in enumerate(data_loader):
        optimizer.zero_grad()
        batch_loss = loss(model(batch))   # loss is the criterion; keep the returned tensor
        batch_loss.backward()
        print("iter: {} at core: {}, current loss: {:10.4f}".format(
            it, rank, batch_loss.item() / len(batch)), flush=True)
        optimizer.step()

model = nnet()  # nnet inherits from nn.Module
num_processes = 3
model.share_memory()
processes = []
for rank in range(num_processes):
    p = mp.Process(target=train, args=(model, rank))
    p.start()
    processes.append(p)
for p in processes:
    p.join()

and the output looks like this (whether I set num_workers = 0 or 3 in the DataLoader):

iter: 0 at core: 0, current loss: 144.0524
iter: 0 at core: 1, current loss: 144.0524
iter: 0 at core: 2, current loss: 144.0524
iter: 1 at core: 0, current loss: 137.5519
iter: 1 at core: 1, current loss: 137.6043
iter: 1 at core: 2, current loss: 137.6081
iter: 2 at core: 0, current loss: 131.0690
iter: 2 at core: 1, current loss: 131.0943
iter: 2 at core: 2, current loss: 131.0782

It seems that the processes train the model independently (each core does its own backpropagation) instead of combining the gradients from the other sub-processes at each iteration, even though the parameter values at the start of each loss calculation are the same. As a result, the training time does not decrease.
I would like to know how to properly use multiple CPUs to accelerate training in PyTorch.

Thank you.

Edit: Even worse, multiprocessing updates the parameter values incorrectly. For example, suppose core 1 and core 2 both use parameters p1 to calculate the loss and gradients. When core 1 finishes and updates the parameters to p2, core 2 then applies its update on top of p2 instead of p1, even though core 2 computed its loss and gradients with p1. In addition, the cores use exactly the same data from the DataLoader in every iteration to calculate the loss and gradients (this cannot be fixed simply by randomly sampling the data, because each core also uses the same random seed).
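(For what it's worth, here is a rough sketch of how I could at least decorrelate the workers' data, assuming a map-style dataset: seed each process by its rank and give it its own shard via DistributedSampler. The helper name make_loader, the base seed 1234, and the batch size are just placeholders I made up; this does not fix the parameter-update race described above.)

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, rank, world_size, batch_size=32):
    # Give every process its own random seed so they stop drawing identical batches.
    torch.manual_seed(1234 + rank)
    # Give every process a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank,
                                 shuffle=True, seed=1234)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)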


I would say there is no such mechanism already implemented.
The tool you are looking for is DataParallel, but that is implemented only for GPU parallelization.

You can try to dig into that code to see whether you can manually emulate it for the CPU.
I think the standard pipeline is to compute a forward-backward pass in each worker, average the gradients, and then call optimizer.step().
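A minimal sketch of that pipeline on CPU, assuming the gloo backend of torch.distributed (nnet is the model class from the question; the loss function, optimizer, port, my_dataset, and the make_loader helper are placeholders):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train(rank, world_size, dataset):
    # All processes join one process group over the CPU-friendly gloo backend.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nnet()
    # Make sure every replica starts from identical parameters.
    for p in model.parameters():
        dist.broadcast(p.data, src=0)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()                       # placeholder criterion
    loader = make_loader(dataset, rank, world_size)    # hypothetical per-rank shard of the data

    for it, (x, y) in enumerate(loader):
        optimizer.zero_grad()
        batch_loss = loss_fn(model(x), y)
        batch_loss.backward()
        # Average the gradients across processes so every replica applies the same update.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 3
    mp.spawn(train, args=(world_size, my_dataset), nprocs=world_size)  # my_dataset: your Dataset instance

Alternatively, DistributedDataParallel also supports the gloo backend on CPU and does this gradient averaging for you.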