Multiprocessing total loss and best practices

I am trying to run a simple regression example with PyTorch multiprocessing. I am following the example here: Multiprocessing best practices — PyTorch 1.10.0 documentation

However, a few things are unclear to me. On the example page it is written:

    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters
  1. What is meant by "update the shared parameters"?
  2. Do they update on individual processes? Or, since the model is shared (model.share_memory() achieves that, right?), does it update the shared copy of the parameters?
  3. Model parameters must be shared by default, right? The note says: "If torch.Tensor.grad is not None, it is also shared."
  4. If it updates the shared copy, is the backward call computing the loss on a single thread or the sum of losses from all threads? If it is the total loss, how can I print it?
  5. If the model contains layers with conditional branches, how will the optimizer update the parameters?
  6. Does each process save its own backward graph? If yes, how are the parameters updated?

Sorry for the question dump, but I couldn't find proper answers, as tutorials and documentation on this are rather scarce.

Regarding 1 & 2: yes, the shared parameters will be updated. The call to model.share_memory() moves the parameter tensors into shared memory, so each process is actually operating on the same model.
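
A minimal sketch of that pattern, assuming a toy nn.Linear model and random data in place of a real DataLoader (the train function and shapes here are made up for illustration, not the tutorial's exact code):

    import torch
    import torch.nn as nn
    import torch.multiprocessing as mp

    def train(rank, model):
        # each process has its own optimizer, data and autograd graph,
        # but model.parameters() point at the shared tensors
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()
        for step in range(100):
            x = torch.randn(32, 10)          # stand-in for a real DataLoader
            y = x.sum(dim=1, keepdim=True)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                  # local backward pass
            optimizer.step()                 # writes into the shared parameters

    if __name__ == "__main__":
        model = nn.Linear(10, 1)
        model.share_memory()                 # note: share_memory(), not shared_memory()
        assert all(p.is_shared() for p in model.parameters())

        processes = []
        for rank in range(4):
            p = mp.Process(target=train, args=(rank, model))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

After share_memory() every parameter reports is_shared() == True, so every optimizer.step() in any worker writes into the same tensors the other workers read.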

  3. The wording is probably not the best, but that part means the grad tensor is also shared as long as grad is not None. If you have called model.share_memory() as the example does, the model parameters will always be shared.
  4. There aren't any application-level threads here; the loss is computed per process, so it can be different in each process. It is not a total loss. A small sketch of collecting and printing per-process losses follows after this list.
  5. If the model has conditional branches, autograd handles this automatically by constructing the backward graph on the fly during the forward pass (see the branching example after this list).
  6. Backward is run locally in each process; there is no inter-process coordination. The .grad field feeds updates into the shared parameters from every process, which can lead to some processes stepping on each other's updates, but that is a consequence of Hogwild training.
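
To make 4 concrete: each worker only ever sees its own loss, so you either print it tagged with the worker's rank or ship it back to the parent and sum it there. A minimal sketch under that assumption (the worker function and the random "loss" are placeholders for the real training step):

    import torch
    import torch.multiprocessing as mp

    def worker(rank, loss_queue):
        # placeholder for one training iteration; in the real loop this
        # value would be loss_fn(model(data), labels)
        loss = torch.rand(1)
        print(f"rank {rank}: local loss {loss.item():.4f}")  # per-process loss
        loss_queue.put(loss.item())                           # report to the parent

    if __name__ == "__main__":
        num_procs = 4
        queue = mp.SimpleQueue()
        procs = [mp.Process(target=worker, args=(r, queue)) for r in range(num_procs)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # a "total" loss only exists if you build it yourself from the reports
        total = sum(queue.get() for _ in range(num_procs))
        print(f"sum over processes: {total:.4f}")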
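
And for 5, here is a tiny made-up module with a data-dependent branch. Only the branch that actually ran in a given forward pass gets gradients; parameters whose .grad is still None are simply skipped by optimizer.step():

    import torch
    import torch.nn as nn

    class Branchy(nn.Module):
        # hypothetical model: which linear layer runs depends on the input
        def __init__(self):
            super().__init__()
            self.a = nn.Linear(10, 1)
            self.b = nn.Linear(10, 1)

        def forward(self, x):
            if x.sum() > 0:        # data-dependent branch
                return self.a(x)
            return self.b(x)

    model = Branchy()
    x = torch.randn(4, 10)
    model(x).sum().backward()
    # only the branch taken in this forward pass has gradients; the other
    # branch's parameters keep grad == None and are left untouched
    print(model.a.weight.grad is not None, model.b.weight.grad is not None)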