bluehood
(Enrico Guiraud)
June 6, 2019, 1:54pm
1
Hi,
similar topic to this question: do optimizers work transparently in multi-process runs, or do I need to average the gradients of each process manually?
The ImageNet example in the pytorch/examples repo does not do explicit gradient averaging between processes, but the example on distributed training in PyTorch's tutorials does.
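(For reference, the explicit averaging in that tutorial is done per parameter with an all-reduce, roughly along these lines — a sketch, assuming the process group is already initialized:

import torch.distributed as dist

def average_gradients(model):
    # Sum each parameter's gradient across all processes, then divide by the
    # world size so every process ends up holding the average gradient.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size
)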
Thanks a lot!
Enrico
I have a similar question here. I simultaneously opened a query in pytorch/fairseq#779, to which the response was that there is built-in averaging.
How about trying some black-box experiments to figure it out? For example, the sketch below.
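A minimal two-process check on CPU (a sketch using the gloo backend; the one-parameter linear model and the localhost rendezvous are just assumptions for illustration). If averaging is built in, both ranks should print the same value, namely the mean of the per-rank gradients:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(1, 1, bias=False))
    x = torch.full((1, 1), float(rank + 1))   # rank 0 feeds 1.0, rank 1 feeds 2.0
    model(x).sum().backward()
    # Per-rank gradients would be 1.0 and 2.0; built-in averaging should yield 1.5 on both ranks.
    print(rank, model.module.weight.grad.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)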
pietern
(Pieter Noordhuis)
June 24, 2019, 6:28am
4
If you use vanilla multiprocessing, you'll have to do this yourself. If you use it in combination with torch.nn.parallel.DistributedDataParallel, then gradient synchronization and averaging are done for you. Also see the documentation on torch.distributed for more information.
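In other words, with DDP a plain training loop already produces averaged gradients on every rank, and nothing extra is needed before the optimizer step. A minimal sketch, assuming the process group is initialized and that model, optimizer, loss_fn, and loader are defined elsewhere:

from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model)            # registers hooks that all-reduce gradients during backward
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(inputs), targets)
    loss.backward()               # gradients are synchronized and averaged across processes here
    optimizer.step()              # every process applies the same averaged gradients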
@pietern Can you show me in the source where the average is done? For the life of me, I've been all over the codebase and I can't find it.
I'm looking here:
                    raise RuntimeError("grad can be implicitly created only for scalar outputs")
                new_grads.append(torch.ones_like(out))
            else:
                new_grads.append(None)
        else:
            raise TypeError("gradients can be either Tensors or None, but got " +
                            type(grad).__name__)
    return tuple(new_grads)


def backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None):
    r"""Computes the sum of gradients of given tensors w.r.t. graph leaves.

    The graph is differentiated using the chain rule. If any of ``tensors``
    are non-scalar (i.e. their data has more than one element) and require
    gradient, then the Jacobian-vector product would be computed, in this
    case the function additionally requires specifying ``grad_tensors``.
    It should be a sequence of matching length, that contains the "vector"
    in the Jacobian-vector product, usually the gradient of the differentiated
    function w.r.t. corresponding tensors (``None`` is an acceptable value for
    all tensors that don't need gradient tensors).
and
  // backwards threads hold a lock, we'll probably deadlock in the engine
  // destructor.
  if (_reinitialize_engine) {
    engine.~PythonEngine();
    new (&engine) torch::autograd::python::PythonEngine();
    _reinitialize_engine = false;
  }
}

// Implementation of torch._C._EngineBase.run_backward
PyObject *THPEngine_run_backward(THPEngine *self, PyObject *args, PyObject *kwargs)
{
  HANDLE_TH_ERRORS
  _maybe_reinitialize_engine_after_fork();
  PyObject *tensors = nullptr;
  PyObject *grad_tensors = nullptr;
  unsigned char keep_graph = 0;
  unsigned char create_graph = 0;
  PyObject *inputs = nullptr;
  unsigned char allow_unreachable = 0;
  const char *accepted_kwargs[] = {
pietern
(Pieter Noordhuis)
September 10, 2019, 11:54am
6
@makslevental It’s not done in the autograd code but in the DDP reducer code: