bluehood
(Enrico Guiraud)
June 6, 2019, 1:54pm
1
Hi,
similar topic to this question: do optimizers work transparently in multi-process runs, or do I need to average the gradients of each process manually?
The ImageNet example in the pytorch/examples repo does not do explicit gradient averaging between processes, but the example on distributed training in PyTorch's tutorials does.
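(For reference, the explicit averaging in that tutorial is done per parameter with an all-reduce, roughly along these lines — a sketch, assuming the process group is already initialized:

import torch.distributed as dist

def average_gradients(model):
    # Sum each parameter's gradient across all processes, then divide by the
    # world size so every process ends up holding the average gradient.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size
)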
Thanks a lot!
Enrico
I have a similar question here. I simultaneously opened a query in pytorch/fairseq#779, to which the response was that there is built-in averaging.
How about trying some black-box experiments to figure it out? For example, the sketch below.
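A minimal two-process check on CPU (a sketch using the gloo backend; the one-parameter linear model and the localhost rendezvous are just assumptions for illustration). If averaging is built in, both ranks should print the same value, namely the mean of the per-rank gradients:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(1, 1, bias=False))
    x = torch.full((1, 1), float(rank + 1))   # rank 0 feeds 1.0, rank 1 feeds 2.0
    model(x).sum().backward()
    # Per-rank gradients would be 1.0 and 2.0; built-in averaging should yield 1.5 on both ranks.
    print(rank, model.module.weight.grad.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)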
pietern
(Pieter Noordhuis)
June 24, 2019, 6:28am
4
If you use vanilla multiprocessing, you'll have to do this yourself. If you use it in combination with torch.nn.parallel.DistributedDataParallel, then gradient synchronization and averaging are done for you. Also see the documentation on torch.distributed for more information.
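In other words, with DDP a plain training loop already produces averaged gradients on every rank, and nothing extra is needed before the optimizer step. A minimal sketch, assuming the process group is initialized and that model, optimizer, loss_fn, and loader are defined elsewhere:

from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model)            # registers hooks that all-reduce gradients during backward
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(inputs), targets)
    loss.backward()               # gradients are synchronized and averaged across processes here
    optimizer.step()              # every process applies the same averaged gradients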
@pietern Can you show me in the source where the average is done? For the life of me, I've been all over the codebase and I can't find it.
I'm looking here:
                    raise RuntimeError("grad can be implicitly created only for scalar outputs")
                new_grads.append(torch.ones_like(out))
            else:
                new_grads.append(None)
        else:
            raise TypeError("gradients can be either Tensors or None, but got " +
                            type(grad).__name__)
    return tuple(new_grads)


def backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None):
    r"""Computes the sum of gradients of given tensors w.r.t. graph leaves.

    The graph is differentiated using the chain rule. If any of ``tensors``
    are non-scalar (i.e. their data has more than one element) and require
    gradient, then the Jacobian-vector product would be computed, in this
    case the function additionally requires specifying ``grad_tensors``.
    It should be a sequence of matching length, that contains the "vector"
    in the Jacobian-vector product, usually the gradient of the differentiated
    function w.r.t. corresponding tensors (``None`` is an acceptable value for
    all tensors that don't need gradient tensors).
and
  // backwards threads hold a lock, we'll probably deadlock in the engine
  // destructor.
  if (_reinitialize_engine) {
    engine.~PythonEngine();
    new (&engine) torch::autograd::python::PythonEngine();
    _reinitialize_engine = false;
  }
}

// Implementation of torch._C._EngineBase.run_backward
PyObject *THPEngine_run_backward(THPEngine *self, PyObject *args, PyObject *kwargs)
{
  HANDLE_TH_ERRORS
  _maybe_reinitialize_engine_after_fork();
  PyObject *tensors = nullptr;
  PyObject *grad_tensors = nullptr;
  unsigned char keep_graph = 0;
  unsigned char create_graph = 0;
  PyObject *inputs = nullptr;
  unsigned char allow_unreachable = 0;
  const char *accepted_kwargs[] = {
pietern
(Pieter Noordhuis)
September 10, 2019, 11:54am
6
@makslevental It’s not done in the autograd code but in the DDP reducer code: