I am trying to reproduce, in PyTorch, the results reported in the paper Decoupled Neural Interfaces using Synthetic Gradients. The key idea in this paper is to approximate the gradient of the loss using a synthetic gradient, `g_k`, so that:

`d L / d theta_k ~ g_k d h_k / d theta_k`
After calculating the gradient of the loss, I need to update the weights of the upstream layers. I am currently doing it this way:
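```python
output.backward(gradient_of_the_loss)
```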
But I am not sure whether this is correct, and the documentation is not that helpful.
Yes, this will be correct, assuming that `output` is the `h_k` and `gradient_of_the_loss` is the `g_k` in the expression you posted.
(One quibble about your expression: The index `k` shouldn't be repeated in `d h_k / d theta_k`. I would have written `d L / d theta_l ~ g_k d h_k / d theta_l`, with an implied sum over the index `k`.)
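For reference, in LaTeX form with the sum made explicit, this is just the same expression:

```latex
\frac{\partial L}{\partial \theta_l} \approx \sum_k g_k \, \frac{\partial h_k}{\partial \theta_l},
\qquad \text{where } g_k \approx \frac{\partial L}{\partial h_k}.
```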
Thank you Frank for the response!
The `output` is `h_k`, but by `gradient_of_the_loss` I meant the right-hand side of the expression, i.e., `g_k d h_k / d theta_k`.
Because I think the `backward` method needs the gradient tensor as an argument, which is built by multiplying the synthetic gradient and the derivatives of the activations w.r.t. the weights.
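A minimal sketch of how `backward` treats its `gradient` argument (the tensor names below are illustrative, not from the paper):

```python
import torch

# Toy upstream computation: h = x @ theta plays the role of h_k.
theta = torch.randn(3, 4, requires_grad=True)
x = torch.randn(5, 3)
h = x @ theta

# Stand-in for a synthetic gradient, g_k ~ dL/dh_k.
g = torch.randn_like(h)

# backward(g) computes the vector-Jacobian product internally, i.e.
# autograd itself multiplies g_k by dh_k/dtheta; the caller only
# supplies g, not g * dh/dtheta.
h.backward(g)

print(theta.grad.shape)  # torch.Size([3, 4]), same shape as theta
```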
I don’t understand what you are saying here.
With the exception of the duplicated index that I assume is a typo (see my previous post), I think that the `output.backward(gradient_of_the_loss)` call in your original post is correct.
The way I read the image of the equation in your original post, `L` is the scalar loss computed by applying a loss function to the predictions of your model, and the `theta_l` are the “upstream” parameters of that model. (I use the index `l` rather than `k` to fix the duplicated-index typo.)

The `h_k` are the output of your model, that is, its predictions.

`d L / d h_k` is the true gradient of the loss function with respect to the predictions, and `g_k` is the “synthetic” gradient that approximates the true gradient.
The equation you posted is just the (multi-dimensional) chain rule for computing the gradient of `L` with respect to the `theta_l`.
Assuming that by `gradient_of_the_loss` you mean either `d L / d h_k` (where, again, the `h_k` are the predictions made by the model) or its approximate, synthetic version, `g_k`, then whether you compute `gradient_of_the_loss` as the true or as the synthetic gradient, and whether you compute it using autograd or by some other means, the way you get autograd to use the chain rule to complete the computation of `d L / d theta_l` is precisely by calling:
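```python
output.backward(gradient_of_the_loss)
```

Put together, a minimal sketch of this update in a DNI-style setup (the module names, shapes, and synthetic-gradient predictor below are illustrative assumptions, and training of the synthetic-gradient module itself is omitted):

```python
import torch
import torch.nn as nn

# Upstream layer whose parameters (the theta_l) we want to update.
layer = nn.Linear(10, 20)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

# Small module that predicts the synthetic gradient g_k from h_k.
synth_grad_model = nn.Linear(20, 20)

x = torch.randn(8, 10)
output = layer(x)  # h_k, shape (8, 20)

# g_k ~ dL/dh_k; detached so this backward pass only reaches `layer`.
gradient_of_the_loss = synth_grad_model(output.detach()).detach()

opt.zero_grad()
# autograd applies the chain rule: dL/dtheta_l ~ sum_k g_k * dh_k/dtheta_l
output.backward(gradient_of_the_loss)
opt.step()
```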