According to the documentation - “Backward - … should return as many Variables as there were inputs, with each of them containing the gradient w.r.t. its corresponding input.”
What if I have multiple outputs? Can I implement a different partial derivative for each output?
You can only call backward on a single scalar value at a time. Most of the time, this value represents a distance between your vector of outputs and the expected targets.
The thing is, you don’t want to compute the derivative of your outputs, but the derivative of an error function of your outputs.
Of course, you may have several different errors (as in GANs), or maybe a vector of distances (I never saw it, but why not?). In that case, the common way is to call backward and make one step of gradient descent for each scalar component, one at a time.
Thanks for the reply, though you misinterpreted my question.
I was referring to torch.autograd.Function.backward(), that is, the backward() method of a new class extending torch.autograd.Function (see the link from the original post).
In that case, the example proposed in the doc works for a single Tensor (with several values), but may not work if you have multiple tensors with different dimensions.
But it’s a matter of how you design your function. If you want a function that outputs different things, you can create one function for each output, and one backward method for each of them. Then, you combine all the ‘sub-functions’ together in a Module:
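A minimal sketch of that design (not from the thread; the sub-functions ScaleByTwo and Square and the module TwoHeads are made-up toy examples), where each output gets its own Function with its own backward, and a Module combines them:

```python
import torch
from torch.autograd import Function

class ScaleByTwo(Function):
    # toy sub-function: y = 2x, so dy/dx = 2
    @staticmethod
    def forward(ctx, x):
        return 2 * x

    @staticmethod
    def backward(ctx, grad_output):
        return 2 * grad_output

class Square(Function):
    # toy sub-function: y = x^2, so dy/dx = 2x
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return 2 * x * grad_output

class TwoHeads(torch.nn.Module):
    # combine the sub-functions: one backward per output
    def forward(self, x):
        return ScaleByTwo.apply(x), Square.apply(x)

x = torch.tensor([3.0], requires_grad=True)
a, b = TwoHeads()(x)
(a + b).sum().backward()
# x.grad = 2 + 2*3 = 8
```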
Yeah, that makes sense. Though in my case the two forwards are identical, so this would require, I suppose, strange workarounds to avoid redundant computation. I was hoping for a more built-in solution.
By the way, the context is a nearest-embed layer for the VQ-VAE model.
I actually need two identical outputs: one with the encoder input detached (stop-gradient) and one with the dictionary input detached.
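For reference, that stop-gradient pair can often be built without a custom Function at all, using .detach(). A sketch under assumed names (z_e for the encoder output, emb for the codebook; the function name is hypothetical), doing the nearest-neighbour lookup once and returning two numerically identical outputs with different graphs:

```python
import torch

def nearest_embed_two_views(z_e, emb):
    # z_e: (N, D) encoder outputs; emb: (K, D) codebook (assumed shapes)
    with torch.no_grad():
        idx = torch.cdist(z_e, emb).argmin(dim=1)  # nearest code per vector
    z_q = emb[idx]  # depends on emb only (idx was computed under no_grad)
    # view 1: straight-through, gradients flow to the encoder only
    to_decoder = z_e + (z_q - z_e).detach()
    # view 2: same values, gradients flow to the codebook only
    to_codebook = z_q
    return to_decoder, to_codebook
```

Both outputs hold the same values, but backward through the first reaches only z_e, and backward through the second reaches only emb.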
Then, why don’t you stack your two outputs? You can treat them as a single tensor, and just split the gradient inside the backward in order to treat the two parts differently:
The difference between the two outputs is the computational graph I want attached to them. I don’t think the concatenation achieves this effect.
Either way, if there isn’t a straightforward method of achieving this, I’ll keep the double forward pass for now…
Yeah, a very nice concept with intriguing results. Hoping my implementation will get close to that.