Hi, I would like to ask a simple question about how autodiff works for vectors/matrices. For instance, if we have C = A.*B where A, B, C are all matrices, when calculating the Jacobian matrix of C w.r.t. A, does autodiff expand C = A.*B into C_ij = A_ij * B_ij and differentiate element-wise, or does autodiff keep a rule for this operation and directly form the result? Thanks in advance for the help.

PyTorch builds a graph of operations to do during the backward pass.

To use a more Pythonic notation, you can do:

```
import torch

A = torch.randn(5, 5, requires_grad=True)
B = torch.randn(5, 5, requires_grad=True)
C = A @ B
print(C.grad_fn)  # e.g. <MmBackward0 object at 0x...>
```

The `MmBackward` object contains references to A and B (`C.grad_fn._saved_self` and `C.grad_fn._saved_mat2`) as well as the instructions for how to multiply the Jacobian with a vector in the sense of applying the chain rule (try `C.grad_fn(grad_C)` with `grad_C` having the same shape as `C`).
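
Continuing the snippet above, you can probe that node directly. A minimal sketch (calling `grad_fn` by hand is an internal interface, and the order in which the two gradients come back is my assumption, so details may vary across versions):

```
grad_C = torch.ones_like(C)         # some upstream gradient dL/dC
grad_A, grad_B = C.grad_fn(grad_C)  # vector-Jacobian products w.r.t. A and B (assumed order)
# For C = A @ B the tensor-level rules are
#   grad_A = grad_C @ B.T   and   grad_B = A.T @ grad_C
print(torch.allclose(grad_A, grad_C @ B.t()))
print(torch.allclose(grad_B, A.t() @ grad_C))
```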

Best regards

Thomas

Thank you, Thomas.

Just to confirm: autodiff calculates the gradient at the matrix level instead of doing element-wise differentiation (e.g. breaking the Hadamard product and matrix multiplication down into element-wise operations and differentiating those). Is my understanding correct?

There's a tensor-level rule for each core op. If you did per-element differentiation, it would be hard to automatically "reroll" the result back into GPU-efficient operations. Hence, when you add a core-level operation working on tensors, you also add a hand-coded derivative for that op that works on tensors (more specifically, a vector-Jacobian product).
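
To make this concrete, here is a minimal sketch of what such a hand-coded tensor-level rule looks like, using `torch.autograd.Function` for a hypothetical Hadamard-product op (PyTorch's real kernels live in C++, so this is only an illustration):

```
import torch

class Hadamard(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A, B):
        ctx.save_for_backward(A, B)
        return A * B  # element-wise product, computed as one tensor op

    @staticmethod
    def backward(ctx, grad_C):
        A, B = ctx.saved_tensors
        # The vector-Jacobian product is written directly in tensor form,
        # never expanded per element: dL/dA = dL/dC .* B, dL/dB = dL/dC .* A
        return grad_C * B, grad_C * A
```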

You can see the issue of doing per-element differentiation by looking at standard derivative results for matrix operations here. Derivatives of SVD, matrix inverse, and determinant have simple expressions in matrix form, but you can't easily get this form by doing per-element differentiation.
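
For example, the matrix-form rule for the determinant, d det(A)/dA = det(A) * inv(A).T, can be checked against autograd in a few lines (a quick sketch; the tolerance is arbitrary):

```
import torch

A = torch.randn(4, 4, requires_grad=True)
torch.det(A).backward()  # populates A.grad
expected = torch.det(A.detach()) * torch.inverse(A.detach()).t()
print(torch.allclose(A.grad, expected, atol=1e-5))
```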

To add to @Yaroslav_Bulatov's great explanation, the PyTorch derivatives are defined in derivatives.yaml. This is what you see in `t.grad_fn` for tensors computed with gradient requirements.
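
For illustration, an entry in `tools/autograd/derivatives.yaml` looks roughly like this (the exact syntax has changed across PyTorch versions, so treat it as a sketch rather than the current file contents):

```
- name: mul.Tensor(Tensor self, Tensor other) -> Tensor
  self: grad * other
  other: grad * self
```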

For functions not in there, the story is a bit more elaborate: these run under autograd, and PyTorch records the operations, just like it does for your own code. For example, `einsum` is implemented by calling `permute`, `reshape`, and `bmm`, so these are what you'll see when looking at `.grad_fn` or traversing the graph from there (by using `.grad_fn.next_functions` etc. appropriately).
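
Reusing `A` and `B` from the first snippet, a small sketch of such a traversal (following only the first parent at each step; the exact node names you see depend on the PyTorch version):

```
C = torch.einsum('ij,jk->ik', A, B)
node = C.grad_fn
while node is not None:
    print(type(node).__name__)  # e.g. BmmBackward0, ViewBackward0, ..., AccumulateGrad
    parents = [fn for fn, _ in node.next_functions if fn is not None]
    node = parents[0] if parents else None
```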

Best regards

Thomas