Hi, I would like to ask a simple question about how autodiff works for vectors/matrices. For instance, if we have C = A.*B where A, B, C are all matrices, when calculating the Jacobian matrix of C w.r.t. A, does autodiff expand C = A.*B into C_ij = A_ij * B_ij and differentiate element-wise, or does autodiff keep a rule for this operation and directly form the result? Thanks in advance for the help.

PyTorch builds a graph of operations to do during the backward pass.

To use a more Pythonic notation, you can do:

```
import torch

A = torch.randn(5, 5, requires_grad=True)
B = torch.randn(5, 5, requires_grad=True)
C = A @ B
print(C.grad_fn)  # e.g. <MmBackward0 object at 0x...>
```

The `MmBackward` object contains references to A and B (`C.grad_fn._saved_self` and `C.grad_fn._saved_mat2`) as well as the instructions for how to multiply the Jacobian with a vector in the sense of applying the chain rule (try `C.grad_fn(grad_C)` with `grad_C` having the same shape as `C`).
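
Continuing the snippet above, you can probe that node directly. A minimal sketch (calling `grad_fn` by hand is an internal interface, and the order in which the two gradients come back is my assumption, so details may vary across versions):

```
grad_C = torch.ones_like(C)         # some upstream gradient dL/dC
grad_A, grad_B = C.grad_fn(grad_C)  # vector-Jacobian products w.r.t. A and B (assumed order)
# For C = A @ B the tensor-level rules are
#   grad_A = grad_C @ B.T   and   grad_B = A.T @ grad_C
print(torch.allclose(grad_A, grad_C @ B.t()))
print(torch.allclose(grad_B, A.t() @ grad_C))
```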

Best regards

Thomas

Thank you, Thomas.

Just to confirm: autodiff calculates the gradient at the matrix level instead of doing element-wise differentiation (e.g. breaking the Hadamard product and matrix multiplication down into element-wise operations and differentiating those). Is my understanding correct?

There's a tensor-level rule for each core op. If you did per-element differentiation, it would be hard to automatically "reroll" the result back into GPU-efficient operations. Hence, when you add a core-level operation working on tensors, you also add a hand-coded derivative for that op that works on tensors (more specifically, a vector-Jacobian product).
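
To make this concrete, here is a minimal sketch of what such a hand-coded tensor-level rule looks like, using `torch.autograd.Function` for a hypothetical Hadamard-product op (PyTorch's real kernels live in C++, so this is only an illustration):

```
import torch

class Hadamard(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A, B):
        ctx.save_for_backward(A, B)
        return A * B  # element-wise product, computed as one tensor op

    @staticmethod
    def backward(ctx, grad_C):
        A, B = ctx.saved_tensors
        # The vector-Jacobian product is written directly in tensor form,
        # never expanded per element: dL/dA = dL/dC .* B, dL/dB = dL/dC .* A
        return grad_C * B, grad_C * A
```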

You can see the issue of doing per-element differentiation by looking at standard derivative results for matrix operations here. Derivatives of SVD, matrix inverse, and determinant have simple expressions in matrix form, but you can't easily get this form by doing per-element differentiation.
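
For example, the matrix-form rule for the determinant, d det(A)/dA = det(A) * inv(A).T, can be checked against autograd in a few lines (a quick sketch; the tolerance is arbitrary):

```
import torch

A = torch.randn(4, 4, requires_grad=True)
torch.det(A).backward()  # populates A.grad
expected = torch.det(A.detach()) * torch.inverse(A.detach()).t()
print(torch.allclose(A.grad, expected, atol=1e-5))
```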

To add to @Yaroslav_Bulatov's great explanation, the PyTorch derivatives are defined in derivatives.yaml. This is what you see in `t.grad_fn` for tensors computed with gradient requirements.
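
For illustration, an entry in `tools/autograd/derivatives.yaml` looks roughly like this (the exact syntax has changed across PyTorch versions, so treat it as a sketch rather than the current file contents):

```
- name: mul.Tensor(Tensor self, Tensor other) -> Tensor
  self: grad * other
  other: grad * self
```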

For functions not in there, the story is a bit more elaborate: these run under autograd, and PyTorch records the operations, just like it does for your own code. For example, `einsum` is implemented by calling `permute`, `reshape`, and `bmm`, so these are what you'll see when looking at `.grad_fn` or traversing the graph from there (by using `.grad_fn.next_functions` etc. appropriately).
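
Reusing `A` and `B` from the first snippet, a small sketch of such a traversal (following only the first parent at each step; the exact node names you see depend on the PyTorch version):

```
C = torch.einsum('ij,jk->ik', A, B)
node = C.grad_fn
while node is not None:
    print(type(node).__name__)  # e.g. BmmBackward0, ViewBackward0, ..., AccumulateGrad
    parents = [fn for fn, _ in node.next_functions if fn is not None]
    node = parents[0] if parents else None
```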

Best regards

Thomas