Hi, I would like to ask a simple question about how autodiff works for vectors/matrices. For instance, if we have C = A.*B where A, B, C are all matrices, when calculating the Jacobian matrix of C w.r.t. A, does autodiff expand C = A.*B into C_ij = A_ij * B_ij and differentiate element-wise, or does autodiff keep a rule for this operation and directly form the result? Thanks in advance for the help.

PyTorch builds a graph of operations to do during the backward pass.

To use a more Pythonic notation, you can do:

```
import torch

A = torch.randn(5, 5, requires_grad=True)
B = torch.randn(5, 5, requires_grad=True)
C = A @ B
print(C.grad_fn)  # e.g. <MmBackward0 object at 0x...>
```

The `MmBackward` object contains references to A and B (`C.grad_fn._saved_self` and `C.grad_fn._saved_mat2`) as well as the instructions for how to multiply the Jacobian with a vector in the sense of applying the chain rule (try `C.grad_fn(grad_C)` with `grad_C` having the same shape as `C`).
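
Continuing the snippet above, you can probe that node directly. A minimal sketch (calling `grad_fn` by hand is an internal interface, and the order in which the two gradients come back is my assumption, so details may vary across versions):

```
grad_C = torch.ones_like(C)         # some upstream gradient dL/dC
grad_A, grad_B = C.grad_fn(grad_C)  # vector-Jacobian products w.r.t. A and B (assumed order)
# For C = A @ B the tensor-level rules are
#   grad_A = grad_C @ B.T   and   grad_B = A.T @ grad_C
print(torch.allclose(grad_A, grad_C @ B.t()))
print(torch.allclose(grad_B, A.t() @ grad_C))
```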

Best regards

Thomas

Thank you, Thomas.

Just to confirm: autodiff calculates the gradient at the matrix level instead of doing element-wise differentiation (e.g. breaking the Hadamard product and matrix multiplication down into element-wise operations and differentiating those). Is my understanding correct?

There's a tensor-level rule for each core op. If you did per-element differentiation, it would be hard to automatically "reroll" the result back into GPU-efficient operations. Hence, when you add a core-level operation working on tensors, you also add a hand-coded derivative for that op that works on tensors (more specifically, a vector-Jacobian product).
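
To make this concrete, here is a minimal sketch of what such a hand-coded tensor-level rule looks like, using `torch.autograd.Function` for a hypothetical Hadamard-product op (PyTorch's real kernels live in C++, so this is only an illustration):

```
import torch

class Hadamard(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A, B):
        ctx.save_for_backward(A, B)
        return A * B  # element-wise product, computed as one tensor op

    @staticmethod
    def backward(ctx, grad_C):
        A, B = ctx.saved_tensors
        # The vector-Jacobian product is written directly in tensor form,
        # never expanded per element: dL/dA = dL/dC .* B, dL/dB = dL/dC .* A
        return grad_C * B, grad_C * A
```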

You can see the issue of doing per-element differentiation by looking at standard derivative results for matrix operations here. Derivatives of SVD, matrix inverse, and determinant have simple expressions in matrix form, but you can't easily get this form by doing per-element differentiation.
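
For example, the matrix-form rule for the determinant, d det(A)/dA = det(A) * inv(A).T, can be checked against autograd in a few lines (a quick sketch; the tolerance is arbitrary):

```
import torch

A = torch.randn(4, 4, requires_grad=True)
torch.det(A).backward()  # populates A.grad
expected = torch.det(A.detach()) * torch.inverse(A.detach()).t()
print(torch.allclose(A.grad, expected, atol=1e-5))
```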

To add to @Yaroslav_Bulatov's great explanation, the PyTorch derivatives are defined in derivatives.yaml. This is what you see in `t.grad_fn` for tensors computed with gradient requirements.
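
For illustration, an entry in `tools/autograd/derivatives.yaml` looks roughly like this (the exact syntax has changed across PyTorch versions, so treat it as a sketch rather than the current file contents):

```
- name: mul.Tensor(Tensor self, Tensor other) -> Tensor
  self: grad * other
  other: grad * self
```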

For functions not in there, the story is a bit more elaborate: these run under autograd, and PyTorch records the operations, just like it does for your own code. For example, `einsum` is implemented by calling `permute`, `reshape`, and `bmm`, so these are what you'll see when looking at `.grad_fn` or traversing the graph from there (by using `.grad_fn.next_functions` etc. appropriately).
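
Reusing `A` and `B` from the first snippet, a small sketch of such a traversal (following only the first parent at each step; the exact node names you see depend on the PyTorch version):

```
C = torch.einsum('ij,jk->ik', A, B)
node = C.grad_fn
while node is not None:
    print(type(node).__name__)  # e.g. BmmBackward0, ViewBackward0, ..., AccumulateGrad
    parents = [fn for fn, _ in node.next_functions if fn is not None]
    node = parents[0] if parents else None
```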

Best regards

Thomas