I have recently become interested in the more practical side of automatic differentiation. However, I have realized that naively implementing composite operations in terms of elementary operations can be inefficient compared to directly coding up the derivative.
So for example the matrix product:
a = Wx
is in essence a combination of multiplications and summations. Hence we could in principle apply reverse-mode autodiff to this elementwise decomposition to compute the required derivatives. But this would make the graph very big and inefficient, since the Jacobian of a w.r.t. x is simply W, so the chain-rule step can be applied directly by hardcoding the derivative of the matrix product.
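To make the point concrete, here is a minimal sketch (my own illustration, not any library's internals) of what a "hardcoded" backward rule for `a = W @ x` looks like: the vector-Jacobian products are closed-form matrix expressions, so no graph of individual multiplies and adds is ever built. The check against a finite difference assumes the scalar loss `L = g . (W @ x)`.

```python
import numpy as np

# Forward: a = W @ x, with upstream gradient g = dL/da.
# Hardcoded vector-Jacobian products:
#   dL/dx = W.T @ g        dL/dW = outer(g, x)
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
g = rng.standard_normal(3)          # upstream gradient dL/da

grad_x = W.T @ g                    # hardcoded VJP w.r.t. x
grad_W = np.outer(g, x)             # hardcoded VJP w.r.t. W

# Sanity check of dL/dx by finite differences, with L = g . (W @ x)
eps = 1e-6
fd = np.array([
    (g @ (W @ (x + eps * np.eye(4)[i])) - g @ (W @ x)) / eps
    for i in range(4)
])
print(np.allclose(grad_x, fd, atol=1e-4))  # True: closed form matches
```

This is exactly the kind of rule that, as I understand it, frameworks register per primitive operation instead of differentiating through the scalar arithmetic.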
My question is: does autograd take this into account for common operations like the one I have exemplified? I guess it does, but I would like to be certain about how a state-of-the-art toolkit handles these things.
A more concrete follow-up: how does it handle batched matrix operations? Is the gradient hardcoded, or does it basically expand/copy the matrices involved? (I guess it is the latter.)
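For the batched case, here is a hypothetical sketch of what a hardcoded rule could look like (again my own illustration, not a claim about any specific framework): when the same W is applied to a batch of inputs, the backward pass can reduce over the batch dimension in one matrix product rather than materializing a copy of W per batch element.

```python
import numpy as np

# Batched forward: a[b] = W @ x[b] for each batch element b.
# A hardcoded backward sums over the batch instead of copying W:
#   dL/dW = sum_b outer(g[b], x[b]) = G.T @ X
rng = np.random.default_rng(1)
B, n, m = 5, 3, 4
W = rng.standard_normal((n, m))
X = rng.standard_normal((B, m))     # batch of inputs
G = rng.standard_normal((B, n))     # upstream gradients, one per batch item

grad_W = G.T @ X                    # single reduction, no expanded copies of W
grad_W_naive = sum(np.outer(G[b], X[b]) for b in range(B))
print(np.allclose(grad_W, grad_W_naive))  # True
```

If this is roughly what real toolkits do, then the answer to my question would be "hardcoded reduction" rather than "expand/copy", but I would like confirmation.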
Any pointers to resources that explain this?