At a high level, how exactly does the implementation of higher order derivs work? My understanding is that you write down your layer operation in terms of simpler operations which are passed to autograd, which is a sort of black box that computes gradients. But then why do we need to explicitly write down formulas for second derivatives (e.g. https://github.com/pytorch/pytorch/blob/master/torch/nn/_functions/thnn/auto_double_backwards.py)? And how are the higher order derivatives computed?
The basic idea is to switch the backward of a function from operating on plain Tensors to working with Variables. You can see this in the files in the autograd/_functions directory: because the backward pass is itself expressed in Variable operations, autograd keeps track of them, and the second derivative is just the derivative of the derivative, as it should be.
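To make the "derivative of the derivative" idea concrete, here is a toy scalar autograd sketch (not PyTorch's actual implementation; all names here are made up for illustration). The key point it demonstrates is that the backward pass builds its gradients out of the same tracked `Var` objects, so the resulting gradient is itself a differentiable graph and can be fed back into `grad` again:

```python
class Var:
    """A tracked scalar: holds a value plus edges to its parents,
    each with a closure computing the local gradient contribution."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = list(parents)  # list of (parent Var, local_grad_fn)

    def _wrap(self, other):
        return other if isinstance(other, Var) else Var(other)

    def __add__(self, other):
        other = self._wrap(other)
        return Var(self.value + other.value,
                   [(self, lambda g: g), (other, lambda g: g)])

    def __mul__(self, other):
        other = self._wrap(other)
        # d(a*b)/da = b, d(a*b)/db = a -- expressed as Var ops,
        # so the gradient graph is itself differentiable
        return Var(self.value * other.value,
                   [(self, lambda g: g * other), (other, lambda g: g * self)])


def grad(output, wrt):
    """Reverse-mode sweep; gradients are accumulated as Vars,
    so the result can be differentiated again."""
    order, seen = [], set()

    def topo(v):  # post-order DFS: parents end up before v
        if id(v) in seen:
            return
        seen.add(id(v))
        for p, _ in v.parents:
            topo(p)
        order.append(v)

    topo(output)
    grads = {id(output): Var(1.0)}
    for v in reversed(order):
        g = grads.get(id(v))
        if g is None:
            continue
        for p, local in v.parents:
            pg = local(g)
            grads[id(p)] = grads[id(p)] + pg if id(p) in grads else pg
    return grads.get(id(wrt), Var(0.0))


x = Var(2.0)
y = x * x * x          # y = x^3
g1 = grad(y, x)        # dy/dx  = 3x^2 = 12 at x = 2
g2 = grad(g1, x)       # d2y/dx2 = 6x  = 12 at x = 2
```

Because `grad` never leaves the `Var` world, nothing special is needed for the second derivative; you just call `grad` on the first gradient.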
As far as I understand, the thing with the functions you linked is that for those it is either not so simple or not so efficient to do this directly, because the work is split between the Python-coded bits, the bits implemented in C, and the backend implementations. So for those functions the second-derivative formulas are spelled out explicitly instead.
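To illustrate what "spelling out the formula explicitly" means, here is a hand-derived scalar sketch in the spirit of those double-backward helpers. This is not the code from auto_double_backwards.py (which operates on Variables and handles the full set of gradient inputs); the function names and the scalar setting are simplifications for illustration:

```python
import math

def tanh_backward(grad_output, x):
    # first derivative: d tanh(x)/dx = 1 - tanh(x)^2,
    # applied to the incoming gradient
    y = math.tanh(x)
    return grad_output * (1.0 - y * y)

def tanh_double_backward(grad_grad_input, grad_output, x):
    # explicit, hand-derived second-derivative formula:
    # d/dx [grad_output * (1 - tanh(x)^2)]
    #   = grad_output * (-2 * tanh(x) * (1 - tanh(x)^2))
    # grad_grad_input is the gradient flowing into the backward's output
    y = math.tanh(x)
    return grad_grad_input * grad_output * (-2.0 * y * (1.0 - y * y))
```

Writing the formula by hand like this avoids re-tracing the first backward through the Python/C/backend layers, at the cost of having to derive and maintain one such formula per function.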