How does autograd work?

Hi,

I have been wondering how autograd actually works, i.e. how does it compute the gradients? I understand that it does not use any numerical methods (e.g. finite differences).

Are all methods’ derivatives pre-defined / hard-coded inside Autograd, e.g. d/dx (x^p) = p * x^(p-1), or does it use some symbolic method?

Also, do all deep learning frameworks (e.g. PyTorch, TensorFlow, MXNet) use the same method, or are there differences between them?

Thanks.

No symbolic methods. That would be insane. It uses FORTRAN-style calculus based on the graph it creates.

If you have to calculate the derivative of a function y = f(a, b, c) with respect to a at a given point, you measure how y changes when you vary a:

da = (f(a + e, b, c) - f(a, b, c)) / e
print(da)

where e is some small number.
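
For example, a self-contained version of that estimate (f, a, b, c and e below are placeholders made up for illustration):

def f(a, b, c):
    # illustrative function: df/da = 2 * a * b
    return a ** 2 * b + c

a, b, c = 3.0, 2.0, 1.0
e = 1e-6  # small step size

da = (f(a + e, b, c) - f(a, b, c)) / e
print(da)  # ~12.0, close to the exact derivative 2 * a * b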

Okay, but isn’t that numerical differentiation (see https://en.wikipedia.org/wiki/Numerical_differentiation)?

Or is it more like Automatic differentiation using dual numbers (https://en.wikipedia.org/wiki/Automatic_differentiation)?
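
For reference, my understanding of the dual-number approach from that article is roughly the following sketch (the Dual class and the log helper are just illustrative, not anything from PyTorch):

import math

class Dual:
    # a value together with its derivative w.r.t. the chosen input
    def __init__(self, val, dot):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def log(x):
    # d/dx log(x) = 1/x, applied through the chain rule
    return Dual(math.log(x.val), x.dot / x.val)

x = Dual(2.0, 1.0)   # seed with dx/dx = 1
y = log(x * x)       # y = log(x^2)
print(y.val, y.dot)  # dy/dx = 2/x = 1.0 at x = 2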

This video explains it very well:


Thanks, the video was great!

However, I don’t think it answers my question, i.e. how are the actual gradients calculated (e.g. how do we compute the gradient of log(x) in an automatic way)?

Hi,

Yes, automatic differentiation is used.
It basically computes gradients with the chain rule: if you have y = f(x) and z = g(y),
then dz/dx = dz/dy * dy/dx (with * being a matrix multiply).

So the only things needed to compute any gradient are the derivative formulas for all the elementary functions and some mechanism to know how to apply them.
You can find (most of) the formulas in this file, which we use to define the mapping between a function and its derivative.
And the order is given by an acyclic graph that we create during the forward pass.
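
To make that concrete, here is a very reduced sketch of the idea (a toy graph-based reverse mode; the Var class and the mul/powc/log helpers are made up for illustration and are not PyTorch’s actual implementation): each elementary function records its inputs together with its hard-coded derivative formula during the forward, and backward walks the recorded graph applying the chain rule.

import math

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # list of (parent Var, local derivative)

    def backward(self, grad=1.0):
        # chain rule: propagate d(output)/d(self) to every parent
        self.grad += grad
        for parent, local_grad in self.parents:
            parent.backward(grad * local_grad)

# elementary functions with hard-coded derivative formulas
def mul(a, b):
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def powc(a, p):
    return Var(a.value ** p, [(a, p * a.value ** (p - 1))])  # d/dx x^p = p * x^(p-1)

def log(a):
    return Var(math.log(a.value), [(a, 1.0 / a.value)])      # d/dx log(x) = 1/x

x = Var(3.0)
y = log(mul(powc(x, 2), x))  # y = log(x^3); the graph is recorded during this forward
y.backward()
print(x.grad)                # 3 / x = 1.0 at x = 3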

Hi,

Thanks, that clarifies the approach! Do you know if that is the common way to do it (i.e. do you think TensorFlow, MXNet, etc. take the same approach, or is it PyTorch specific)?

The chain-rule-based method is also known as backpropagation and is used by most DL frameworks. I am not sure about MXNet, but TensorFlow uses it for sure.
The main difference with PyTorch used to be that the acyclic graph in TensorFlow was created statically, while PyTorch recreates it at each forward pass. But that is not true anymore now that TensorFlow has eager mode by default.
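
For example, in PyTorch you can see that the graph is rebuilt on every forward, since the sequence of ops is free to change between iterations (a minimal sketch using the public autograd API):

import torch

x = torch.tensor(2.0, requires_grad=True)

for step in range(3):
    # a fresh graph is recorded on every forward, so Python control flow
    # can change the computation from one iteration to the next
    y = x.log() if step % 2 == 0 else x ** 2
    y.backward()
    print(step, x.grad.item())  # 0.5, 4.0, 0.5
    x.grad = None  # clear the accumulated gradient before the next graph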

Hi,

I meant the “hard-coding” of gradients.

Do you know if that is the common way to do it (i.e. do you think TensorFlow, MXNet, etc. take the same approach, or is it PyTorch specific)?

This paper may provide further information for you.

Or is it more like Automatic differentiation using dual numbers (https://en.wikipedia.org/wiki/Automatic_differentiation)?

Yes, AD (automatic differentiation) first appeared in a Fortran project. That fact comes from the paper I linked above.

By far the most outstanding article explaining AD that I have found is this one, from 2013.

Hi,

Yes, you have to. The formula has to come from somewhere, since we do neither symbolic differentiation nor finite differences.

Hi,

Thank you for the answers!

Thanks, I will take a look at the papers.