How does autograd work?

Hi,

I have been wondering how autograd actually works, i.e. how does it compute the gradients? I understand that it does not use any numerical methods (e.g. finite differences).

Are all methods’ derivatives pre-defined / hard-coded inside Autograd, e.g. d/dx (x^p) = p * x^(p-1), or does it use some symbolic method?

Also, do all deep learning frameworks (e.g. PyTorch, TensorFlow, MXNet) use the same method, or are there differences between them?

Thanks.

No symbolic methods. That would be insane. It uses FORTRAN-style calculus based on the graph it creates.

If you have to calculate the derivative of a function y = f(a, b, c) with respect to a at a given point, you measure how y changes when you vary a:

da = (f(a + e, b, c) - f(a, b, c)) / e
print(da)

where e is some small number.
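
For example, a self-contained version of that estimate (f, a, b, c and e below are placeholders made up for illustration):

def f(a, b, c):
    # illustrative function: df/da = 2 * a * b
    return a ** 2 * b + c

a, b, c = 3.0, 2.0, 1.0
e = 1e-6  # small step size

da = (f(a + e, b, c) - f(a, b, c)) / e
print(da)  # ~12.0, close to the exact derivative 2 * a * b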

Okay, but isn’t that numerical differentiation (see https://en.wikipedia.org/wiki/Numerical_differentiation)?

Or is it more like Automatic differentiation using dual numbers (https://en.wikipedia.org/wiki/Automatic_differentiation)?
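
For reference, my understanding of the dual-number approach from that article is roughly the following sketch (the Dual class and the log helper are just illustrative, not anything from PyTorch):

import math

class Dual:
    # a value together with its derivative w.r.t. the chosen input
    def __init__(self, val, dot):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def log(x):
    # d/dx log(x) = 1/x, applied through the chain rule
    return Dual(math.log(x.val), x.dot / x.val)

x = Dual(2.0, 1.0)   # seed with dx/dx = 1
y = log(x * x)       # y = log(x^2)
print(y.val, y.dot)  # dy/dx = 2/x = 1.0 at x = 2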

This video explains it very well:


Thanks, the video was great!

However, I don’t think it answers my question, i.e. how are the actual gradients calculated (e.g. how do we compute the gradient of log(x) in an automatic way)?

Hi,

Yes, automatic differentiation is used.
It basically computes gradients with the chain rule: if you have y = f(x) and z = g(y),
then dz/dx = dz/dy * dy/dx (with * being a matrix multiply).

So the only things needed to compute any gradient are the derivative formulas for all the elementary functions and some mechanism to know how to apply them.
You can find (most of) the formulas in this file, which we use to define the mapping between a function and its derivative.
And the order is given by an acyclic graph that we create during the forward pass.
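
To make that concrete, here is a very reduced sketch of the idea (a toy graph-based reverse mode; the Var class and the mul/powc/log helpers are made up for illustration and are not PyTorch’s actual implementation): each elementary function records its inputs together with its hard-coded derivative formula during the forward, and backward walks the recorded graph applying the chain rule.

import math

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # list of (parent Var, local derivative)

    def backward(self, grad=1.0):
        # chain rule: propagate d(output)/d(self) to every parent
        self.grad += grad
        for parent, local_grad in self.parents:
            parent.backward(grad * local_grad)

# elementary functions with hard-coded derivative formulas
def mul(a, b):
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def powc(a, p):
    return Var(a.value ** p, [(a, p * a.value ** (p - 1))])  # d/dx x^p = p * x^(p-1)

def log(a):
    return Var(math.log(a.value), [(a, 1.0 / a.value)])      # d/dx log(x) = 1/x

x = Var(3.0)
y = log(mul(powc(x, 2), x))  # y = log(x^3); the graph is recorded during this forward
y.backward()
print(x.grad)                # 3 / x = 1.0 at x = 3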

Hi,

Thanks, that clarifies the approach! Do you know if that is the common way to do it (i.e. do you think TensorFlow, MXNet, etc. take the same approach, or is it PyTorch specific)?

The chain-rule-based method is also known as backpropagation and is used by most DL frameworks. I am not sure about MXNet, but TensorFlow uses it for sure.
The main difference with PyTorch used to be that the acyclic graph in TensorFlow was created statically, while PyTorch recreates it at each forward pass. But that is not true anymore now that TensorFlow has eager mode by default.
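
For example, in PyTorch you can see that the graph is rebuilt on every forward, since the sequence of ops is free to change between iterations (a minimal sketch using the public autograd API):

import torch

x = torch.tensor(2.0, requires_grad=True)

for step in range(3):
    # a fresh graph is recorded on every forward, so Python control flow
    # can change the computation from one iteration to the next
    y = x.log() if step % 2 == 0 else x ** 2
    y.backward()
    print(step, x.grad.item())  # 0.5, 4.0, 0.5
    x.grad = None  # clear the accumulated gradient before the next graph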

Hi,

I meant the “hard-coding” of gradients.

Do you know if that is the common way to do it (i.e. do you think TensorFlow, MXNet, etc. take the same approach, or is it PyTorch specific)?

This paper may provide further information for you.

Or is it more like Automatic differentiation using dual numbers (https://en.wikipedia.org/wiki/Automatic_differentiation)?

Yes, AD (automatic differentiation) first appeared in a Fortran project. That fact comes from the paper I linked above.

By far the most outstanding article explaining AD that I have found is this one, from 2013.

Hi,

Yes, you have to. The formula has to come from somewhere, since we do neither symbolic differentiation nor finite differences.

Hi,

Thank you for the answers!

Thanks, I will take a look at the papers.