Numerical differences between manually computed gradients and PyTorch autograd

I asked this question on Stack Overflow a while ago:

Does someone have an explanation? Are there operations that are more numerically stable that I should use?
What order of magnitude of relative difference should be expected? Thanks again.


I haven’t looked very closely at the code, but the first thing is that with floating point, the precision is going to be very, very bad. All the tests we do for gradients are actually done in double precision to be sure.
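To illustrate the point about double precision (a NumPy sketch with made-up shapes, not the original code): comparing the analytic gradient of a tiny quadratic loss against central finite differences shows how much the agreement depends on the working precision.

```python
import numpy as np

def grad_error(dtype):
    """Max relative error between the analytic and finite-difference
    gradient of loss = 0.5 * ||W x - t||^2, computed in `dtype`."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((3, 5)).astype(dtype)
    x = rng.standard_normal(5).astype(dtype)
    t = rng.standard_normal(3).astype(dtype)

    # analytic gradient: dL/dW = (W x - t) x^T
    grad_analytic = np.outer(W @ x - t, x)

    # central finite differences, one entry of W at a time
    eps = dtype(1e-3)
    grad_fd = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            lp = 0.5 * np.sum((Wp @ x - t) ** 2)
            lm = 0.5 * np.sum((Wm @ x - t) ** 2)
            grad_fd[i, j] = (lp - lm) / (2 * eps)

    return np.abs(grad_analytic - grad_fd).max() / np.abs(grad_analytic).max()

err32 = grad_error(np.float32)
err64 = grad_error(np.float64)
print(f"float32: {err32:.2e}, float64: {err64:.2e}")
```

The float64 error is typically many orders of magnitude smaller than the float32 one, which is why the official gradient tests run in double precision.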

Yeah I know that you usually check things like finite differences approximation against autograd in float64 however I thought that since I am computing analytical expression against analytical expression for a very small network it should not matter that much. But I am no expert on single precision calculus and I was wondering if things like the order of the multiply/sum mattered and stuff like that and how easy it was to make things match perfectly.

Float operations are not associative, I’m afraid. So it’s going to be impossible to make them match exactly unless you know exactly in which order each op is executed.
From that point, depending on the other values, the small differences can grow quite fast.
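A minimal demonstration of the non-associativity point, in plain Python (IEEE-754 doubles): the same three numbers grouped differently round to different results.

```python
# Floating-point addition is not associative: regrouping the same
# operands changes where rounding happens, and hence the result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # rounds to 0.6000000000000001
right = a + (b + c)   # rounds to 0.6
print(left == right)  # False
```

So two mathematically identical gradient formulas need not agree bit-for-bit unless every sum and product happens in the same order.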

Your code looks OK.
I would compute the difference between the prediction and the ground truth once and reuse it.
You can check the intermediate results to see if it’s just an error getting bigger and bigger.
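The reuse suggestion looks something like this (illustrative names only, not the original code): for an MSE-style loss, the residual appears in both the loss and its gradient, so compute it once.

```python
import numpy as np

rng = np.random.default_rng(0)
prediction = rng.standard_normal(4)
ground_truth = rng.standard_normal(4)

# Compute the residual once and reuse it everywhere it appears.
residual = prediction - ground_truth
loss = 0.5 * np.dot(residual, residual)  # 0.5 * ||prediction - ground_truth||^2
grad_prediction = residual               # d(loss)/d(prediction)
```

Besides avoiding redundant work, this guarantees the loss and the hand-written gradient see the exact same rounded residual, which removes one source of mismatch when comparing against autograd.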

OK, thanks for your advice!