Interpreting gradcheck errors

Hi, sorry for the basic question.

Gradcheck seems to give 2 separate outputs on a failure: numerical and analytical.

What is the difference between the two? And how would I use this information to find a problem in my model?

Example output from a failed gradcheck…

RuntimeError: for output no. 0,
numerical:(
1 0 0 … 0 0 0
0 1 0 … 0 0 0
0 0 1 … 0 0 0
… ⋱ …
0 0 0 … 1 0 0
0 0 0 … 0 1 0
0 0 0 … 0 0 1
[torch.FloatTensor of size 150x150]
,)
analytical:(
0 0 0 … 0 0 0
0 0 0 … 0 0 0
0 0 0 … 0 0 0
… ⋱ …
0 0 0 … 0 0 0
0 0 0 … 0 0 0
0 0 0 … 0 0 0
[torch.FloatTensor of size 150x150]
,)

They are the numerical Jacobian, estimated with pointwise perturbations, and the analytical Jacobian, computed by autograd.

The gradcheck error message isn’t the nicest… it is improved on master though!

The numerical Jacobian seems to imply that your module is an identity op, but the analytical one says the output is independent of the input (assuming that the first matrix is I and the second is 0).

Gotcha. So the numerical Jacobian is the result of comparing the forward pass at 2 very close input values (finite differences).

And the analytical one is what autograd computed using automatic differentiation.

They should closely match; otherwise there is an error.
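As a minimal sketch of what gradcheck does (the function `f` here is my own example, not from the thread): it perturbs each input element by `eps` to build the numerical Jacobian, asks autograd for the analytical one, and returns True if they agree within tolerance.

```python
import torch
from torch.autograd import gradcheck

# A simple, correctly differentiable function to check.
def f(x):
    return (x ** 2).sum()

# gradcheck wants double precision (to keep finite-difference noise down)
# and requires_grad=True on the input.
x = torch.randn(5, dtype=torch.double, requires_grad=True)
print(gradcheck(f, (x,), eps=1e-6, atol=1e-4))  # prints True
```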

Yes, that is correct.

Are you seeing this from a custom autograd function (i.e. you wrote the backward), or from the built-in autograd operations (i.e. backward is automatically computed)?

It’s from a built in. I think it’s caused by the way I declared Variables on the input parameters. Still tracking it down.

Sorry, I should have explained a bit better for the community…

So my understanding is an error like this…

RuntimeError: for output no. 0,
numerical:(
1 0 0 … 0 0 0
0 1 0 … 0 0 0
0 0 1 … 0 0 0
… ⋱ …
0 0 0 … 1 0 0
0 0 0 … 0 1 0
0 0 0 … 0 0 1
[torch.FloatTensor of size 150x150]
,)
analytical:(
0 0 0 … 0 0 0
0 0 0 … 0 0 0
0 0 0 … 0 0 0
… ⋱ …
0 0 0 … 0 0 0
0 0 0 … 0 0 0
0 0 0 … 0 0 0
[torch.FloatTensor of size 150x150]
,)

The giveaway is the matrix of zeros in the "analytical" result. All the grads are zero! I guess this means that somewhere in the computation graph, autograd lost track of the gradients, possibly by passing through a Variable I created that didn't have gradients in it.

That's how I'm interpreting that result.

So I’m checking all my Variables to make sure the ones that should have gradients do have them.
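Here is a hedged sketch of a function that reproduces this exact failure mode (the `broken_identity` name and the `detach` trick are my own illustration, not from the thread): `.detach()` cuts the graph, so numerically the function behaves like the identity, but autograd sees no gradient path back to the input.

```python
import torch
from torch.autograd import gradcheck

def broken_identity(x):
    # .detach() severs the graph; "+ 0 * x" keeps the output attached
    # so autograd still runs, but computes a gradient of zero.
    return x.detach() + 0 * x

x = torch.randn(4, dtype=torch.double, requires_grad=True)
try:
    gradcheck(broken_identity, (x,), eps=1e-6, atol=1e-4)
except RuntimeError as e:
    # numerical Jacobian is the identity, analytical is all zeros,
    # matching the error dump above
    print("gradcheck failed:", type(e).__name__)
```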


Has your problem been solved? I am implementing a custom operator according to https://pytorch.org/docs/master/notes/extending.html and also get zero analytical gradients. I am wondering whether it is a problem with the inputs or with my implementation of the backward function.

Just fixed the problem. Pay attention to the tutorial:

import torch
from torch.autograd import Variable, gradcheck

input = (Variable(torch.randn(20, 20).double(), requires_grad=True), Variable(torch.randn(30, 20).double(), requires_grad=True))
test = gradcheck(Linear.apply, input, eps=1e-6, atol=1e-4)

(Linear here is the custom autograd Function defined in that tutorial.)

We must set requires_grad=True for the variables.
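For a self-contained version of the same pattern (my own toy Function standing in for the tutorial's Linear, written in the newer tensor-based style rather than with Variable):

```python
import torch
from torch.autograd import gradcheck

# A toy custom Function with a hand-written backward: y = 3 * x.
class MulThree(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 3

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * 3

# Inputs must be double precision and have requires_grad=True,
# otherwise gradcheck skips them.
x = torch.randn(10, dtype=torch.double, requires_grad=True)
print(gradcheck(MulThree.apply, (x,), eps=1e-6, atol=1e-4))  # prints True
```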

Yes, that's a bit confusing too: gradcheck checks whether requires_grad=True on each input, and if it is False, it doesn't run the check on that input.

Normally during training, requires_grad=True on the inputs is only needed if you intend to propagate the gradient outside the model, say into embeddings or something like that. That's the confusing bit :slight_smile:

However, it is helpful, because sometimes there are inputs that you don’t actually want to grad-check. So in that case, you can have gradcheck ignore them by setting requires_grad=False.
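A quick sketch of that skip behaviour (the `scale` function is my own example): the second input is left with requires_grad=False, so gradcheck only checks the Jacobian with respect to the first.

```python
import torch
from torch.autograd import gradcheck

def scale(x, factor):
    return x * factor

x = torch.randn(5, dtype=torch.double, requires_grad=True)
# factor has requires_grad=False (the default), so gradcheck ignores it
factor = torch.tensor(2.0, dtype=torch.double)
print(gradcheck(scale, (x, factor), eps=1e-6, atol=1e-4))  # prints True
```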
