Say I want to check the gradient of f_\theta(x) w.r.t. \theta. Why should we call gradcheck(f, x.requires_grad_()) rather than just gradcheck(f, x)? The quantity we care about is df/d\theta. To check it, I guess PyTorch numerically verifies whether |f(x+h) - f(x) - grad(f)(x)h|/|h| -> 0 (or some numerically efficient variant of this)? It seems redundant to make x require grad.
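In code, I imagine the numerical side looks roughly like this central-difference check (my own sketch of the idea, not PyTorch's actual implementation):

```python
import torch

def numerical_grad(f, x, eps=1e-6):
    # Central differences, entry by entry -- my guess at the kind of
    # finite-difference check gradcheck performs.
    g = torch.zeros_like(x)
    flat, gflat = x.view(-1), g.view(-1)
    for i in range(flat.numel()):
        orig = flat[i].item()
        flat[i] = orig + eps
        fp = f(x).sum().item()
        flat[i] = orig - eps
        fm = f(x).sum().item()
        flat[i] = orig  # restore the entry
        gflat[i] = (fp - fm) / (2 * eps)
    return g

x = torch.randn(5, dtype=torch.double, requires_grad=True)
f = lambda t: t ** 2
analytic = torch.autograd.grad(f(x).sum(), x)[0]   # autograd gives 2*x
numeric = numerical_grad(f, x.detach().clone())    # finite-difference estimate
```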

Hi,

Indeed, gradients are only checked for the Tensors that are explicit inputs to the function. So if none of the inputs requires gradients, there is no gradient to check.

You will need to give theta as an input to your function to be able to use gradcheck I’m afraid.

Hi albanD. Thanks for your reply. Here I use \theta to represent the parameters of the network f. So if f = nn.Linear(a, b), we can write f(x) = Wx, and \theta will be W. It doesn’t make sense to pass W as an input to gradcheck(), no? The input to gradcheck() should be the input to the forward() function of f. Are you suggesting that gradcheck() calculates df/dx rather than df/d\theta?

Yes, gradcheck only checks gradients for inputs to the function that are Tensors requiring gradients.
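For example (a minimal sketch; the `nn.Linear` model and shapes are my own choices), this call only verifies df/dx, because x is the only input Tensor that requires grad:

```python
import torch
from torch.autograd import gradcheck

# gradcheck relies on finite differences, so double precision is recommended
lin = torch.nn.Linear(3, 2).double()
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)

# Only the explicit input x is checked (df/dx); the weight and bias of
# `lin` are not inputs here, so their gradients are not verified.
ok = gradcheck(lin, (x,))
```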

Note that in this case, you can “cheat” by doing: `gradcheck(lambda inp, *ignore: model(inp), (inp, *model.parameters()))`


This will make all the parameters look like inputs to gradcheck. Since gradcheck perturbs these Tensors in place, the in-place changes must be properly reflected in the model’s forward (this will work with vanilla pytorch Modules at least).
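Spelled out, the trick looks like this (my own sketch, with an assumed `nn.Linear` model):

```python
import torch
from torch.autograd import gradcheck

model = torch.nn.Linear(3, 2).double()
inp = torch.randn(4, 3, dtype=torch.double, requires_grad=True)

# The lambda ignores the extra arguments positionally, but gradcheck
# perturbs the parameter Tensors in place, so the perturbations are
# still visible inside the model's forward. This ends up checking
# df/dx as well as df/dW and df/db.
ok = gradcheck(lambda inp, *ignore: model(inp), (inp, *model.parameters()))
```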