Where is grad_fn set in pytorch source?(version 1.6)

AlexLuya · August 22, 2020, 1:27pm

Hello,we have a framework which will be responsible for doing configuration,start/stop training,etc,and it can be used to run another model training without any problem,so this framework seems work fine,but when it used to train my current model,the first epoch runs fine,so this model seems work fine also,but when the second epoch just begin,an exception got thrown out:

  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

by debugging,I found that the output tenor of network has grad_fn = None,and this is reproduciable: always comes in FIRST backwarding of SECOND epoch.

Since this framework can be used to train other modes without any problem,and for current model,the first epoch runs fine also,so I want to dig deep to find out why grad_fn doesn’t got set,and after searching the source code of pytorch,I found it seems like being set in nn.modules.module.py line 736

            grad_fn = var.grad_fn
            if grad_fn is not None:

but debugging doesn’t confirm my searching,so would you please tell where is grad_fn got set,thanks?

ptrblck · August 24, 2020, 9:52am

I would recommend to check the Python code, especially the forward pass, which would create the grad_fns for all tensors, where the gradient calculation is not disabled.
Based on your description, did you use torch.set_grad_enabled in any part of the code or a with torch.no_grad() wrapper accidentally for some epochs?

AlexLuya · September 2, 2020, 1:36am

Thanks,in may case,‘self.requires_grad_(False)’ caused this problem