Debugging backward errors

I have a big composite module that jit-compiles and executes forward() fine, but fails in backward(). The big issue is that there is no error with jit disabled, and set_detect_anomaly() is not very helpful in jit mode.

I’m 90% sure the error itself is related to how jit incorrectly enables requires_grad in multiple scenarios. But are there any techniques to localize it?
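To give an idea of the setup, the way I run it is roughly this (module and input names below are placeholders, not my actual code):

```python
import torch

# In eager mode there is no error at all; under jit, anomaly detection only
# gets as far as the DifferentiableGraphBackward node (see the console
# warning further down), so it doesn't point at the offending forward op.
torch.autograd.set_detect_anomaly(True)

model = MyBigModule()              # placeholder for the real composite module
scripted = torch.jit.script(model)

out = scripted(example_input)      # forward() runs fine
out.sum().backward()               # RuntimeError inside the TorchScript interpreter
```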

For reference, here is the exception text:

```
The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<string>", line 138, in <backward op>
                   dim: int):
            def backward(grad_outputs: List[Tensor]):
                grad_self = torch.stack(grad_outputs, dim)
                            ~~~~~~~~~~~ <--- HERE
                return grad_self, None
RuntimeError: sizes() called on undefined Tensor
```

So this is some generated code, which looks like it belongs to the unbind() operation, where the outputs have inconsistent requires_grad?

Also note how the traceback cutting off mid-signature at the “dim: int” line does a bad service here.
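A quick way to check the inconsistent-requires_grad suspicion is to compare the flags on unbind() outputs in eager vs. scripted mode (toy function, just for illustration):

```python
import torch

def g(x):
    parts = x.unbind(0)
    return parts[0], parts[1]

x = torch.randn(2, 3, requires_grad=True)

eager_flags = [t.requires_grad for t in g(x)]
script_flags = [t.requires_grad for t in torch.jit.script(g)(x)]
print(eager_flags, script_flags)   # any mismatch would support the suspicion
```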

And the console prints:

```
[W …\torch\csrc\autograd\python_anomaly_mode.cpp:104] Warning: Error detected in struct torch::jit::`anonymous namespace'::DifferentiableGraphBackward. Traceback of forward call that caused the error:
```

followed by a traceback that stops at the jit module.

To me the error looks like you have an undefined Tensor (aka None) showing up unexpectedly.
I would try to cut down the model (or just the JITed part) a bit to zoom in on where it happens. (You can find a method I use to grab a submodule by searching for DebugWrap in my blog post on PyTorch and TVM.)
Naturally, we would be most grateful if you found a reproducing snippet that you can share, so this can be fixed in PyTorch.
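The idea is roughly something like this (a rough sketch only, not the exact DebugWrap code from the blog post):

```python
import torch

class DebugWrap(torch.nn.Module):
    # Rough sketch: wrap a suspect submodule so its inputs are captured
    # during a normal eager run; afterwards you can script just that
    # submodule, replay the captured inputs and call backward in isolation.
    def __init__(self, inner):
        super().__init__()
        self.inner = inner
        self.captured_inputs = []

    def forward(self, *args):
        self.captured_inputs.append(
            [a.detach().clone().requires_grad_(a.requires_grad)
             if isinstance(a, torch.Tensor) else a
             for a in args]
        )
        return self.inner(*args)

# Usage sketch: model.block = DebugWrap(model.block); run forward once in
# eager mode, then
#   sub = torch.jit.script(model.block.inner)
#   out = sub(*model.block.captured_inputs[0])
#   out.sum().backward()
# to see whether the failure lives in that submodule.
```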

Best regards

Thomas

I couldn’t easily split up that module (there is an issue with mutable objects in mixed mode). What helped somewhat was running backward() in a profiler context, exporting a Chrome trace and looking at the last successful ops, but that’s not a good or reliable solution. Luckily, I had identified unbind from the message above, and indeed unbind was the problem.
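Concretely, the profiler trick looked roughly like this (model and input names are placeholders):

```python
import torch

# Run backward under the autograd profiler and export a Chrome trace; the
# last ops recorded before the crash hint at where the generated backward
# died. Crude, but it narrowed things down for me.
with torch.autograd.profiler.profile() as prof:
    try:
        out = scripted_model(example_input)   # placeholders
        out.sum().backward()
    except RuntimeError as e:
        print("backward failed:", e)

prof.export_chrome_trace("backward_trace.json")  # open in chrome://tracing
```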

Now, for some reason I failed to reproduce the failure in a small script. However, I think it is somehow related to the “Backward through view of unbind output” issue, and it seems to only fail in “legacy” jit mode (which is the default in 1.6), so perhaps this will be handled in the next release.
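For completeness, the kind of pattern I mean is roughly this (this toy version does not reproduce the failure for me; it just shows where an undefined grad could enter the generated torch.stack in unbind’s backward):

```python
import torch

@torch.jit.script
def f(x):
    parts = x.unbind(0)            # unbind outputs are views of x
    a = parts[0]
    b = parts[1]
    # parts[2] is never used, so its grad_output would be undefined
    return a[1:].sum() + b.sum()   # backward also goes through a view of `a`

x = torch.randn(3, 4, requires_grad=True)
f(x).backward()                    # in my case this only broke inside the big module
```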