Hi there, I have a call in my software that fails with the following traceback:
/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in one_batch(self, i, b)
161 self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')
162 if not self.training: return
--> 163 self.loss.backward(); self('after_backward')
164 self.opt.step(); self('after_step')
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
182 products. Defaults to ``False``.
--> 184 torch.autograd.backward(self, gradient, retain_graph, create_graph)
186 def register_hook(self, hook):
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
122 tensors, grad_tensors, retain_graph, create_graph,
--> 123 allow_unreachable=True) # allow_unreachable flag
So it gives:
RuntimeError: vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)
So it seems that the calculation of the loss causes a type of internal assertion. Would love to know how to debug this type of error. By the way, it had calculated the training and validation pass correctly AFAIK, so I guess the tensors are in some way working correctly.
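For what it's worth, one generic first step for backward-time errors (not specific to this XLA assertion) is PyTorch's anomaly detection, which makes a failure inside backward() point back at the forward-pass op that created the failing node:

```python
import torch

# With anomaly detection on, autograd records the forward op that
# produced each gradient node, so an error raised in backward()
# includes a traceback into the forward pass.
x = torch.randn(4, 3, requires_grad=True)
with torch.autograd.set_detect_anomaly(True):
    loss = (x * 2).sum()
    loss.backward()  # any error raised here now carries forward-pass context

print(x.grad.shape)  # gradient of sum(2*x) w.r.t. x
```

Anomaly mode is slow, so it is only meant for debugging runs, not training.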
This is quite unexpected!
Can you share some code that makes this happen?
Also, can you run with the nightly build and TORCH_SHOW_CPP_STACKTRACES=1 to get more information about where it comes from, please?
The actual code is below this; search for
Will try to load the nightly and do !TORCH_SHOW_CPP_STACKTRACES=1 python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev, I guess, to activate the C++ traces.
Well, it seems that adding the flag like that didn't activate the C++ stack traces (maybe I need to build on Colab?): !TORCH_SHOW_CPP_STACKTRACES=1 python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev
So I updated the code a little with some extra direct links to the code that is called when fit is used. I copied the code of one_batch and Learner there, so it uses these versions that have "prints" instead of the originals; internally, fit calls this modified code.
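A minimal sketch of that kind of patching in plain Python (a hypothetical Trainer class stands in for fastai2's Learner): swap the method for a copy that prints around the original body, so the unmodified fit picks up the instrumented version:

```python
# Hypothetical stand-in for fastai2's Learner, only to illustrate
# replacing a method with a print-instrumented copy before calling fit.
class Trainer:
    def one_batch(self, batch):
        return sum(batch)          # stand-in for the loss computation

    def fit(self, batches):
        return [self.one_batch(b) for b in batches]

_orig_one_batch = Trainer.one_batch

def one_batch_debug(self, batch):
    print("before one_batch:", batch)
    out = _orig_one_batch(self, batch)
    print("after one_batch:", out)
    return out

Trainer.one_batch = one_batch_debug    # fit now calls the debug version

print(Trainer().fit([[1, 2], [3, 4]]))  # → [3, 7]
```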
TORCH_SHOW_CPP_STACKTRACES should be set in the runtime environment BEFORE you import torch for the first time (not when you install torch).
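For example, at the top of the script or in the first Colab cell (the key point is just the ordering):

```python
import os

# Set the flag in the current process before torch is ever imported;
# setting it at install time, or after the first import, has no effect.
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"

# import torch  # the first torch import then picks up the flag
```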
If you use XLA on Colab, you should have the latest version, so this should print extra information.
cc @ailzhang if you see anything obvious here?
Hmmmm, although I haven't used fastai2 with torch_xla, the colab does seem to trigger a runtime error on the XLA side. Would you mind opening an issue in the pytorch/xla GitHub repo so we can follow up there? Thanks!