Hi there, I have a call in my software that fails with the following traceback:
/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in one_batch(self, i, b)
161 self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')
162 if not self.training: return
--> 163 self.loss.backward(); self('after_backward')
164 self.opt.step(); self('after_step')
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
182 products. Defaults to ``False``.
--> 184 torch.autograd.backward(self, gradient, retain_graph, create_graph)
186 def register_hook(self, hook):
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
122 tensors, grad_tensors, retain_graph, create_graph,
--> 123 allow_unreachable=True) # allow_unreachable flag
So it gives:
RuntimeError: vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)
So it seems that the calculation of the loss causes a type of internal assertion. Would love to know how to debug this type of error. By the way, it had calculated the training and validation pass correctly AFAIK, so I guess the tensors are in some way working correctly.
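For what it's worth, one generic first step for backward-time errors (not specific to this XLA assertion) is PyTorch's anomaly detection, which makes a failure inside backward() point back at the forward-pass op that created the failing node:

```python
import torch

# With anomaly detection on, autograd records the forward op that
# produced each gradient node, so an error raised in backward()
# includes a traceback into the forward pass.
x = torch.randn(4, 3, requires_grad=True)
with torch.autograd.set_detect_anomaly(True):
    loss = (x * 2).sum()
    loss.backward()  # any error raised here now carries forward-pass context

print(x.grad.shape)  # gradient of sum(2*x) w.r.t. x
```

Anomaly mode is slow, so it is only meant for debugging runs, not training.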
This is quite unexpected!
Can you share some code that makes this happen?
Also, can you run with the nightly build and TORCH_SHOW_CPP_STACKTRACES=1 to get more information about where it comes from, please?
The actual code is below this; search for
Will try to load the nightly and do !TORCH_SHOW_CPP_STACKTRACES=1 python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev, I guess, to activate the C++ traces.
Well, it seems that adding the flag like that didn't activate the C++ stack traces (maybe I need to build on Colab?): !TORCH_SHOW_CPP_STACKTRACES=1 python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev
So I updated the code a little with some extra direct links to the code that is called when fit is used. I copied the code of one_batch and Learner there, so it uses these versions that have "prints" instead of the originals; internally, fit calls this modified code.
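A minimal sketch of that kind of patching in plain Python (a hypothetical Trainer class stands in for fastai2's Learner): swap the method for a copy that prints around the original body, so the unmodified fit picks up the instrumented version:

```python
# Hypothetical stand-in for fastai2's Learner, only to illustrate
# replacing a method with a print-instrumented copy before calling fit.
class Trainer:
    def one_batch(self, batch):
        return sum(batch)          # stand-in for the loss computation

    def fit(self, batches):
        return [self.one_batch(b) for b in batches]

_orig_one_batch = Trainer.one_batch

def one_batch_debug(self, batch):
    print("before one_batch:", batch)
    out = _orig_one_batch(self, batch)
    print("after one_batch:", out)
    return out

Trainer.one_batch = one_batch_debug    # fit now calls the debug version

print(Trainer().fit([[1, 2], [3, 4]]))  # → [3, 7]
```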
TORCH_SHOW_CPP_STACKTRACES should be set in the runtime environment BEFORE you import torch for the first time (not when you install torch).
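For example, at the top of the script or in the first Colab cell (the key point is just the ordering):

```python
import os

# Set the flag in the current process before torch is ever imported;
# setting it at install time, or after the first import, has no effect.
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"

# import torch  # the first torch import then picks up the flag
```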
If you use XLA on Colab, you should have the latest version, so this should print extra information.
cc @ailzhang if you see anything obvious here?
Hmmmm, although I haven't used fastai2 with torch_xla, the colab does seem to trigger a runtime error on the XLA side. Would you mind opening an issue in the pytorch/xla GitHub repo so we can follow up there? Thanks!