Assert in autograd

Hi folks,

I’m trying to get a seq2seq trainer working in PyTorch. This is the first time I’ve thrown a lot of data at the engine, though that may be irrelevant. This is 0.3.0 on ppc64le, RHEL 7.2, Python 2.7.14, CUDA 9, cuDNN 7, GCC 6.4.

INFO: Epoch: 0; finished=55.89 %; 24640 updates; time=29471.03 secs; agv_nll=1.76676738262; perplexity=5.85190578042
INFO: Epoch: 0; finished=55.91 %; 24650 updates; time=29484.32 secs; agv_nll=2.19744563103; perplexity=9.00198970313
Traceback (most recent call last):
  File "/dccstor/jlquinn01/mnlp-nn-rl-kit/src/mnlp/nn/seq2seq/tools/trainSeq2Seq.py", line 230, in <module>
    nll.backward()
  File "/dccstor/jlquinn-mt/nmt-env/ppc64le/lib/python2.7/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/dccstor/jlquinn-mt/nmt-env/ppc64le/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed.

nll is the loss returned after running the model. I’m using DataParallel on 3-4 GPUs; the overall shape of the trainer is roughly like the sketch below.
I haven’t filed a bug yet because I can’t rule out a mistake in my own code. Does anyone have any thoughts?
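For context, here’s a minimal sketch of the wiring around the failing backward call. The module and all the names in it are hypothetical stand-ins for my actual encoder-decoder, just to show how DataParallel and the loss are hooked up:

```python
# Minimal sketch only -- TinySeq2Seq and every name here are made-up
# placeholders, not my real trainer.
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    # Stand-in encoder-decoder; just enough to exercise DataParallel's
    # scatter/gather over the batch dimension.
    def __init__(self, vocab=100, hidden=32):
        super(TinySeq2Seq, self).__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, src):
        h, _ = self.rnn(self.emb(src))   # (batch, seq, hidden)
        return self.out(h)               # (batch, seq, vocab)

model = nn.DataParallel(TinySeq2Seq().cuda(), device_ids=[0, 1, 2])
crit = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

src = torch.randint(0, 100, (48, 20)).cuda()  # batch x seq of token ids
tgt = torch.randint(0, 100, (48, 20)).cuda()

logits = model(src)  # replicas run per GPU, outputs gathered on GPU 0
nll = crit(logits.view(-1, logits.size(-1)), tgt.view(-1))
opt.zero_grad()
nll.backward()       # <-- this is where the assert fires for me
opt.step()
```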

Thanks
Jerry

The issue is tracked here: https://github.com/pytorch/pytorch/issues/3883. It will be fixed soon.

Thank you for the info!