I got the traceback by enabling torch.autograd.set_detect_anomaly(True).
The error shows the following:

```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [2, 1, 1024]] is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```
For inplace errors, anomaly mode points to the operation that saved the Tensor for later use, not to the one that modified it in place.
But given that it complains about a LongTensor, I'm fairly confident that you modify positions in place later in your code. That would cause the issue, since gather() needs these positions to compute its backward pass.
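To make this concrete, here is a minimal sketch (the shapes and variable names are made up, not taken from your code) that reproduces the same failure mode:

```python
import torch

hidden_states = torch.randn(2, 4, 8, requires_grad=True)
positions = torch.zeros(2, 1, 8, dtype=torch.long)

states = hidden_states.gather(-2, positions)  # gather saves positions for its backward
positions += 1                                # in-place change bumps positions' version

states.sum().backward()
# RuntimeError: one of the variables needed for gradient computation has been
# modified by an inplace operation: [torch.LongTensor [2, 1, 8]] ...
```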
The version counter tracks how many times a Tensor has been changed in place.
In-place modifications are anything like positions[foo] = ..., positions.add_(foo), or positions += 2.
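If you want to watch the counter tick, every tensor exposes it as the _version attribute (an internal implementation detail, so don't rely on it in real code):

```python
import torch

positions = torch.zeros(3, dtype=torch.long)
print(positions._version)  # 0

positions[0] = 5   # indexed assignment is in-place
positions.add_(1)  # trailing-underscore methods are in-place
positions += 2     # augmented assignment is in-place

print(positions._version)  # 3
```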
A simple fix is to replace your gather call with:

```python
states = hidden_states.gather(-2, positions.clone())
```
The clone here makes sure that the version used by gather and the one used in the rest of your code don't share memory. That way, in-place operations in the rest of your code won't be a problem for gather's backward.
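Putting it together, the earlier sketch (same made-up shapes) runs cleanly once gather gets its own copy:

```python
import torch

hidden_states = torch.randn(2, 4, 8, requires_grad=True)
positions = torch.zeros(2, 1, 8, dtype=torch.long)

states = hidden_states.gather(-2, positions.clone())  # gather saves the clone
positions += 1                                        # no longer touches the saved tensor

states.sum().backward()          # succeeds
print(hidden_states.grad.shape)  # torch.Size([2, 4, 8])
```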