How do I find the in-place operation that causes an error during the backward pass?

I built a custom LSTM model (it’s quite large), and somewhere in it there is an in-place operation that gives me an error during training.

The error:

 File "c:\Users\jobei\Desktop\scriptie msc\code\models\train_model.py", line 46, in closure
    loss.backward()
  File "C:\Users\jobei\anaconda3\envs\machinelearning\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\jobei\anaconda3\envs\machinelearning\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [16, 16]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I’ve already set torch.autograd.set_detect_anomaly(True), but it doesn’t really give me any useful info. Right now I’m just trying out every small part of the model separately, but this takes a long time. I hope there is something better I can do.

If I’m correct, the only 16x16 tensors in the model are the weights of the gate networks, but as far as I’m aware I never touch those myself.

This is the GitHub link to the files if anyone is interested:

training is where I put the training loops.
simple_linear_networks contains the LSTM gate networks etc.
hawkes loss is a custom loss function I need for this network.
sd_PNHP is the network itself. It is a stacked continuous-time LSTM with two cell states per cell.
It consists of a cellstate class, a class that stacks 4 cellstates onto each other, and finally a layer class that loops through these stacked cells.

Hi J!

In fact, set_detect_anomaly(True) is generally quite useful. Look at the
lines of code in the “Traceback of forward call” that detect-anomaly causes
to be printed. (Look at your own lines of code; the internal PyTorch code is less
likely to be helpful.) One of those lines is likely to be causing the in-place
modification.
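
For reference, here is a minimal sketch of the two ways to turn the anomaly mode on. The model, input, and loss below are placeholders, not your actual code:

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 16)          # placeholder model
    inputs = torch.randn(8, 16)

    # Option 1: turn anomaly detection on globally for the whole script.
    torch.autograd.set_detect_anomaly(True)

    # Option 2: limit the (slow) anomaly mode to a single forward/backward pass.
    with torch.autograd.detect_anomaly():
        loss = model(inputs).sum()     # placeholder forward pass and loss
        loss.backward()                # a failure here also prints the
                                       # "Traceback of forward call"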

Print out ._version for those 16x16 weight tensors before and after your
forward pass. Is ._version increasing for any of them and, in particular,
ending up with a value of 2? If so, they are being modified somewhere, and
that is likely the cause of your problem. If not, you have some other
16x16 tensors somewhere; find them and check whether their ._versions show
in-place modifications.
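
Something along these lines should do it (a toy stand-in model; only the version-checking pattern matters):

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 16)                    # stand-in for your real model
    inputs = torch.randn(8, 16)

    # Snapshot every parameter's _version counter before the forward pass ...
    versions_before = {name: p._version for name, p in model.named_parameters()}

    loss = model(inputs).sum()                   # stand-in forward pass

    # ... and report any parameter whose counter increased, i.e. that was
    # modified in place somewhere during forward().
    for name, p in model.named_parameters():
        if p._version != versions_before[name]:
            print(name, "went from version", versions_before[name], "to", p._version)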

Good luck.

K. Frank

Thanks a lot, Frank! I’ve found the problem. Silly mistake: I forgot to clear a list after every run.
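
In case anyone runs into the same thing, the pattern was roughly like the sketch below (the names are hypothetical, not my actual code): a list kept accumulating tensors across iterations, so a later backward() still went through graph nodes that referenced the weights at an older version, while optimizer.step() had already updated them in place. Clearing the list at the start of every run fixes it.

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 16)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    saved_outputs = []                      # hypothetical list that was never cleared

    for step in range(3):
        saved_outputs.clear()               # <-- the fix: reset the list every run
        x = torch.randn(4, 16)
        saved_outputs.append(model(x))      # keep only outputs from this iteration
        loss = torch.stack(saved_outputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                    # in-place weight update bumps _version;
                                            # stale graph nodes from old iterations
                                            # would now trigger the version error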