Anomaly detection points to LayerNormalization operation for in-place error

geeatwork777 · May 20, 2023, 7:04pm

Hello,

I am working on a custom decoder-only Transformer model and am trying to test whether backpropagation works at all. I am relatively new to PyTorch, so I might be overlooking the main issue. As a sanity check, I want to see how the outputs change for three very basic inputs. I first encountered “RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)” and set retain_graph=True to work around it. Now I get “RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [6]] is at version 3; expected version 1 instead”, but I cannot figure out where an in-place operation takes place. I have made sure that I use no += or *= operations and no methods with a trailing underscore. Anomaly detection simply points me to

x = self.N_2(x).reshape(T,B,S,E).contiguous()

where N_2 is a simple LayerNormalization. I have spent hours inserting clone() calls and renaming variables, but I am fairly certain that there are no in-place operations.

My training loop is very simple, since I only attempted to see whether backpropagation works at all:

batch = [(embedding, output), (embedding2, output2), (embedding3, output3)]

optimizer = torch.optim.AdamW(decoder.parameters(), lr=0.0001)

for i in range(3):
    with torch.autograd.detect_anomaly():
        preds, attention, loss = decoder(batch[i][0], targets=batch[i][1])
        optimizer.zero_grad(set_to_none=True)
        loss.backward(retain_graph=True)
        optimizer.step()
        print(preds)

Is there something I overlooked? Would be really grateful for help, since I have been stuck at this point for quite some time…

ptrblck · May 20, 2023, 7:34pm

You are most likely creating the issue by using retain_graph=True, so set it to False again (or remove it) and try to fix the original error.

geeatwork777 · May 20, 2023, 7:52pm

Thank you for the quick reply! This was my first idea as well, but I was unable to find a solution that fit my model. Most threads I found online mention either detaching a hidden state or issues with separate actor and critic optimizers, but neither applies to my case. Could you give me an idea of how to resolve it?

ptrblck · May 20, 2023, 8:01pm

Without seeing a code snippet I can only refer to my previous posts, which you might already have seen.

geeatwork777 · May 20, 2023, 8:55pm

Yeah, I fear I have read the majority of threads on the error without success. The full model is unfortunately too large to share here in full, but as I mentioned it is similar in structure to a simplified GPT. I exclusively use Linear layers, LayerNormalization and SNN neurons from the SpikingJelly library (which use a surrogate gradient), and do a scaled dot product in the middle. I do not use a hidden state, and I do not apply the model in an autoregressive manner (yet). So I am a bit surprised that the graph from the first sample is needed at all… Could saving and loading the model state after each iteration, or a similar approach, work here? (Although that sounds like a very bad solution to me.)

ptrblck · May 20, 2023, 9:08pm

This does indeed sound like a bad solution, and I would try to check which tensor might still be attached to a computation graph from the previous iteration. To do so, you could print the .grad_fn of e.g. the input and make sure it’s set to None in a new iteration.
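A minimal sketch of that check, under stated assumptions: `layer`, `x`, `state` and the helper `attached_tensors` are hypothetical stand-ins, not names from the model in this thread. A tensor whose .grad_fn is not None at the start of an iteration is still linked to an earlier graph.

```python
import torch

def attached_tensors(named):
    """Return the names of tensors that still carry a grad_fn."""
    return [name for name, t in named.items() if t.grad_fn is not None]

# Hypothetical stand-ins for a model layer, an input, and a carried-over tensor.
layer = torch.nn.Linear(4, 4)
x = torch.randn(2, 4)       # fresh input: a leaf tensor, so grad_fn is None
state = torch.zeros(2, 4)   # tensor reused across iterations

report = []
for step in range(2):
    # Check at the start of each iteration, before any new graph is built.
    report.append(attached_tensors({"x": x, "state": state}))
    out = layer(x) + state
    state = out  # carried over without detach(): stays attached to this graph

print(report)
```

Any name that shows up in the second iteration's list is a candidate for a detach() or a reset before the next forward pass.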

geeatwork777 · May 20, 2023, 9:20pm

Alright, will try that. Thank you so much for the hint!

geeatwork777 · May 21, 2023, 1:39am

The issue was in fact located in my usage of the SNN neurons. While the surrogate gradient works correctly, the neurons, as expected, retain their voltage from the previous iteration. This also became visible once I looked more carefully at the anomaly detection output. I have created a corresponding reset method for the model, and once I called it in the loop, the issue was resolved. Since surrogate gradient optimization for SNNs is a relatively new (and rare) topic, I did not find it mentioned as the cause anywhere online. Thank you again for your help @ptrblck, I would not have been able to identify the issue had I not removed retain_graph=True.
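For readers who hit the same error, here is a rough sketch of the mechanism in plain PyTorch. It is not the model from this thread and not SpikingJelly's API: `LeakyState`, its `reset()` method and `train_steps` are hypothetical names used only for illustration. A module that keeps its voltage between forward calls links each new graph to the previous one, even though no in-place operation is used.

```python
import torch

class LeakyState(torch.nn.Module):
    """Hypothetical stand-in for a stateful neuron (not SpikingJelly's API)."""

    def __init__(self):
        super().__init__()
        self.v = None  # voltage kept between forward calls

    def forward(self, x):
        if self.v is None:
            self.v = torch.zeros_like(x)
        # Out-of-place update, but the old self.v still points at the old graph.
        self.v = 0.5 * self.v + x
        return self.v

    def reset(self):
        self.v = None  # drop the state and its link to the old graph

def train_steps(reset_between_steps, steps=3):
    """Run a few optimizer steps; return the error message, or None on success."""
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(4, 4), LeakyState())
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    try:
        for _ in range(steps):
            loss = model(torch.randn(2, 4)).sum()
            opt.zero_grad(set_to_none=True)
            loss.backward()  # no retain_graph
            opt.step()
            if reset_between_steps:
                model[1].reset()
    except RuntimeError as err:
        return str(err)
    return None

print(train_steps(reset_between_steps=False))  # error on the second iteration
print(train_steps(reset_between_steps=True))   # None
```

Without the reset, the second backward call walks into the first iteration's graph, whose saved tensors have already been freed. Adding retain_graph=True keeps those tensors alive, but optimizer.step() has updated the parameters in place in the meantime, which is consistent with the version-mismatch error reported in the first post.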