Reentering Variable._execution_engine.run_backward() twice. Error: Trying to backward through the graph a second time, but the saved intermediate results have already been freed

Hello,

I am having an issue with implementing a seq2seq network’s training loop. I have followed a plethora of tutorials and examples online (and yes, read a bunch of guidance from this forum and Stack Overflow), but even with those permutations, when I run this loop
```python
def train_model_encdec(train_loader, output_indexer):
    """
    Train the encoder-decoder model on the given data.
    :param train_loader: DataLoader yielding (sentence, label) batches
    :param output_indexer: Indexer of output symbols
    :return:
    """
    encoder = EncoderRNN(input_size=238, hidden_size=768)
    decoder = AttnDecoderRNN(hidden_size=768, output_size=255)
    model = Seq2Seq(encoder, decoder)

    criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore padding
    model.optimizer = torch.optim.Adam(model.parameters(), lr=model.lr)
    print(model)

    epochs = 10  # TODO: Change back to 10
    model.train()
    for epoch in range(0, epochs):
        print('epoch num:', epoch)
        epoch_loss = 0
        for sents, labels in train_loader:
            model.optimizer.zero_grad()
            output = model(sents, labels)

            # drop the first (start-of-sequence) position and flatten for CrossEntropyLoss
            output = output[1:].view(-1, output.shape[-1])
            labels_t = torch.transpose(labels, 0, 1)
            labels = torch.reshape(labels_t[1:], (-1,))  # use reshape instead

            labels_long = labels.type(torch.LongTensor)
            loss = criterion(output, labels_long)
            epoch_loss += loss.item()

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
            model.optimizer.step()

        loss_normalized = epoch_loss * 2  # ?
        print(loss_normalized)
```
I get the “Trying to backward through the graph a second time…” error.
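For context, my understanding is that this error means some tensor feeding into the loss still carries graph history that an earlier `backward()` call has already freed. A minimal sketch (unrelated to my actual code) that reproduces the same message:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.exp()          # exp() saves its output as an intermediate result for backward

y.sum().backward()   # first backward frees the saved intermediate results
y.sum().backward()   # second backward through the same exp() node ->
                     # "Trying to backward through the graph a second time..."
```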

Solutions I have tried that have failed:

  1. Initializing hidden states before running every batch
  2. Doing step 1 and also implementing a function to detach the hidden states between batches (sketched after this list)
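By “a detach hidden states function” I mean roughly the usual pattern below (a sketch; `hidden` stands for whatever state my encoder/decoder carry between batches):

```python
import torch

def detach_hidden(hidden):
    """Cut the hidden state loose from the previous batch's graph (rough sketch)."""
    if isinstance(hidden, torch.Tensor):
        return hidden.detach()
    # e.g. an LSTM carries a (h_n, c_n) tuple
    return tuple(detach_hidden(h) for h in hidden)
```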

I’m also trying to avoid setting `retain_graph=True` as a solution, because I don’t think I actually need it and I would rather not spend more time than necessary training.

I decided to also compare similar code on an older machine.
On the machine where this training loop errors, I am running PyTorch 1.7.0 and Python 3.8.

However, I ran similar code on a different computer using PyTorch 1.3.0 and Python 3.6, and that training loop worked. The only difference between the two versions is the use of pretrained BERT embeddings (on the machine that errors) rather than my own pretrained embedding layer.

I also ran both versions under the PyCharm debugger and noticed that both have the same types at the `loss = criterion(output, labels)` line, but when stepping into `loss.backward()`, the version that fails enters `Variable._execution_engine.run_backward(tensors, grad_tensors, retain_graph, create_graph, allow_unreachable=True)` twice. The first time it enters, the values are identical to those of the program that worked on the older PyTorch/Python versions (except for the actual float value of the gradient). The working version exits the function, while the broken one re-enters it.

This is why I think I’m getting the error about calling backward twice, but to my knowledge, it’s not my code that’s calling it twice. I am also unable to step through `Variable._execution_engine.run_backward(...)` because of its C++ implementation.

Any help would be incredibly appreciated, as I’ve been banging my head against a wall for quite some time now.

Hey guys, I figured it out.

As someone new to using BERT encodings, I forgot to wrap the code that generates my BERT embeddings in a `with torch.no_grad():` block.
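Concretely, the fix looks roughly like the sketch below. The names are just for illustration (I’m using Hugging Face’s `BertModel`/`BertTokenizer` here, and `texts` stands for a batch of raw sentences); the important part is that the BERT forward pass that produces the embeddings runs under `torch.no_grad()`, so its graph never gets attached to the seq2seq loss:

```python
import torch
from transformers import BertModel, BertTokenizer  # or whichever BERT package you use

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed(texts):
    """Return BERT embeddings with no autograd history attached (illustrative sketch)."""
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():        # <-- the piece I was missing
        outputs = bert(**batch)
    return outputs[0]            # last hidden states, free of any graph
```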

I pray for any new NLP learners out there to find this post and not spend 12+ hours debugging. Good luck to you all!