I am having an issue implementing a seq2seq network’s training loop. I have followed a plethora of tutorials and examples online (and yes, read a bunch of guidance from this forum and Stack Overflow), but even after trying those variations, when I run this loop:
def train_model_encdec(train_loader, output_indexer):
    """
    Function to train the encoder-decoder model on the given data.
    :param input_indexer: Indexer of input symbols
    :param output_indexer: Indexer of output symbols
    """
    encoder = EncoderRNN(input_size=238, hidden_size=768)
    decoder = AttnDecoderRNN(hidden_size=768, output_size=255)
    model = Seq2Seq(encoder, decoder)
    criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore padding
    model.optimizer = torch.optim.Adam(model.parameters(), lr=model.lr)
    print(model)
    epochs = 10  # TODO: Change back to 10
    model.train()
    for epoch in range(0, epochs):
        print('epoch num: ')
        print(epoch)
        epoch_loss = 0
        for sents, labels in train_loader:
            model.optimizer.zero_grad()
            output = model.forward(sents, labels)
            output = output[1:].view(-1, output.shape[-1])
            labels_t = torch.transpose(labels, 0, 1)
            labels = torch.reshape(labels_t[1:], (-1,))  # use reshape instead
            labels_long = labels.type(torch.LongTensor)
            loss = criterion(output, labels_long)
            epoch_loss += loss.item()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
            model.optimizer.step()
        loss_normalized = epoch_loss * 2  # ?
        print(loss_normalized)
I get the “Trying to backward through the graph a second time…” error.
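For context, this error can be reproduced in isolation whenever a second `backward()` call reaches a part of the graph that an earlier call has already freed. The sketch below is a toy example with made-up tensors (`w`, `h`) and a hypothetical function name, not my model; it only assumes that some intermediate result is computed once and then reused by two losses.

```python
import torch


def second_backward_error():
    """Return the error message raised by a second pass over a freed graph."""
    w = torch.randn(3, requires_grad=True)
    # Intermediate result computed once; both losses below share its graph.
    h = (w * 2).exp()

    loss1 = h.sum()
    loss1.backward()  # frees the tensors saved for the backward pass of h

    loss2 = (h * 3).sum()  # new loss, but it still flows back through h
    try:
        loss2.backward()
    except RuntimeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(second_backward_error())
```

In other words, the error does not require calling `backward()` twice on the same loss; it is enough that two losses share any piece of graph that needs saved tensors.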
Solutions I have tried that have failed:
- Initializing hidden states before running every batch
- Doing the above and also implementing a function that detaches the hidden states
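For reference, the detach approach in the list above is usually written along these lines. This is a generic sketch, not the exact code from my model: `detach_hidden` is a hypothetical helper name, and the toy `nn.GRU`, tensor sizes, and loss are assumptions chosen only to make the example run.

```python
import torch
import torch.nn as nn


def detach_hidden(hidden):
    """Return hidden state(s) cut off from the graph that produced them."""
    if isinstance(hidden, torch.Tensor):
        return hidden.detach()
    # LSTM-style states come as tuples such as (h, c); detach each element.
    return tuple(detach_hidden(h) for h in hidden)


torch.manual_seed(0)
rnn = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.1)

hidden = torch.zeros(1, 2, 8)  # (num_layers, batch, hidden_size)
losses = []
for _ in range(3):
    batch = torch.randn(2, 5, 4)  # (batch, seq_len, input_size)
    optimizer.zero_grad()
    # Without this line the second iteration would backpropagate into the
    # previous iteration's already-freed graph through `hidden`.
    hidden = detach_hidden(hidden)
    output, hidden = rnn(batch, hidden)
    loss = output.pow(2).mean()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Detaching only helps for tensors that are actually carried from one batch to the next; any other tensor created once with gradient tracking and reused across batches would need the same treatment.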
I’m also trying to avoid setting retain_graph=True as a solution because I don’t think I require it and I would not like to spend more time than necessary training.
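To illustrate what `retain_graph=True` would change, here is a toy sketch with made-up tensors and a hypothetical function name (not my model). The first call keeps the saved buffers alive so a second pass succeeds, at the cost of holding that memory, and the gradients from both passes accumulate in `.grad`.

```python
import torch


def backward_twice_with_retain():
    """Run two backward passes over one graph and return both gradient snapshots."""
    w = torch.ones(3, requires_grad=True)
    h = (w * 2).exp()

    h.sum().backward(retain_graph=True)  # saved tensors are kept alive
    first = w.grad.clone()

    h.sum().backward()  # succeeds only because the graph was retained
    second = w.grad.clone()  # gradients accumulate across the two passes
    return first, second


if __name__ == "__main__":
    first, second = backward_twice_with_retain()
    print(first, second)
```

This makes the error go away but does not explain why a second pass happens in the first place, which is the part I would like to understand.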
I decided to also compare similar code on an older machine.
On the machine where this training loop errors, I am running PyTorch 1.7.0 and Python 3.8.
However, I ran similar code on a different computer using PyTorch 1.3.0 and Python 3.6, and that training loop worked. The only difference between the two versions is the use of BertPretrained embeddings (on the machine that errors) rather than my own pretrained embedding layer.
I also ran both versions in the PyCharm debugger and noticed that they have the same types at the `loss = criterion(output, labels)` line, but when stepping into `loss.backward()`, the version that fails enters `Variable._execution_engine.run_backward(tensors, grad_tensors, retain_graph, create_graph, allow_unreachable=True)` twice. The first time it enters, the values are identical to those in the program that worked on the older PyTorch/Python version (except for the actual float values of the gradient tensor). The working version exits the function, while the broken one re-enters it.
This is why I think I’m getting the error about calling backward twice, but to my knowledge it’s not my code that’s calling it twice. I am also unable to step through `Variable._execution_engine.run_backward(tensors, grad_tensors, retain_graph, create_graph, allow_unreachable=True)` because it is implemented in C++.
Any help would be incredibly appreciated, as I’ve been banging my head against a wall for quite some time now.