Why is the previous input symbol's tensor detached in the seq2seq decoder in NMT?

Consider the seq2seq NMT tutorial in the PyTorch documentation.

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

On this page, in the train method (towards the end of the page), we have decoder_input = topi.squeeze().detach()  # detach from history as input

My question is: why do we have to detach this tensor from its history? It is the decoded word/character with the highest probability from the decoder at the previous time step, and it is fed into the next time step's GRU decoding along with the attention-weighted encoder_outputs. So, theoretically, we would want the gradients of this tensor, i.e. decoder_input, to be computed with respect to all the time steps it directly or indirectly (through the GRU) affected. That means we should not detach its history, so that gradients can also flow back to the previous time steps.

In fact, this is what happens in RNN or LSTM cells too: the gradients of the memory cell, forget gate, input gate, etc. are allowed to flow across all time steps, and we never explicitly detach any tensor. So why is the decoder_input tensor detached in this case? To me this seems like a major bug that undermines the theoretical basis of the encoder-decoder-with-attention NMT model, and thereby its accuracy.
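To make the comparison concrete, here is a minimal toy example of my own (it is not from the tutorial, and the sizes are made up): an ordinary GRU cell unrolled by hand, where the hidden state is carried from step to step without any detach, so loss.backward() sends gradients through every time step.

import torch
import torch.nn as nn

# Toy example: a GRU cell unrolled over 5 time steps.
# The hidden state is carried across steps WITHOUT .detach(),
# so backpropagation through time reaches all steps.
gru_cell = nn.GRUCell(input_size=8, hidden_size=16)
inputs = torch.randn(5, 1, 8)              # 5 time steps, batch size 1
hidden = torch.zeros(1, 16)

outputs = []
for t in range(inputs.size(0)):
    hidden = gru_cell(inputs[t], hidden)   # no detach: graph history is kept
    outputs.append(hidden)

loss = torch.stack(outputs).sum()
loss.backward()                            # gradients flow back through all 5 steps

The relevant part of the tutorial's train function is: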

def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()
    ...
    ...
    
    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input
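
For reference, here is a stripped-down, self-contained version of that feedback loop that I wrote myself just to show where the detach sits (no attention, made-up vocabulary and hidden sizes, and index 0 standing in for the SOS token); the hidden state keeps its full history, but the fed-back prediction does not:

import torch
import torch.nn as nn

# My own simplified stand-in for the tutorial's decoder (no attention),
# only to illustrate how the previous prediction is fed back.
vocab_size, hidden_size = 10, 16
embedding = nn.Embedding(vocab_size, hidden_size)
gru_cell = nn.GRUCell(hidden_size, hidden_size)
out_proj = nn.Linear(hidden_size, vocab_size)

decoder_input = torch.tensor([0])                    # assume index 0 is the SOS token
decoder_hidden = torch.zeros(1, hidden_size)

for di in range(5):
    emb = embedding(decoder_input)                   # (1, hidden_size)
    decoder_hidden = gru_cell(emb, decoder_hidden)   # hidden state: NOT detached
    decoder_output = out_proj(decoder_hidden)        # (1, vocab_size)
    topv, topi = decoder_output.topk(1)              # greedy pick, as in the tutorial
    decoder_input = topi.view(1).detach()            # the line my question is about

What I would have expected is the same loop with the last line as decoder_input = topi.view(1), i.e. without the detach, so that the history of the earlier decoding steps is kept.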