Truncated BPTT for Language Modelling

I’ve been working with N-gram language modelling: taking n words and predicting word n+1. My code works, but I have one doubt in particular. For each word I obtain the output and the hidden state, then feed the hidden state back in for n steps. Finally, I compute the loss and update the gradients. My question is: because I’m overwriting the hidden state variable with every word, am I doing truncated BPTT? (Since only the last hidden vector is available when I do the backprop?)

def train(text_x_tensor1, label1):
    hidden_1 = rnn1.initHidden()
    text_x_tensor1 = text_x_tensor1.permute(1, 0, 2)  # -> (seq_len, batch, features)
    for i in range(len(text_x_tensor1)):  # feed one word at a time
        output_1, hidden_1 = rnn1(text_x_tensor1[i], hidden_1)
    loss1 = criterion(output_1, label1)  # loss on the last output only
    loss1.backward()  # retain_graph=True would keep the intermediate buffers
    return output_1, loss1, hidden_1
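For comparison, here is what I understand explicit truncation would look like: you carry the hidden state across chunks but call .detach() on it so backprop stops at the chunk boundary. This is just a sketch with a dummy nn.RNN, a made-up linear head, and random data, not my actual model:

```python
import torch
import torch.nn as nn

# Toy stand-ins for rnn1 / criterion in my code above (assumed shapes)
rnn = nn.RNN(input_size=8, hidden_size=16)
head = nn.Linear(16, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    list(rnn.parameters()) + list(head.parameters()), lr=0.1
)

seq = torch.randn(12, 1, 8)          # (seq_len, batch, features), dummy input
labels = torch.randint(0, 10, (4,))  # one dummy label per chunk
hidden = torch.zeros(1, 1, 16)
k = 3  # truncation length: backprop through at most k steps

for chunk_idx, start in enumerate(range(0, 12, k)):
    hidden = hidden.detach()  # cut the graph: gradients stop at this boundary
    out, hidden = rnn(seq[start:start + k], hidden)
    loss = criterion(head(out[-1]), labels[chunk_idx:chunk_idx + 1])
    optimizer.zero_grad()
    loss.backward()  # backprop only through this chunk's k steps
    optimizer.step()
```

Without the .detach(), the state would keep the whole history in the graph (and a second backward would need retain_graph=True).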

Thanks in advance!