Hi,
I’ve been working on N-gram language modelling: taking n words and predicting word n+1. My code works, but I have one particular doubt. For each word I obtain the output and the hidden state, and then pass the hidden state back in, for n steps in total. Finally, I compute the loss and update the parameters. My question is: because I’m overwriting the hidden state with every word, am I doing truncated BPTT? (Since only the last hidden vector is available when I do the backprop?)
def train(text_x_tensor1, label1):  #, text_x_tensor2, label2):
    text_x_tensor1, label1 = text_x_tensor1.to(device), label1.to(device)
    rnn1.train()
    hidden_1 = rnn1.initHidden()
    hidden_1 = hidden_1.to(device)
    text_x_tensor1 = text_x_tensor1.permute(1, 0, 2)  # (seq_len, batch, features)
    for i in range(len(text_x_tensor1)):  # for each word
        output_1, hidden_1 = rnn1(text_x_tensor1[i], hidden_1)
    loss1 = criterion(output_1, label1)
    optimizer1.zero_grad()
    loss1.backward()  # retain_graph=True would keep the intermediate buffers
    torch.nn.utils.clip_grad_norm_(rnn1.parameters(), 1)
    optimizer1.step()
    return output_1, loss1, hidden_1
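For contrast, here is a minimal, self-contained sketch (not my actual model — the `nn.RNN`, `head`, `chunk_len`, etc. are illustrative) of what *explicitly* truncated BPTT usually looks like: the hidden state is `.detach()`-ed between chunks, so each backward pass stops at the chunk boundary instead of flowing through the whole sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.RNN(input_size=8, hidden_size=16)          # toy RNN, illustrative sizes
head = nn.Linear(16, 4)                               # maps hidden state to 4 classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    list(model.parameters()) + list(head.parameters()), lr=0.1
)

seq = torch.randn(12, 1, 8)                # (seq_len, batch, features)
labels = torch.randint(0, 4, (12, 1))      # one label per time step
chunk_len = 4
hidden = torch.zeros(1, 1, 16)

for start in range(0, seq.size(0), chunk_len):
    chunk = seq[start:start + chunk_len]
    output, hidden = model(chunk, hidden)
    loss = criterion(head(output[-1]), labels[start + chunk_len - 1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # This detach is what makes the BPTT "truncated": the next chunk's
    # backward pass cannot flow into this chunk's computation graph.
    hidden = hidden.detach()
```

In my `train` function above there is no `detach()` inside the loop, so (as far as I understand) `loss1.backward()` should still propagate through all n steps of the sequence.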
Thanks in advance!