Hi,
I’ve been working on N-gram language modelling: taking n words and predicting word n+1. My code works, but I have one particular doubt. For each word I obtain the output and the hidden state, and then pass the hidden state back in, for n steps in total. Finally, I compute the loss and update the parameters. My question is: because I’m overwriting the hidden state with every word, am I doing truncated BPTT? (Since only the last hidden vector is available when I do the backprop?)
def train(text_x_tensor1, label1):  #, text_x_tensor2, label2):
    text_x_tensor1, label1 = text_x_tensor1.to(device), label1.to(device)
    rnn1.train()
    hidden_1 = rnn1.initHidden()
    hidden_1 = hidden_1.to(device)
    text_x_tensor1 = text_x_tensor1.permute(1, 0, 2)  # (seq_len, batch, features)
    for i in range(len(text_x_tensor1)):  # for each word
        output_1, hidden_1 = rnn1(text_x_tensor1[i], hidden_1)
    loss1 = criterion(output_1, label1)
    optimizer1.zero_grad()
    loss1.backward()  # retain_graph=True would keep the intermediate buffers
    torch.nn.utils.clip_grad_norm_(rnn1.parameters(), 1)
    optimizer1.step()
    return output_1, loss1, hidden_1
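For contrast, here is a minimal, self-contained sketch (not my actual model — the `nn.RNN`, `head`, `chunk_len`, etc. are illustrative) of what *explicitly* truncated BPTT usually looks like: the hidden state is `.detach()`-ed between chunks, so each backward pass stops at the chunk boundary instead of flowing through the whole sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.RNN(input_size=8, hidden_size=16)          # toy RNN, illustrative sizes
head = nn.Linear(16, 4)                               # maps hidden state to 4 classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    list(model.parameters()) + list(head.parameters()), lr=0.1
)

seq = torch.randn(12, 1, 8)                # (seq_len, batch, features)
labels = torch.randint(0, 4, (12, 1))      # one label per time step
chunk_len = 4
hidden = torch.zeros(1, 1, 16)

for start in range(0, seq.size(0), chunk_len):
    chunk = seq[start:start + chunk_len]
    output, hidden = model(chunk, hidden)
    loss = criterion(head(output[-1]), labels[start + chunk_len - 1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # This detach is what makes the BPTT "truncated": the next chunk's
    # backward pass cannot flow into this chunk's computation graph.
    hidden = hidden.detach()
```

In my `train` function above there is no `detach()` inside the loop, so (as far as I understand) `loss1.backward()` should still propagate through all n steps of the sequence.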
Thanks in advance!