Hi everyone, I am learning LSTM. I tried the official LSTM example, which trains as follows:
```python
for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance.
        model.zero_grad()

        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden()

        # Step 2. Get our inputs ready for the network, that is,
        # turn them into Variables of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters
        # by calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
```
However, I have a question about the backpropagation:

```python
loss = loss_function(tag_scores, targets)
loss.backward()
optimizer.step()
```

These three lines seem to have nothing to do with the sequence steps, but I think an LSTM needs to be trained with BPTT. Could you tell me why this works? Moreover, when should BPTT be applied, and how can it be realized in PyTorch? Thank you in advance.
Doesn’t the initialization of the hidden state at each iteration over the data in the quoted example effectively result in truncated BPTT (compare this posting)?
Thank you, but I should have expressed myself more clearly. This example already does BPTT over each sentence, without truncation. Could you tell me how to realize truncated BPTT for this example?
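To see why a single `loss.backward()` already amounts to BPTT over the whole sentence: autograd records the full unrolled graph when the LSTM consumes the sequence, so gradients flow back through every timestep automatically. A minimal sketch (the layer sizes and sequence length here are made up, not the model from the example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3)
seq = torch.randn(10, 1, 4, requires_grad=True)  # (seq_len, batch, features)

out, _ = lstm(seq)   # the autograd graph spans all 10 timesteps
loss = out.sum()     # stand-in for a real loss
loss.backward()

# Gradients reach even the first timestep of the input: full BPTT
# within the sequence, with no extra code needed.
print(seq.grad[0].abs().sum().item() > 0)
```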
To do BPTT with truncation, you would need to cut the input into subsequences and train on each subsequence separately, but in order. The subsequences need to be fed to the model in order so that the last hidden state from the end of each subsequence is used at the beginning of the next subsequence.
The data input is sentence_in and the corresponding targets are targets.
You would have to split sentence_in and targets into corresponding chunks along the sequence dimension.
From my point of view, the batch_size and sequence length used in this demo are 1 and the length of a sentence, respectively.
In other words, each sentence is a batch (because the loss is calculated once per sentence), and the loss is propagated back through the whole sentence.
I hope I got the logic right, since I'm also a noob playing around with PyTorch.