LSTM Binary Classifier Backward() Error During Training

Hi, I am getting this error. I have looked at other forums but nothing has worked, and I'm not sure why it is happening in the first place.

Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

My code is as follows:

class LSTM(nn.Module):
    def __init__(self, embedding_dim, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, src):
        lstm_out, (ht, ct) = self.lstm(src)
        return self.fc(output).squeeze()

model = LSTM(1462, 20)

optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
criterion = F.cross_entropy

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):

        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')

Your code uses an undefined output tensor in self.fc(output), so check where this tensor is coming from and if it’s causing the error.

The above answer is right. I think you want to use the last hidden state to do this; see the code below. Also, you might want to do gradient clipping before you call optimizer.step().

self.fc(ht).squeeze()
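
Something along these lines (a rough sketch of the suggestion, not tested against your data; ht[-1] picks the final hidden state of the last layer, reusing the dropout layer you already defined, and the max_norm of 1.0 for clipping is just an example value):

def forward(self, src):
    # lstm_out: (batch, seq_len, hidden_dim); ht: (num_layers, batch, hidden_dim)
    lstm_out, (ht, ct) = self.lstm(src)
    # use the final hidden state of the last layer as the sequence representation
    return self.fc(self.dropout(ht[-1])).squeeze(-1)

and in the training loop, clipping goes between backward() and step():

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()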

Why use ht instead of lstm_out?

Actually, maybe not, but my interpretation was that you want a representation of the text (so use the last hidden state to represent the entire text whose sentiment, or whatever else, you want to predict). The output is per state (time step). I'm guessing you don't want a prediction for each step unless you have a language model, just one per sentence or text, so use the last hidden state as the representation. No? He could also combine the hidden states in some way and then feed that to the linear layer, but I'm unsure.

In TensorFlow, the output of the last time step is passed to a dense layer if it's required for prediction. Here's the text to look for in this link.

return_sequences Boolean. Whether to return the last output in the output sequence, or the full sequence. Default: False.

Here’s an example.
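
For instance, a minimal Keras sketch of that setup (the layer sizes here are illustrative, not taken from the thread):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1462, 128),               # vocab size / embedding dim are placeholders
    tf.keras.layers.LSTM(20, return_sequences=False),   # only the last time step's output is returned
    tf.keras.layers.Dense(1, activation='sigmoid'),
])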

Aha, right. So this line: lstm_out, (ht, ct) = self.lstm(src) returns the full output sequence AND the last hidden and cell states, per batch. If it's batch_first you have lstm_out[:, -1, :] == ht[-1]. So I'm thinking he wants to use the last step's hidden state. PyTorch always returns (the full sequence of hidden states, (last hidden, last cell)) … batch_first controls whether you get N x L x H or L x N x H, where L is the length in time, N is the batch size, and H is the hidden size.

Agreed (based on batch_first=True): (outputs[:, -1] == hidden_states[-1]).all()
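
For example, a small self-contained check (sizes are arbitrary):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=20, batch_first=True)
x = torch.randn(4, 10, 8)                  # batch of 4, 10 time steps, 8 features
outputs, (hidden_states, cell_states) = lstm(x)

# outputs: (4, 10, 20); hidden_states: (1, 4, 20)
print(torch.allclose(outputs[:, -1], hidden_states[-1]))   # True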

Since this topic is on a binary classification task, using the output of the last time step makes sense to me as well.

Note that this only holds for bidirectional=False.


You are right about that.


Hi, could you please explain this? You said I should use the output of the last time step, but right now I am just using the entire lstm_out as the input to my linear layer. Should I be using something different? How would this change if bidirectional=True? Thank you all for your helpful replies!

You can have a look at the older post.

One way to handle it can be found here; search for # Handle directions in the forward() method.
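
Roughly, that pattern looks like this (my own paraphrase of the idea, not the code from the linked post; it assumes the model stores self.num_directions and that self.fc was built with hidden_dim * num_directions input features):

def forward(self, src):
    lstm_out, (ht, ct) = self.lstm(src)
    if self.num_directions == 2:
        # last layer's final forward state is ht[-2], final backward state is ht[-1]
        feature = torch.cat((ht[-2], ht[-1]), dim=1)
    else:
        feature = ht[-1]
    return self.fc(feature).squeeze(-1)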

This is tricky, and you are using bidirectional=False (the default), but here is my understanding of it. The resource is LSTM — PyTorch 1.13 documentation, which says: “For bidirectional LSTMs, h_n is not equivalent to the last element of output; the former contains the final forward and reverse hidden states, while the latter contains the final forward hidden state and the initial reverse hidden state.”

Here is an example.

Imagine no batches, so dimensions are L X D for everything.

Basically, imagine you do sentiment analysis and you have a sentence that tokenizes to [1, 2, 3, 4], which you embed into a 4 x 128 matrix. If you run a forward RNN, you should use the last hidden state. This is hidden, or equivalently output[-1, :].

If you use a bidirectional RNN, you'd probably want to feed hidden and NOT output to your softmax. output in this case is [(h1_forward, h1_backward), (h2_forward, h2_backward), (h3_forward, h3_backward), (h4_forward, h4_backward)]. But the backward RNN starts at step 4, so in (h4_forward, h4_backward) the first element contains the forward RNN's encoding of the whole sentence, while the second element has almost no information (yet) as far as the backward RNN is concerned.

An “encoding” of the sentence is thus (h4_forward, h1_backward): the result of the forward RNN's full pass concatenated with the result of the backward RNN's full pass. This is probably what you'd like to feed to the classifier head.

I.e., when you have a bidirectional RNN, the last output IS NOT the same as the hidden state that gets returned.
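
To make that concrete in PyTorch (a small check with arbitrary sizes):

import torch
import torch.nn as nn

H = 20
lstm = nn.LSTM(input_size=8, hidden_size=H, batch_first=True, bidirectional=True)
x = torch.randn(4, 10, 8)
output, (h_n, c_n) = lstm(x)          # output: (4, 10, 2*H), h_n: (2, 4, H)

# final forward state == forward half of the LAST time step's output
print(torch.allclose(h_n[0], output[:, -1, :H]))   # True
# final backward state == backward half of the FIRST time step's output
print(torch.allclose(h_n[1], output[:, 0, H:]))    # True

# the sentence encoding to feed a classifier head:
encoding = torch.cat((h_n[0], h_n[1]), dim=1)      # shape (4, 2*H)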
