Can we train a model with only a BiLSTM layer?

Hi

I am experimenting with a small network that is part of another, larger network, to find out whether the experimental model computes gradients. The experimental setup contains only a BiLSTM layer (the model deliberately has no linear layer) and receives an input of size torch.Size([64, 256]). The model structure is as follows:

import torch
import torch.nn as nn

class Experimental_(nn.Module):

    def __init__(self):
        super().__init__()
        # 2-layer bidirectional LSTM: 256 input features, 128 hidden units per direction
        self.lstm = nn.LSTM(256, 128, 2, batch_first=True, bidirectional=True)

    def forward(self, input):
        lstm_output, (h, c) = self.lstm(input)
        # forward and backward states are concatenated, giving 2 * 128 features
        return lstm_output.view(-1, 128 * 2)
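
For reference, here is a quick shape check I ran on the layer in isolation (a standalone snippet with a fresh layer and random data, not my actual inputs). With batch_first=True, a 2-D input is treated as a single unbatched sequence, so torch.Size([64, 256]) is read as sequence length 64 with 256 features:

import torch
import torch.nn as nn

lstm = nn.LSTM(256, 128, 2, batch_first=True, bidirectional=True)
x = torch.randn(64, 256)               # unbatched: (seq_len=64, features=256)
out, (h, c) = lstm(x)
print(out.shape)                       # torch.Size([64, 256]): 2 directions * 128
print(out.view(-1, 128 * 2).shape)     # torch.Size([64, 256]): same layout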

The training method is designed as follows:

def train():
    model.train()
    model.zero_grad()
    output_ = model(input)
    loss = lossFunction(output_, train_y)
    loss.backward()
    # aim is to check the gradient values of the BiLSTM layers
    # for name, param in model.named_parameters():
    #     print(name, param.grad)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    

I use the following driver code:

from torch.optim import AdamW

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Experimental_()
optimizer = AdamW(model.parameters(), lr=2e-5)
lossFunction = nn.NLLLoss()
epochs = 1
current = 1
while current <= epochs:
    train()
    current = current + 1

But when I execute this, I receive the following error:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
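
For context, my understanding is that this message normally appears when backward() is called a second time through the same graph, as in the toy snippet below (my own minimal example, unrelated to the model above), yet in my training method I call backward() only once per step:

import torch

w = torch.randn(3, requires_grad=True)
loss = (w * 2).sum()
loss.backward()   # first backward frees the graph's saved intermediate values
loss.backward()   # raises: "Trying to backward through the graph a second time ..."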

Could you please help me understand what is missing in these lines of code?

Which version of torch are you using?
The following runs without error for me locally (built from source recently) and on Colab:

import torch

class Experimental_(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(256, 128, 2, batch_first=True, bidirectional=True)

    def forward(self, input):
        lstm_output, (h, c) = self.lstm(input)
        return lstm_output.view(-1, 128 * 2)

def train():
    model.train()
    model.zero_grad()
    output_ = model(inp)
    loss = lossFunction(output_, train_y)
    loss.backward()
    # aim is to check the gradient values of the BiLSTM layers
    # for name, param in model.named_parameters():
    #     print(name, param.grad)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

inp = torch.randn(10, 256)
train_y = torch.randint(0, 10, (10,))
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Experimental_()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
lossFunction = torch.nn.NLLLoss()
epochs = 1
current = 1
while current <= epochs:
    train()
    current = current + 1
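
Note that in my snippet inp is a fresh random tensor with no autograd history. Since you mention your model is part of a larger network, one possible cause (just a guess, I cannot tell from the posted code) is that input is produced once by another module and then reused across training steps: the first backward() frees the saved tensors of that upstream graph, and every later backward() tries to walk through it again. A minimal sketch, with hypothetical upstream and head modules standing in for your setup:

import torch

upstream = torch.nn.Linear(256, 256)     # hypothetical stand-in for the larger network
head = torch.nn.Linear(256, 2)           # hypothetical stand-in for the experimental model

shared = upstream(torch.randn(64, 256))  # computed once; still tied to upstream's graph

try:
    for _ in range(2):
        head(shared).sum().backward()    # 2nd pass re-enters upstream's freed graph
except RuntimeError as e:
    print(e)                             # "Trying to backward through the graph a second time ..."

inp2 = shared.detach()                   # cut the tie to the upstream graph
for _ in range(2):
    head(inp2).sum().backward()          # fine: each step builds (and frees) its own graph

If that is what happens in your setup, detaching the input (or recomputing it inside train()) should remove the error.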