LSTM overfits dramatically - Similarity of texts - what is my model doing wrong?

I’m trying to grade the similarity of two text inputs. My approach is an LSTM network that takes the text of one sample as input. I then combine the LSTM outputs of the two texts and pass them through fully connected layers, the last of which has an output size of 1: the similarity score. When I train the model, the loss does decrease somewhat, but in validation the model outputs a nearly identical value for every sample. I’ve tried several learning rates, ranging from 10^-5 to 10^-1.

I’m using one-hot vectors to encode the data. Each text input is padded to equal length, so the input shape is (batch_size=1, seq_len=text_length, input_size=one_hot_vector_length).
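To make the shapes concrete, here is a minimal sketch of the kind of one-hot encoding and padding I mean (simplified, with a made-up character-level vocabulary; not my actual preprocessing code):

import torch

# Simplified character-level vocabulary, purely for illustration
vocab = list("abcdefghijklmnopqrstuvwxyz ")
char_to_idx = {c: i for i, c in enumerate(vocab)}

def encode(text, max_len):
    # One-hot encode each character, padding/truncating to max_len
    one_hot = torch.zeros(max_len, len(vocab))
    for pos, ch in enumerate(text[:max_len]):
        one_hot[pos, char_to_idx[ch]] = 1.0
    # Add the batch dimension -> (batch_size=1, seq_len=max_len, input_size=len(vocab))
    return one_hot.unsqueeze(0)

x = encode("an example text", max_len=50)
print(x.shape)  # torch.Size([1, 50, 27])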

I’m quite convinced the problem is some clear mistake in my model definition or forward function. If you can point out any mistakes or share any insights, that would be great.

Overall feedback and guidance on how I’ve modelled the problem and designed the network are also greatly appreciated, since I’m very much a beginner with PyTorch and LSTMs.

Thanks!

Below is the code for the model

import torch
import torch.nn as nn

class Model( nn.Module ):
    def __init__(self, input_size, hidden_size, num_layers, output_size, input_size_2, batch_size):
        super().__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.input_size_2 = input_size_2
        self.batch_size = batch_size
        
        #Layers for analysing model - used for one text input
        self.lstm = nn.LSTM( input_size, hidden_size, num_layers )
        self.fc1 = nn.Linear(hidden_size, output_size)
        
        #Layers for comparing model - input is concatenated outputs of two text-inputs passed to lstm & fc1
        self.fc2 = nn.Linear(input_size_2, 512)
        self.fc3 = nn.Linear(512, 1) # 1 output - the similarity
        
    def forward(self, inputs ):
        i1, i2 = inputs # Two text inputs
        x1, hidden1 = i1
        x2, hidden2 = i2
        
        #Pass first text-input
        x1, (hidden1, cell1) = self.lstm(x1, hidden1)
        x1 = hidden1.view(-1)
        x1 = self.fc1(x1)
        
        #Pass second text-input
        x2, (hidden2, cell2) = self.lstm(x2, hidden2)
        x2 = hidden2.view(-1)
        x2 = self.fc1(x2)
        
        #Calculate similarity based on both texts' outputs
        outs = torch.cat( ( x1, x2 ) )
        outs = self.fc2(outs)
        outs = self.fc3(outs)
        return outs
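
To make the expected input format concrete, here is roughly how one sample is structured before it goes into forward (illustrative dimensions and names; I’m showing the initial state as the (h_0, c_0) tuple that nn.LSTM takes):

import torch

# Illustrative values only; the real ones come from my data and hyperparameters
num_layers, hidden_size, seq_len, input_size = 1, 64, 50, 27

text1_tensor = torch.zeros(1, seq_len, input_size)  # placeholder encoded text 1
text2_tensor = torch.zeros(1, seq_len, input_size)  # placeholder encoded text 2

h0 = torch.zeros(num_layers, 1, hidden_size)  # initial hidden state
c0 = torch.zeros(num_layers, 1, hidden_size)  # initial cell state

# forward() unpacks its argument as ((x1, hidden1), (x2, hidden2)),
# where each hidden is an (h_0, c_0) pair for the LSTM
sample = ((text1_tensor, (h0, c0)), (text2_tensor, (h0, c0)))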

And below is the training loop

net = Model( input_size_1, hidden_size, n_layers, output_size, input_size_2, batch_size )
optimizer = torch.optim.Adam( net.parameters(), lr=lr )
loss_function = nn.L1Loss()

def train( n_epochs ):
    tX, ty = prepare_data( train_X, train_y, is_train=True )
    vX, vy = prepare_data( valid_X, valid_y )

    for i in range(n_epochs):
        for count, x in enumerate(tX):
            optimizer.zero_grad()
            
            y = ty[count]
            out = net(x)
            
            loss = loss_function( out, y )
            loss.backward()
            optimizer.step()

And finally the validation

with torch.no_grad():
    for count, x in enumerate(vX):
        y = vy[count]
        out = net(x) # Here, output is identical for every validation sample
        loss = loss_function( out, y )