LSTM for classification (fraud detection) over several lines of text

Hey everyone,
I'm having a bit of trouble finding the right approach here:

I have log data where an event spans several lines of text, each line with a timestamp (converted to a timedelta relative to the previous line for the dataset). Events are arbitrary in length, typically around 20 lines, and the lines are not grouped by event. I want to binary-classify the individual lines into suspicious and normal; the labels are sparse (roughly 0.95 normal vs. 0.05 suspicious in the training data).
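
For illustration, the dataframe has one row per log line with an 'inputs' text column and a binary 'label' column, as used in the code below (toy rows only, the real values obviously differ):

  import pandas as pd

  # Toy rows only -- real data and preprocessing differ; 'inputs' holds the line text
  # (with the timedelta to the previous line), 'label' is 0 = normal, 1 = suspicious
  df = pd.DataFrame({
      'inputs': [
          "0.000 session opened for user admin",
          "0.120 accepted password for admin from 10.0.0.5",
          "3.450 failed password for root from 203.0.113.7",
      ],
      'label': [0, 0, 1],
  })
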
My approach is to take 48-line windows and join them into a single token sequence, where the label comes from the line in the middle (line 24), so that the model has enough context:


  import torch
  from torch.utils.data import Dataset


  class TokenizedDataset(Dataset):
      def __init__(self, dataframe, tokenizer, max_length=32, window_size=48):
          self.data = dataframe
          self.tokenizer = tokenizer  # HuggingFace-style tokenizer (uses encode_plus below)
          self.max_length = max_length  # Tokens per line
          self.window_size = window_size  # Total lines per sample (24 before + 1 + 23 after)
  
      def __len__(self):
          return len(self.data)
  
      def __getitem__(self, idx):
          # Define start and end indices for the window; the slice end is exclusive,
          # so this covers 24 lines before idx and 23 after. Near the start/end of the
          # dataframe the window is clipped and the labeled line is no longer centered.
          start_idx = max(0, idx - self.window_size // 2)
          end_idx = min(len(self.data), idx + self.window_size // 2)

          # Extract window data
          window_data = self.data.iloc[start_idx:end_idx]
  
          # Combine all traces within the window into one input text
          text = " ".join(window_data['inputs'].tolist())
  
          # The label is from the current `idx`
          label = self.data['label'].iloc[idx]  
  
          # Tokenize and pad the input text
          encoding = self.tokenizer.encode_plus(
              text,
              add_special_tokens=True,
              max_length=self.max_length * self.window_size, 
              padding='max_length',
              truncation=True,
              return_tensors='pt'
          )
  
          # Get the input_ids (remove batch dimension)
          input_ids = encoding['input_ids'].squeeze(0)
  
          return {
              'input_ids': input_ids,
              'label': torch.tensor(label, dtype=torch.long)
          }
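
For completeness, this is roughly how I wire it up. The tokenizer is a HuggingFace one; which checkpoint I use is an assumption here, it's only needed for its vocabulary, not for a transformer model:

  from torch.utils.data import DataLoader
  from transformers import AutoTokenizer

  # Hypothetical setup -- the actual tokenizer/checkpoint may differ
  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  dataset = TokenizedDataset(df, tokenizer, max_length=32, window_size=48)
  loader = DataLoader(dataset, batch_size=16, shuffle=True)

  batch = next(iter(loader))
  # batch['input_ids']: [batch_size, 32 * 48] = [batch_size, 1536]
  # batch['label']:     [batch_size]
  # Note: 1536 exceeds BERT's 512-token limit, but that only matters for the BERT
  # model itself; here the tokenizer is just used to produce IDs for the LSTM.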

How should I tweak the model's forward pass so that the prediction is actually made for the line in the middle?

  import torch.nn as nn


  class BiLSTM(nn.Module):
      def __init__(self, vocab_size, embedding_dim=128, hidden_dim=64, output_dim=2, num_hidden=2, dropout=0.3):
          super(BiLSTM, self).__init__()
          self.embedding = nn.Embedding(vocab_size, embedding_dim)
          self.hidden_dim = hidden_dim
          self.num_hidden = num_hidden
          
          self.lstm_layers = nn.ModuleList()
          self.lstm_layers.append(nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True))
          self.dropout = nn.Dropout(dropout)
          
          for _ in range(1, num_hidden):
              self.lstm_layers.append(nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True))
          
          self.fc = nn.Linear(2 * hidden_dim, output_dim) 
  
      def forward(self, x):
          x = self.embedding(x)
          for i, lstm in enumerate(self.lstm_layers):
              x, _ = lstm(x)
              if i < len(self.lstm_layers) - 1:
                  x = self.dropout(x)
          
          # Use the output at the last time step, i.e. the last token of the whole window
          x = x[:, -1, :]  # Shape: [batch_size, 2 * hidden_dim]
  
          # Pass the last hidden state through the fully connected layer
          x = self.fc(x)  # Shape: [batch_size, output_dim]
          
          return x
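
Hooked up to a batch from the loader above, the shapes come out like this:

  model = BiLSTM(vocab_size=len(tokenizer))  # vocab size taken from the tokenizer above

  logits = model(batch['input_ids'])  # [batch_size, 2] -- one prediction per window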

For now, I guess taking the output at the last time step is the wrong approach, since that would correspond to the last line of the window rather than the middle one, right?

Any ideas/recommendations on how to approach this?
Right now the model only learns to predict the majority class…
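
(Side note: for the 0.95/0.05 imbalance I assume I should at least weight the loss, roughly as below, but I don't think that alone answers the middle-line question.)

  import torch
  import torch.nn as nn

  # Rough inverse-frequency weights for ~95% normal / ~5% suspicious;
  # the exact weights would be computed from the actual training labels
  class_weights = torch.tensor([1.0 / 0.95, 1.0 / 0.05])
  criterion = nn.CrossEntropyLoss(weight=class_weights)
  # loss = criterion(logits, batch['label'])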

Best,
Seb