LSTM loss not improving

Background

I am using an LSTM to model sequential events for a binary classification problem. The dataset is

  • highly imbalanced (ratio = 0.1; a resampling sketch follows this list)
  • fewer than 100 unique tokens
  • tokens do not map strongly to natural language (each token is a semi-natural-language encoding of an event)
  • a large number of datapoints (>10^6)
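
One common data-side mitigation for this kind of imbalance (not part of the original setup) is to oversample the rare class with a weighted sampler; all names and sizes below are placeholders:

# Sketch only: oversample the rare positive class with a WeightedRandomSampler.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

train_labels = torch.randint(0, 2, (1000,))                 # placeholder 0/1 targets
class_counts = torch.bincount(train_labels)                 # [n_negative, n_positive]
sample_weights = 1.0 / class_counts[train_labels].float()   # rarer class -> larger weight

sampler = WeightedRandomSampler(sample_weights, num_samples=len(train_labels), replacement=True)
train_dataset = TensorDataset(torch.arange(len(train_labels)), train_labels)
train_loader = DataLoader(train_dataset, batch_size=256, sampler=sampler)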

To improve performance, an additional fully connected NN branch for static features (postcode, height, violations, etc.) is combined with the LSTM module.

Model architecture

Pasted below is the (pseudo)code for the model.


import torch
import torch.nn as nn
import torch.nn.functional as F


class my_network(nn.Module):
    def __init__(
        self,
        vocab_size,
        output_size,
        embedding_dim,
        latent_dim,
        n_lstm,
        bidirectional=False,
        recurrent_dropout=0.3,
        dropout=0.3,
        n_aux=2,
        n_fc=1,
        n_neurons=None,
    ):
 
        super(my_network, self).__init__()

        self.output_size = output_size
        self.n_lstm = n_lstm
        self.latent_dim = latent_dim
        bidirectional = bool(bidirectional)
        self.bidirectional = bidirectional
        dir = 2 if bidirectional else 1  # number of LSTM directions
        self.dir = dir
        self.n_aux = n_aux
        self.n_fc = n_fc
        if n_neurons is None:
            n_neurons = (latent_dim * dir) + n_aux
        self.n_neurons = n_neurons

        self.embedding = nn.Embedding(vocab_size + 1, embedding_dim)
        self.lstm = nn.LSTM(
            embedding_dim,
            latent_dim,
            n_lstm,
            dropout=0 if n_lstm == 1 else recurrent_dropout,
            bidirectional=bidirectional,
            batch_first=True,
            bias=True,
        )
        self.batchnorm = nn.BatchNorm1d(dir * latent_dim)
        self.dropout = nn.Dropout(dropout)

        # Instantiate merger forward neural network.
        self.fc_aux = nn.Linear(n_aux, n_aux)
        self.fc = nn.ModuleList()

        for i in range(self.n_fc):
            if i == 0:
                if n_fc == 1:
                    self.fc.append(nn.Linear((latent_dim * dir) + n_aux, output_size))
                else:
                    self.fc.append(nn.Linear((latent_dim * dir) + n_aux, n_neurons))
            elif i == n_fc - 1:
                self.fc.append(nn.Linear(n_neurons, output_size))
            else:
                self.fc.append(nn.Linear(n_neurons, n_neurons))

    def forward(self, seq_x, aux_x, hidden):
 
        batch_size = seq_x.size(0)

        embedding = self.embedding(seq_x)
        lstm, (h_out, cell_out) = self.lstm(embedding, hidden)
        # Use the final hidden state of the last LSTM layer for the merger.
        # Note: h_out[-1] has shape (batch, latent_dim); with bidirectional=True this is
        # only one direction and will not match the (dir * latent_dim) batchnorm below.
        lstm_out = h_out[-1]

        lstm_out = self.dropout(lstm_out)
        lstm_out = self.batchnorm(lstm_out)
        ##-- END sequential modelling branch --##

        aux_out = self.fc_aux(aux_x)

        X = torch.cat([lstm_out, aux_out], dim=1)
        X = F.relu(X)
        for i in range(self.n_fc):
            if i == self.n_fc - 1:
                # Final layer: apply the sigmoid to obtain the probability output.
                out = torch.sigmoid(self.fc[i](X))
            else:
                X = F.relu(self.fc[i](X))

        return out, (h_out, cell_out)

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        hidden = (
            weight.new_zeros(self.dir * self.n_lstm, batch_size, self.latent_dim),
            weight.new_zeros(self.dir * self.n_lstm, batch_size, self.latent_dim),
        )

        return hidden
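
For reference, a minimal forward-pass sketch with placeholder sizes (every value here is an assumption, not the real configuration):

import torch

model = my_network(vocab_size=100, output_size=1, embedding_dim=32,
                   latent_dim=64, n_lstm=2, n_aux=2, n_fc=2)
batch_size, seq_len = 16, 50
seq_x = torch.randint(1, 101, (batch_size, seq_len))  # token ids in [1, vocab_size]
aux_x = torch.randn(batch_size, 2)                    # static (auxiliary) features
hidden = model.init_hidden(batch_size)
out, hidden = model(seq_x, aux_x, hidden)
print(out.shape)                                      # (batch_size, output_size)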

Questions

Validation loss plateaus at 0.3, and the average precision score seems impossible to improve no matter what I try. I have tried tuning the network, and I have also tried:

  • using the hidden state
  • using the LSTM output (lstm_output) instead
  • a weighted loss using pos_weight in nn.BCEWithLogitsLoss (removing the sigmoid in forward accordingly; a minimal sketch follows this list)
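
The pos_weight variant from the last bullet could look roughly like this (a sketch only; it reuses model, seq_x, aux_x and hidden from the usage snippet above, assumes the final sigmoid is removed so the model returns logits, and derives pos_weight from the 0.1 ratio):

import torch
import torch.nn as nn

labels = torch.randint(0, 2, (batch_size,)).float()   # placeholder 0/1 targets
pos_weight = torch.tensor([1.0 / 0.1])                # upweight the rare positive class
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits, hidden = model(seq_x, aux_x, hidden)          # raw scores, no sigmoid applied
loss = criterion(logits, labels.unsqueeze(1))
loss.backward()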

Model auditing showed that:

  • weights are being updated at each training step
  • gradients do flow all the way back to the earliest layers and look generally healthy (a gradient-audit sketch follows this list)
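
That kind of gradient audit can be done with a short loop (a sketch, not the exact code used; it assumes loss.backward() has already been called, as in the snippet above):

for name, param in model.named_parameters():
    if param.grad is not None:
        # Print each parameter's gradient norm to confirm gradients reach the embedding/LSTM.
        print(f"{name:40s} grad norm = {param.grad.norm().item():.3e}")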

Potential signs of issues:

  • Although the biases do change over time, the weight parameters barely update, especially for the LSTM layers, e.g. as seen in the plot below:

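One way to quantify that last point is to snapshot the parameters around a single optimizer step and look at the mean absolute update per tensor (a sketch; optimizer is assumed to already exist):

# Measure how much each parameter tensor actually moves in one step.
before = {name: p.detach().clone() for name, p in model.named_parameters()}
optimizer.step()
for name, p in model.named_parameters():
    delta = (p.detach() - before[name]).abs().mean().item()
    print(f"{name:40s} mean |update| = {delta:.3e}")
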
Any advice would be very much appreciated.