Bidirectional LSTM: why is the hidden state randomly initialized?

I’m looking at an LSTM tutorial. In this tutorial, the author seems to initialize the hidden state randomly before performing the forward pass.

        hidden_a = torch.randn(self.hparams.nb_lstm_layers, self.batch_size, self.nb_lstm_units)
        hidden_b = torch.randn(self.hparams.nb_lstm_layers, self.batch_size, self.nb_lstm_units)

It makes more sense to me to initialize the hidden state with zeros.

Is random initialization the correct practice?

In this case, the author is treating the initial state as a learned value (see this block of code). If the initial state is learned, it makes sense to initialize it randomly to break symmetry, just like any other parameter.

Learned initial states are atypical – most architectures I’ve come across use a zero initial state. In PyTorch, you would just omit the second argument when calling the LSTM module.
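For reference, a minimal sketch of that default behaviour (the sizes are made up; only the variable names mirror the snippet above). Omitting the state argument gives the same result as passing explicit zeros:

    import torch
    import torch.nn as nn

    # Made-up sizes; names mirror the snippet above
    nb_lstm_layers, batch_size, nb_lstm_units = 2, 4, 8

    lstm = nn.LSTM(input_size=16, hidden_size=nb_lstm_units,
                   num_layers=nb_lstm_layers, batch_first=True)
    x = torch.randn(batch_size, 10, 16)  # (batch, seq_len, features)

    # No state passed: PyTorch defaults h_0 and c_0 to zeros
    out_default, _ = lstm(x)

    # Explicit zero initialization gives the same output
    h0 = torch.zeros(nb_lstm_layers, batch_size, nb_lstm_units)
    c0 = torch.zeros(nb_lstm_layers, batch_size, nb_lstm_units)
    out_zeros, _ = lstm(x, (h0, c0))

    print(torch.allclose(out_default, out_zeros))  # True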

I can’t see the model learning the initial state. init_hidden() gets called for every call of the forward() method, i.e., for each batch. So each batch starts with a new random initial state. What exactly is learned here?

As far as I can tell, learning the initial state is done by initializing the hidden state once when creating the model; for each new batch you then only detach() the hidden state.
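Roughly this pattern (just a sketch of what I mean, not the tutorial’s code):

    import torch
    import torch.nn as nn

    class StatefulLSTM(nn.Module):
        # Sketch of a "stateful" setup: the hidden state is created once at
        # construction time and only detached between batches, so the state
        # carries over while gradients don't flow across batch boundaries.
        def __init__(self, input_size=16, hidden_size=8, num_layers=2, batch_size=4):
            super().__init__()
            self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
            self.hidden = (torch.zeros(num_layers, batch_size, hidden_size),
                           torch.zeros(num_layers, batch_size, hidden_size))

        def forward(self, x):
            out, self.hidden = self.lstm(x, self.hidden)
            # Detach so backprop doesn't reach into previous batches
            self.hidden = tuple(h.detach() for h in self.hidden)
            return out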

Creating a new random hidden state for each batch probably doesn’t hurt much – I don’t know, to be honest. But it certainly means that the same example (input sequence and target sequence/class) is trained with different initial hidden states. I’m not sure what effects this has.


Can’t be sure without consulting the author, but I think the intent was to treat the initial state as a learned value. But you’re right that the implementation doesn’t do that, since init_hidden() is called in forward() (which I missed). What they probably should’ve done is call init_hidden() once inside __build_model() and not reassign self.hidden.
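Something like this sketch is what I have in mind (my guess at the intent; the hidden_a/hidden_b names follow the snippet above, and the nn.Parameter part is my assumption of how the state would actually be learned):

    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        # Sketch: register the initial state as parameters once, so it is
        # actually learned, and don't re-sample it in forward().
        def __init__(self, input_size=16, nb_lstm_units=8, nb_lstm_layers=2, batch_size=4):
            super().__init__()
            self.lstm = nn.LSTM(input_size, nb_lstm_units, nb_lstm_layers, batch_first=True)
            # Randomly initialized once (to break symmetry) and registered as
            # nn.Parameter so the optimizer updates it.
            self.hidden_a = nn.Parameter(torch.randn(nb_lstm_layers, batch_size, nb_lstm_units))
            self.hidden_b = nn.Parameter(torch.randn(nb_lstm_layers, batch_size, nb_lstm_units))

        def forward(self, x):
            # No init_hidden() here: every batch starts from the same learned state
            out, _ = self.lstm(x, (self.hidden_a, self.hidden_b))
            return out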

Out of curiosity, I trained a simple binary classifier (LSTM with attention) on a text dataset of mine. Once with random initialization, once with zeroed initialization of the hidden state for each batch:

ZEROES
Training accuracy: 0.970, test accuracy: 0.840
Training accuracy: 0.975, test accuracy: 0.841
Training accuracy: 0.970, test accuracy: 0.840

RANDOM
Training accuracy: 0.927, test accuracy: 0.847
Training accuracy: 0.923, test accuracy: 0.845
Training accuracy: 0.926, test accuracy: 0.849

The results are not unexpected, I think. Using zeroed hidden states yields a higher training accuracy since the same sentence never starts with a different hidden state. In fact, all sentences are treated equally given that the initial hidden state is the same – I don’t think it’s important that the initial state is all zeros, just that it’s the same for each batch (even if it was set randomly at the very beginning).

The test accuracy is a tad better with random initialization. I guess one could argue that random initialization introduces some kind of regularization that avoids overfitting (lower training accuracy) but generalizes a bit better (higher test accuracy).

Disclaimer: This was just a quick-and-dirty test with a simple model and small-ish dataset. Please don’t use these results to make any deeper conclusions :).

I’ve been confused by this exact example myself – because init_hidden() is called in forward(), the initial state (per batch) is random not only during training but also during validation and testing?

It seems to me that it’s something you should call in the training loop (per batch or per epoch), but then I’m not sure what initial state you’d use for inference. You probably want to use the final state from the previous batch if you’re predicting from a windowed time-series?
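For the windowed time-series case, something like the following is what I’d try at inference time (just a sketch, assuming consecutive windows of a single series):

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=1, hidden_size=8, num_layers=1, batch_first=True)

    # Pretend these are 5 consecutive windows of one series (batch of 1, length 20)
    windows = torch.randn(5, 1, 20, 1)

    state = None  # None means PyTorch starts from zeros for the first window
    with torch.no_grad():
        for window in windows:
            out, state = lstm(window, state)
            # The final (h_n, c_n) of this window becomes the initial state of
            # the next one, so context carries across window boundaries.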