Why is the first dimension of hidden num_layers * num_directions?

I was looking at the LSTM docs and was wondering why hidden = (h_n, c_n) has shape (num_layers * num_directions, batch, hidden_size). In particular, why is it not:

(num_layers * num_directions, seq_len, batch, hidden_size)

Oh, I realized that only the first RNN step needs an initial hidden state, so there is no seq_len dimension. Still, for each sequence in the batch we need a hidden state. So how many initial hidden states do we need? One for each sequence in the batch, one for each direction, and one for each layer, which gives num_layers * num_directions * batch states in total.

Each of those states is a vector of dimension hidden_size, so we get:

(num_layers * num_directions, batch, hidden_size)
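The reasoning above can be checked directly. A minimal sketch (the layer sizes below are arbitrary, picked just for illustration) building a bidirectional, multi-layer `nn.LSTM` and inspecting the shapes it returns:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only to make the shapes easy to read.
num_layers, hidden_size, batch, seq_len, input_size = 2, 8, 4, 5, 3

lstm = nn.LSTM(input_size, hidden_size,
               num_layers=num_layers, bidirectional=True)
num_directions = 2  # because bidirectional=True

# Default input layout (batch_first=False): (seq_len, batch, input_size)
x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

# h_n and c_n have no seq_len axis: one state per layer, direction, and
# batch element, each of dimension hidden_size.
print(h_n.shape)    # torch.Size([4, 4, 8]) == (num_layers * num_directions, batch, hidden_size)
print(c_n.shape)    # same shape as h_n

# By contrast, output keeps the seq_len axis: it holds the top layer's
# hidden state at every time step.
print(output.shape) # torch.Size([5, 4, 16]) == (seq_len, batch, num_directions * hidden_size)
```

Note the contrast with `output`, which does carry a seq_len dimension: the per-step states are returned there, while h_n and c_n hold only the final state of each layer and direction.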