Various questions about LSTMs that I have not found answers to

Hi,

I am new to working with LSTMs and I have several questions that I have not found answers to. If you have answers to any of the following, please respond.

(1):
Given an input of shape (batch_size, seq_len, 1) passed through
lstm = LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True), the output (output, _ = lstm(input)) has shape (batch_size, seq_len, hidden_size).
(1.1):
If I then want to make a prediction from this using a fully connected layer, if I understand correctly, I only need the final timestep of the sequence. Is this correct?
(1.2):
So this would be fully_connected(output[:, -1, :]). Is this correct?
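For concreteness, here is roughly what I mean (a minimal sketch; the sizes are placeholders):

import torch
from torch import nn

batch_size, seq_len, hidden_size = 4, 10, 8
lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
fully_connected = nn.Linear(hidden_size, 1)

x = torch.randn(batch_size, seq_len, 1)         # (batch, time, features)
output, _ = lstm(x)                             # (batch, time, hidden)
prediction = fully_connected(output[:, -1, :])  # only the final timestep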

(2):
If sequences in a batch have different lengths, I can store the sequences in a list: batch = [sequence_1, ..., sequence_n], where each sequence_i is a tensor of shape (seq_length_i, 1). This can be wrapped as follows: batch = torch.nn.utils.rnn.pack_sequence(batch), and then passed through the LSTM described in Q(1). The output, however, is still a PackedSequence, and I want the output features of each sequence in the batch. I have written the following code:
pad_packed_sequence(output)[0][-1, :, :]
(2.1):
My interpretation of the above code is that I am extracting the final output of the LSTM for each sequence in the batch (plus padding). Is this right?
(2.2):
More theoretically: if I pass a packed sequence through an LSTM, will each sequence be processed independently? I.e. will my output be equivalent to the output I would get if I passed each sequence through independently and then concatenated the outputs?
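For reference, here is roughly what I am doing (a minimal sketch; the sizes are placeholders):

import torch
from torch import nn
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)

# sequences of different lengths, each of shape (seq_len_i, 1), longest first
batch = [torch.randn(6, 1), torch.randn(4, 1), torch.randn(3, 1)]
packed = pack_sequence(batch)

output, _ = lstm(packed)                       # still a PackedSequence
padded, lengths = pad_packed_sequence(output)  # (max_len, batch, hidden) by default
last = padded[-1, :, :]                        # the extraction from above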

(3):
If I have a set of sequences with different lengths that I want to process with a 1D convolution before passing them through an LSTM, is there any way I can pass a batch of sequences through such a network? Something of the form: rnn.pack_sequence([seq_1, ..., seq_n]) -> Conv1D -> LSTM -> Linear. Of course, a PackedSequence cannot be passed through Conv1D, so is there an alternative?

(4):
If I have an architecture that looks like the following: x -> LSTM -> LSTM -> Linear -> prediction, is it common practice to apply an activation function on the outputs of the LSTMs?

Thanks in advance!!

P.S. If there are any tutorials out there, I’d also be happy if you could share those. I have done some digging and it seems like there are very few resources. It would be great if there were more end-to-end tutorials by PyTorch.

  1. correct
  2. I think this is incorrect: you’ll receive padding values for the shorter sequences, so you have to do something like gather(output, indexes=(seq_lens-1).expand(…), dim=time) (see the sketch after this list). If you used a GRU, you could just use the final hidden states (they individually stop changing as soon as a sequence ends and are returned without a time dimension), but for an LSTM the hidden state != the output (usually denoted c and h).
    2.2) Not sure I understand, but packed, batched sequences are still processed together, for efficiency reasons. It is more of an implementation detail.
  3. You should use “causal convolutions” for time series, i.e. the convolution outputs should be shifted so as to avoid information leaking into the “future” (there is a causal-padding sketch after this list). I can’t suggest an elegant solution for using convolutions with variable-length sequences…
  4. RNN outputs usually already come from a non-linear activation function, so no. It is possible to insert a LayerNorm layer there instead.
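To make the gather() point concrete, here is a minimal sketch (the layer sizes and the single-layer, unidirectional LSTM are assumptions for the example, not anything prescribed):

import torch
from torch import nn
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)

# variable-length sequences, longest first (or pass enforce_sorted=False)
seqs = [torch.randn(5, 1), torch.randn(3, 1), torch.randn(2, 1)]
packed = pack_sequence(seqs)

packed_out, (h_n, c_n) = lstm(packed)
out, lengths = pad_packed_sequence(packed_out, batch_first=True)  # (batch, max_len, hidden)

# pick each sequence's own last timestep instead of the padded final position
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(2))  # (batch, 1, hidden)
last = out.gather(1, idx).squeeze(1)                           # (batch, hidden)

# for a single-layer, unidirectional LSTM this matches h_n[-1]
assert torch.allclose(last, h_n[-1])

And for point 3, the causal-padding idea in isolation (just an illustration, not a solution for variable-length batches):

kernel_size = 3
causal_conv = nn.Sequential(
    nn.ConstantPad1d((kernel_size - 1, 0), 0.0),  # pad on the left only
    nn.Conv1d(in_channels=1, out_channels=4, kernel_size=kernel_size),
)
x = torch.randn(2, 1, 10)   # (batch, channels, time)
y = causal_conv(x)          # (batch, 4, 10): same length, output[t] never sees input[t+1:]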

Thanks for your detailed response!

2.1:
I understand the first remark (about the sequence lengths), but I don’t see how I could make use of these hidden states. The output of the LSTM is output, (h_n, c_n), and the output of the GRU is output, h_n. I thought output was the only relevant tensor for learning tasks; for example, we can apply fully connected layers to it. How would I make use of h_n?

2.2:
Okay, so yes, I want them to be processed together so that it is efficient, but I want the output to be as if I had not passed them together. For example, if I have two sequences Seq1 and Seq2 and pass them independently through the same LSTM, the outputs would be LSTM(Seq1) and LSTM(Seq2). Can I extract the same outputs if I pass the sequences as a packed sequence (just like I can extract the individual outputs of a batch when passing a batch of inputs through a standard fully connected layer)?

3:
I see, I had not heard about these until now. I will look into this!

Edit: do you recommend using GRU instead of LSTM?
Edit2: I have verified 2.2. This is correct and no longer needs an answer.

“output” is just h unrolled in time. So, actually, I think gather() is rarely needed with LSTMs either, as you can just use h_n (see the sketch below).
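For example (a minimal sketch, assuming a single-layer, unidirectional LSTM; the linear head is just a placeholder):

import torch
from torch import nn
from torch.nn.utils.rnn import pack_sequence

lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)

batch = [torch.randn(6, 1), torch.randn(4, 1), torch.randn(3, 1)]
output, (h_n, c_n) = lstm(pack_sequence(batch))

# h_n has shape (num_layers * num_directions, batch, hidden); h_n[-1] is the last
# layer's hidden state at each sequence's own final timestep, so no padding to deal with
prediction = head(h_n[-1])  # (batch, 1)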

do you recommend using GRU instead of LSTM?

It is subjective, but I think a GRU is good enough for “collapsing” tasks. It is “naked”, having no output gate, so when you use the “output” from all timesteps, an LSTM may work better in theory.

2.2. Neither batching nor packing affects the output (except in the padding areas).
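If you want to double-check it yourself, a quick sanity check looks something like this (just a sketch):

import torch
from torch import nn
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)
seq1, seq2 = torch.randn(5, 1), torch.randn(3, 1)

# each sequence on its own (batch dimension of 1)
out1, _ = lstm(seq1.unsqueeze(0))
out2, _ = lstm(seq2.unsqueeze(0))

# both sequences packed together
packed_out, _ = lstm(pack_sequence([seq1, seq2]))
padded, lengths = pad_packed_sequence(packed_out, batch_first=True)

assert torch.allclose(padded[0, :5], out1[0], atol=1e-6)
assert torch.allclose(padded[1, :3], out2[0], atol=1e-6)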

Okay great, you have cleared up everything. Thanks!

Just a follow-up question that is more general; I am asking because my framework is virtually failing to learn.

At some stage in the framework, the inputs in a batch have different sequence lengths, so I have decided to pass the inputs through a linear layer in a for loop, i.e. each element in the batch goes through the linear layer iteratively. Afterward, I pack the sequences to pass them through an LSTM. Is there any problem with passing n elements of shape (1, seq_len_i, H) through the forward pass one at a time (other than computational efficiency)?

No, it is just a transformation of (*, H) vectors regardless. Still, it is better to transform a padded (batch, time, H) tensor and then use pack_padded_sequence (see the sketch below).
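For example, something along these lines (a sketch; the sizes and names are placeholders):

import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

H = 8
linear = nn.Linear(H, H)
lstm = nn.LSTM(input_size=H, hidden_size=H, batch_first=True)

seqs = [torch.randn(6, H), torch.randn(4, H), torch.randn(3, H)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)  # (batch, max_len, H)
transformed = linear(padded)                   # Linear acts on the last dim, no loop needed
packed = pack_padded_sequence(transformed, lengths, batch_first=True, enforce_sorted=False)
output, (h_n, c_n) = lstm(packed)

(The padding positions also go through the linear layer, but packing means the LSTM never sees them.)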

RNNs are just hard to train for complex tasks and subtle patterns.