Correct way to connect an RNN to a Linear layer

Hi

So I’ve been studying language models, and I’m confused about the proper way to connect the output of an RNN to a Linear layer.

I’m aware that the method will differ based on the use case.

For example, in a many-to-one task like sentiment analysis, feeding the final hidden state to the Linear layer may be sufficient.

In the case of sequence generation, I can think of three possible architectures:

class Net1(nn.Module):
    # Many-to-one: feed only the last layer's final hidden state to the Linear layer.
    def __init__(self, n_inputs, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.LSTM(n_inputs, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_inputs)

    def forward(self, x, hidden):
        x, (hs, cs) = self.rnn(x, hidden)
        # hs of shape (num_layers, batch_size, hidden_size)
        x = self.fc(hs[-1])  # taking the hidden state of the last layer
        return x  # x of shape (batch_size, n_inputs)

class Net2(nn.Module):
    # Many-to-many: apply the Linear layer to the output at every time step.
    def __init__(self, n_inputs, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.LSTM(n_inputs, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_inputs)

    def forward(self, x, hidden):
        x, (hs, cs) = self.rnn(x, hidden)
        # hs of shape (num_layers, batch_size, hidden_size)
        # x of shape (batch_size, seq_len, hidden_size)
        x = self.fc(x.reshape(-1, self.hidden_size))
        return x  # x of shape (batch_size*seq_len, n_inputs)

class Net3(nn.Module):
    # Concatenate the outputs of all time steps and feed them to one Linear layer.
    def __init__(self, n_inputs, hidden_size, seq_len):
        super().__init__()
        self.hidden_size = hidden_size
        self.seq_len = seq_len
        self.rnn = nn.LSTM(n_inputs, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size * seq_len, n_inputs)

    def forward(self, x, hidden):
        x, (hs, cs) = self.rnn(x, hidden)
        # hs of shape (num_layers, batch_size, hidden_size)
        # x of shape (batch_size, seq_len, hidden_size)
        x = self.fc(x.reshape(-1, self.hidden_size * self.seq_len))
        return x  # x of shape (batch_size, n_inputs)
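To make the difference concrete, here is a quick shape check of the three reductions, using the raw LSTM output directly instead of the classes above (the toy dimensions 4/10/65/128 are made up for illustration):

```python
import torch
import torch.nn as nn

batch_size, seq_len, n_inputs, hidden_size = 4, 10, 65, 128

rnn = nn.LSTM(n_inputs, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, n_inputs)                  # Net1 / Net2 style
fc_flat = nn.Linear(hidden_size * seq_len, n_inputs)   # Net3 style

x = torch.randn(batch_size, seq_len, n_inputs)
out, (hs, cs) = rnn(x)   # out: (4, 10, 128), hs: (1, 4, 128)

y1 = fc(hs[-1])                                        # Net1: last hidden state
y2 = fc(out.reshape(-1, hidden_size))                  # Net2: every time step
y3 = fc_flat(out.reshape(-1, hidden_size * seq_len))   # Net3: all steps at once

print(y1.shape, y2.shape, y3.shape)
```

So Net1 and Net3 produce one prediction per sequence, while Net2 produces one prediction per time step.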


From my understanding of character-based text generation, we predict the next character based on the previous seq_len characters.

Meaning the input X should be of shape (batch_size, seq_len, n_inputs),
the output of the net should be of shape (batch_size, n_inputs),
and Y should be of shape (batch_size,) (the indices of the next chars).

But for the Net2 architecture to work out, doesn’t that mean it’s predicting the next seq_len characters?
And that would mean Y should be of shape (batch_size, seq_len) → (batch_size*seq_len,).
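Yes, that is exactly how the targets line up. A minimal sketch of both cases, assuming nn.CrossEntropyLoss with integer class indices (the dimensions are made up):

```python
import torch
import torch.nn as nn

batch_size, seq_len, n_chars = 4, 10, 65
loss_fn = nn.CrossEntropyLoss()

# Net1-style: one prediction per sequence.
logits_one = torch.randn(batch_size, n_chars)               # (batch, n_chars)
y_one = torch.randint(0, n_chars, (batch_size,))            # next-char indices
loss1 = loss_fn(logits_one, y_one)

# Net2-style: one prediction per time step, targets flattened to match.
logits_seq = torch.randn(batch_size * seq_len, n_chars)     # (batch*seq_len, n_chars)
y_seq = torch.randint(0, n_chars, (batch_size * seq_len,))  # (batch*seq_len,)
loss2 = loss_fn(logits_seq, y_seq)

print(loss1.item(), loss2.item())
```

In the Net2 case each target is the input sequence shifted by one character, so the model learns the next character at every position, not just the last one.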

And for Net3, I guess it’s based on the idea that using all the hidden states could improve the accuracy of the model.
But wouldn’t that architecture require all inputs to be padded to the specified length?
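Right: because Net3’s Linear layer has in_features = hidden_size * seq_len, every batch must be padded or truncated to exactly seq_len steps, whereas Net1 and Net2 accept any length. A quick check with made-up toy dimensions:

```python
import torch
import torch.nn as nn

hidden_size, seq_len, n_inputs = 128, 10, 65
rnn = nn.LSTM(n_inputs, hidden_size, batch_first=True)
fc_flat = nn.Linear(hidden_size * seq_len, n_inputs)  # expects exactly 10 steps

short = torch.randn(2, 7, n_inputs)   # only 7 steps instead of 10
out, _ = rnn(short)                   # the LSTM itself doesn't care about length

mismatch = False
try:
    fc_flat(out.reshape(out.size(0), -1))   # 128*7 features vs expected 128*10
except RuntimeError as e:
    mismatch = True
    print("shape mismatch:", e)
```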

I would be glad if someone could help me understand this more clearly.

From my understanding of character based text generation, we predict the next character based on the previous sequence_len characters.

I think this is different from what we typically mean by text generation.

What we usually mean is that you are given a sequence of chars (x_0, x_1, …, x_n) and you want to generate another sequence of chars (y_0, y_1, …, y_n). The first char in the output sequence, y_0, depends on the input sequence (specifically, on the output of the encoder that takes X as input), but each later char depends on the previously generated char (unless you are using attention).
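That feedback loop can be sketched with an untrained toy model; the dimensions, the one-hot encoding, and the greedy argmax choice are all mine, just to illustrate how each generated char is fed back in while the hidden state is carried forward:

```python
import torch
import torch.nn as nn

n_chars, hidden_size = 65, 128   # hypothetical vocabulary size
rnn = nn.LSTM(n_chars, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, n_chars)

prompt = torch.randn(1, 5, n_chars)   # stand-in for one-hot prompt chars
out, hidden = rnn(prompt)             # encode the whole prompt first
next_logits = fc(out[:, -1])          # predict the char after the prompt

generated = []
for _ in range(20):
    idx = next_logits.argmax(dim=-1)  # greedy pick (sampling works too)
    generated.append(idx.item())
    one_hot = nn.functional.one_hot(idx, n_chars).float().unsqueeze(1)
    out, hidden = rnn(one_hot, hidden)   # feed the char back, keep the state
    next_logits = fc(out[:, -1])

print(generated)
```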

I think it kinda makes sense.

Would it mean that predicting the next character given a sequence of characters is a wrong approach?

It’s not a wrong approach per se, but it is a different problem.

Okay, thanks.
Which of the three methods do you think is the best way to connect an RNN to a Linear layer?