Hi
So I’ve been studying language models, and I’m confused about the proper way to connect the output of an RNN to a Linear layer.
I’m aware that the method will differ based on the use case.
For example, in a many-to-one task like sentiment analysis, feeding the last hidden state to the Linear layer may be sufficient.
In the case of sequence generation, though, I’ve come across a few different architectures:
import torch
import torch.nn as nn

class Net1(nn.Module):
    def __init__(self, n_inputs, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.LSTM(n_inputs, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_inputs)

    def forward(self, x, hidden):
        x, (hs, cs) = self.rnn(x, hidden)
        # hs of shape (num_layers, batch_size, hidden_size)
        x = self.fc(hs[-1])  # taking the hidden state of the last layer
        return x  # x of shape (batch_size, n_inputs)
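Just to check my understanding of Net1, this is the kind of shape sanity check I’d run (all the sizes here are made up):

batch_size, seq_len, n_inputs, hidden_size = 4, 10, 8, 16
net = Net1(n_inputs, hidden_size)
out = net(torch.randn(batch_size, seq_len, n_inputs), None)  # hidden=None starts from zeros
print(out.shape)  # torch.Size([4, 8]) -> one prediction per sequence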
class Net2(nn.Module):
    def __init__(self, n_inputs, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.LSTM(n_inputs, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_inputs)

    def forward(self, x, hidden):
        x, (hs, cs) = self.rnn(x, hidden)
        # hs of shape (num_layers, batch_size, hidden_size)
        # x of shape (batch_size, seq_len, hidden_size)
        x = self.fc(x.reshape(-1, self.hidden_size))  # apply fc at every time step
        return x  # x of shape (batch_size*seq_len, n_inputs)
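If Net2 really makes a prediction at every time step, I’d expect the targets to be flattened the same way before computing the loss. Something like this, assuming nn.CrossEntropyLoss and made-up sizes:

batch_size, seq_len, n_inputs, hidden_size = 4, 10, 8, 16
net = Net2(n_inputs, hidden_size)
x = torch.randn(batch_size, seq_len, n_inputs)
out = net(x, None)                                   # (batch_size*seq_len, n_inputs)
y = torch.randint(n_inputs, (batch_size, seq_len))   # next-char index at every step
loss = nn.CrossEntropyLoss()(out, y.reshape(-1))     # targets flattened to (batch_size*seq_len,)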
class Net3(nn.Module):
    def __init__(self, n_inputs, hidden_size, seq_len):
        super().__init__()
        self.hidden_size = hidden_size
        self.seq_len = seq_len
        self.rnn = nn.LSTM(n_inputs, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size * seq_len, n_inputs)

    def forward(self, x, hidden):
        x, (hs, cs) = self.rnn(x, hidden)
        # hs of shape (num_layers, batch_size, hidden_size)
        # x of shape (batch_size, seq_len, hidden_size)
        x = self.fc(x.reshape(-1, self.hidden_size * self.seq_len))  # concatenate all time steps
        return x  # x of shape (batch_size, n_inputs)
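And since Net3’s fc layer bakes seq_len into its input size, I’d expect it to only accept batches with exactly that many time steps. A quick check of that assumption (sizes made up again):

batch_size, seq_len, n_inputs, hidden_size = 4, 10, 8, 16
net = Net3(n_inputs, hidden_size, seq_len)
out = net(torch.randn(batch_size, seq_len, n_inputs), None)
print(out.shape)  # torch.Size([4, 8])
# net(torch.randn(batch_size, seq_len + 1, n_inputs), None)
# would raise RuntimeError: 4*11*16 elements can't be reshaped to (-1, 10*16)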
From my understanding of character-based text generation, we predict the next character based on the previous seq_len characters.
That means the input X should be of shape (batch_size, seq_len, n_inputs),
the output of the net should be of shape (batch_size, n_inputs),
and Y should be of shape (batch_size,) (the indices of the next chars, e.g. for nn.CrossEntropyLoss).
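To make that concrete, this is roughly how I imagine preparing the data (one-hot inputs are just my choice here; an nn.Embedding would work too):

import torch
import torch.nn.functional as F

n_chars = 50  # hypothetical vocabulary size; n_inputs == n_chars with one-hot inputs
seq_len = 10
corpus = torch.randint(n_chars, (1000,))  # stand-in for the index-encoded text
xs, ys = [], []
for i in range(len(corpus) - seq_len):
    xs.append(F.one_hot(corpus[i:i + seq_len], n_chars).float())
    ys.append(corpus[i + seq_len])
X = torch.stack(xs)  # (num_samples, seq_len, n_inputs)
Y = torch.stack(ys)  # (num_samples,) index of the next char for each window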
But for the Net2 architecture to work out, doesn’t that mean it’s predicting the next seq_len characters, one at every time step?
And that would mean Y should be of shape (batch_size, seq_len) → (batch_size*seq_len,), as in the loss sketch after Net2 above.
And for Net3, I guess it’s based on the idea that using the hidden states from all the time steps could improve the accuracy of the model.
Wouldn’t using that architecture mean all inputs have to be padded to a specified length?
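If padding is the answer, I assume it would look something like this with pad_sequence:

import torch
from torch.nn.utils.rnn import pad_sequence

n_inputs = 50
a = torch.randn(7, n_inputs)   # two hypothetical encoded sequences of different lengths
b = torch.randn(10, n_inputs)
batch = pad_sequence([a, b], batch_first=True)  # (2, 10, n_inputs); a is zero-padded
out = Net3(n_inputs, 16, 10)(batch, None)       # works only because both are padded to length 10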
I would be glad if someone could help me understand this more clearly.