Create sentence embeddings from word embeddings with a 2-layer RNN

I have a text processing task. Each sample in my dataset consists of a sentence as its input and a real-valued score as its output.

I want to predict the score for each sentence with a 2-layer RNN.

The first layer works at the word level. It accepts a batch of sentences and produces a word representation (the hidden state) at each word position.

The second layer works at the sentence level. Its input is the average of the word representations from the first layer, and it generates sentence representations.

I have trouble understanding the input of the second layer. I have created the following network:

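import torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # embedding_dimension = 50; vocab_size is a placeholder for the vocabulary size
        self.embedding = nn.Embedding(vocab_size, 50)
        self.rnn1 = nn.RNN(input_size=50, hidden_size=50)
        self.rnn2 = nn.RNN(input_size=50, hidden_size=50)

    def forward(self, input, hidden1, hidden2):
        # input is a batch of sentences (word indices), shape (seq_len, batch)
        embedded = self.embedding(input)
        output1, hidden1 = self.rnn1(embedded, hidden1)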

I know that the size of output1 is (seq_len, batch, num_directions x hidden_size). The average of the hidden representations from rnn1 is the input to rnn2, so the size of the second layer's input is (1, batch, num_directions x hidden_size). The average is taken over the hidden representations at all time steps (number of time steps = seq_len).

    # average over the time dimension: (1, batch, num_directions x hidden_size)
    input2 = output1.mean(dim=0, keepdim=True)
    output2, hidden2 = self.rnn2(input2, hidden2)

I don't think I can feed input2 to rnn2, because its input should have size (seq_len, batch, input_size), whereas input2 has size (1, batch, num_directions x hidden_size).
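
As a quick shape check for this concern, here is a minimal sketch rather than a definitive answer. The sizes (seq_len=7, batch=4) are made up, a random tensor stands in for the embedded words, and the RNNs are assumed to be unidirectional with the default batch_first=False. It suggests that nn.RNN accepts a sequence of length 1, so the averaged tensor is at least a valid input shape; whether that is the intended design is a separate matter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only; none of these come from a real dataset.
seq_len, batch, emb_dim, hidden = 7, 4, 50, 50

rnn1 = nn.RNN(input_size=emb_dim, hidden_size=hidden)
rnn2 = nn.RNN(input_size=hidden, hidden_size=hidden)

# Random stand-in for the embedded batch: (seq_len, batch, emb_dim).
embedded = torch.randn(seq_len, batch, emb_dim)

# Initial hidden states default to zeros when omitted.
output1, hidden1 = rnn1(embedded)           # output1: (seq_len, batch, hidden)

# Average over the time dimension, keeping it as a length-1 axis.
input2 = output1.mean(dim=0, keepdim=True)  # (1, batch, hidden)

output2, hidden2 = rnn2(input2)             # output2: (1, batch, hidden)

print(output1.shape, input2.shape, output2.shape)
```

With a length-1 sequence, rnn2 takes a single step per sentence, so each sentence in the batch is processed independently of the others.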

In the first layer, I process sequences of words (sentences), feeding a batch of sentences to rnn1 each time. So the size of input1 is (seq_len = length of the longest sentence in the batch, batch = BATCH_SIZE, input_size = word_embedding_dimension). In rnn2, on the other hand, I am processing a sequence of sentences, so I think the size of input2 should be (seq_len = BATCH_SIZE, batch = 1, input_size = num_directions_of_rnn1 x hidden_size1). This inconsistency in sizes confuses me. My questions are:

(1) After averaging the hidden representations of rnn1, how can I feed the result to rnn2?

(2) Should I reshape the averaged representation?

(3) What should the size of input2 be?
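
For the layout where the sentences of a batch are treated as one sequence for rnn2, a minimal sketch could look like the following. The sizes and the random tensor standing in for output1 are placeholders, and rnn2 is assumed to be unidirectional with batch_first=False. Note that this makes each sentence representation depend on the sentences before it in the batch, which seems meaningful only if the sentences are ordered (for example, consecutive sentences of one document).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only.
seq_len, batch, hidden = 7, 4, 50

rnn2 = nn.RNN(input_size=hidden, hidden_size=hidden)

# Random stand-in for the output of rnn1: (seq_len, batch, hidden).
output1 = torch.randn(seq_len, batch, hidden)

# One averaged vector per sentence: (batch, hidden).
sentence_avg = output1.mean(dim=0)

# Treat the sentences as time steps of a single sequence:
# (seq_len=batch, batch=1, input_size=hidden).
input2 = sentence_avg.unsqueeze(1)

output2, hidden2 = rnn2(input2)     # output2: (batch, 1, hidden)

# Back to one representation per sentence: (batch, hidden).
sentence_repr = output2.squeeze(1)

print(input2.shape, output2.shape, sentence_repr.shape)
```

Here unsqueeze(1) inserts the batch axis of size 1, so no other reshaping is needed in this sketch.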

Thanks in advance.