XLM/BERT sequence outputs to pooled outputs with weighted average pooling

Let’s say I have a tokenized sentence of length 10, and I pass it to a BERT model.

bert_out = bert(**bert_inp)
hidden_states = bert_out[0]
hidden_states.shape
>>> torch.Size([1, 10, 768])

This returns a tensor of shape [batch_size, seq_length, d_model], where each token in the sequence is encoded as a 768-dimensional vector.

In TensorFlow, BERT also returns a so-called pooled output, which corresponds to a vector representation of the whole sentence.
I want to obtain it by taking a weighted average of the sequence vectors, and this is how I do it:

hidden_states.view(-1, 10).shape
>>> torch.Size([768, 10])

pooled = nn.Linear(10, 1)(hidden_states.view(-1, 10))
pooled.shape
>>> torch.Size([768, 1])
  • Is this the right way to proceed, or should I just flatten the whole thing and then apply a linear layer?
  • Any other ways to obtain a good sentence representation?
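
Not an authoritative answer, but below is a minimal sketch (using a random tensor in place of the real BERT output, and an assumed attention_mask) of a few common ways to pool the [1, 10, 768] sequence output into a single [1, 768] sentence vector. One caveat about the snippet above: view(-1, 10) reshapes by memory order, so each row of the resulting [768, 10] tensor holds 10 consecutive features of one token rather than one feature across the 10 tokens; permute(0, 2, 1) keeps tokens and features aligned before a Linear(10, 1) mixes the sequence dimension.

import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, seq_len, d_model = 1, 10, 768
hidden_states = torch.randn(batch_size, seq_len, d_model)  # stand-in for bert_out[0]
attention_mask = torch.ones(batch_size, seq_len)           # stand-in for bert_inp["attention_mask"]

# (a) Learned weighted average over tokens: permute so the Linear mixes the
#     sequence dimension (10 tokens), not 10 consecutive features.
token_mixer = nn.Linear(seq_len, 1)
pooled_a = token_mixer(hidden_states.permute(0, 2, 1)).squeeze(-1)  # [batch, d_model]

# (b) Attention-style pooling: score each token, softmax over the sequence,
#     then take the weighted sum of token vectors (independent of seq_len).
scorer = nn.Linear(d_model, 1)
scores = scorer(hidden_states).squeeze(-1)                          # [batch, seq_len]
scores = scores.masked_fill(attention_mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)
pooled_b = torch.einsum("bs,bsd->bd", weights, hidden_states)       # [batch, d_model]

# (c) Mask-aware mean pooling, a common strong baseline for sentence vectors.
mask = attention_mask.unsqueeze(-1)                                 # [batch, seq_len, 1]
pooled_c = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

print(pooled_a.shape, pooled_b.shape, pooled_c.shape)  # all torch.Size([1, 768])

Flattening everything to [1, 7680] and applying a Linear would also give a fixed-size vector, but it ties the layer to one specific sequence length; options (b) and (c) work for sequences of any length.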

The PyTorch models return a pooled output too, not only the TensorFlow ones. With return_dict=True, BertModel returns a BaseModelOutputWithPoolingAndCrossAttentions whose pooler_output field is the last hidden state of the first token (the [CLS] token that is prepended when the sequence is built with special tokens), further processed by a linear layer and a tanh activation. That is the vector BertForSequenceClassification puts its head on top of (a linear layer on top of the pooled output), e.g. for GLUE tasks.
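
For reference, a minimal sketch (assuming transformers v4+ and the bert-base-uncased checkpoint, neither of which is specified in the thread) of where the pooled output shows up next to the sequence output:

from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")

bert_inp = tokenizer("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    bert_out = bert(**bert_inp, return_dict=True)

print(bert_out.last_hidden_state.shape)  # [1, seq_len, 768] -- per-token vectors, same as bert_out[0]
print(bert_out.pooler_output.shape)      # [1, 768] -- tanh(Linear(...)) of the [CLS] hidden state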

Thanks for the information