Let’s say I have a tokenized sentence of length 10, and I pass it to a BERT model.
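For concreteness, assume `bert` and `bert_inp` come from the HuggingFace `transformers` library (the exact checkpoint shouldn't matter; `bert-base-uncased` is just an example):

```
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# A single sentence becomes a batch of size 1
bert_inp = tokenizer("The cat sat on the mat", return_tensors="pt")
```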
```
bert_out = bert(**bert_inp)
hidden_states = bert_out.last_hidden_state  # the per-token embeddings
hidden_states.shape
>>> torch.Size([1, 10, 768])
```
This returns a tensor of shape [batch_size, seq_length, d_model], where each token in the sequence is encoded as a 768-dimensional vector.
In TensorFlow, BERT also returns a so-called pooled output, which corresponds to a vector representation of the whole sentence.
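With the setup above, the PyTorch model exposes this pooled output directly as well (it is the `[CLS]` hidden state passed through a Linear + tanh head):

```
# Pooled output from the PyTorch model output object
pooled_output = bert_out.pooler_output  # [1, 768]
```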
Instead of using that, I want to obtain it by taking a weighted average of the sequence vectors, and this is how I do it:
```
hidden_states.view(-1, 10).shape
>>> torch.Size([768, 10])
pooled = nn.Linear(10, 1)(hidden_states.view(-1, 10))
pooled.shape
>>> torch.Size([768, 1])
```
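For comparison, here is a sketch of a variant that keeps the token axis explicit: `view(-1, 10)` only reinterprets the underlying memory (so each row ends up mixing tokens and features), whereas `transpose` actually swaps the axes, so that `nn.Linear(10, 1)` computes a learned weighted average over the 10 tokens:

```
import torch
import torch.nn as nn

hidden_states = torch.randn(1, 10, 768)  # stand-in for the BERT output

# [1, 10, 768] -> [1, 768, 10]: each row holds one feature across all 10 tokens
pooled = nn.Linear(10, 1)(hidden_states.transpose(1, 2))  # [1, 768, 1]
pooled = pooled.squeeze(-1)                               # [1, 768]
```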
- Is this the right way to proceed, or should I just flatten the whole thing and then apply a linear layer?
- Any other ways to obtain a good sentence representation? (One candidate is sketched below.)
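On the second question, one common alternative I'm aware of is mean pooling over the non-padding tokens, using the `attention_mask` the tokenizer returns (a sketch, assuming `bert_inp` and `hidden_states` from above):

```
# Mean pooling: average the token vectors, ignoring padding positions
mask = bert_inp["attention_mask"].unsqueeze(-1).float()  # [1, 10, 1]
summed = (hidden_states * mask).sum(dim=1)               # [1, 768]
mean_pooled = summed / mask.sum(dim=1).clamp(min=1e-9)   # [1, 768]
```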