BERT embeddings are the same for all layers

I am using a pretrained BERT model from Huggingface to get word embeddings from the hidden states.

However, the embedding vectors I get seem to be the same no matter which layer I choose, while the Flair implementation, for example, gives different results when computing the cosine similarity between words.

What am I doing wrong? Is there any processing I have to do on the hidden states to convert them to vectors, maybe normalising or something like that?

    tokens_tensor_1 = torch.tensor([indexed_tokens_1])
    segments_tensors_1 = torch.tensor([segments_ids_1])

    # move the inputs to the GPU
    tokens_tensor_1 ='cuda')
    segments_tensors_1 ='cuda')

    with torch.no_grad():
        hidden_states_1, _ = model(tokens_tensor_1)#, segments_tensors_1)

    print("tokenized_text_1:", tokenized_text_1)
    vectors = []
    for index in range(len(tokenized_text_1)):
        # hidden_states_1[0] is one layer's output; the second [0] selects the batch item
        torch_vector = hidden_states_1[0][0][index]
        torch_vector ='cpu')
        numpy_vector = torch_vector.numpy()
        vectors.append(numpy_vector)

I was lucky and found that normalising was the problem. So I did vector / np.linalg.norm(vector), and now the layers give different cosine similarities between words.
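As a minimal numpy sketch (the two vectors here are made up, standing in for token embeddings), normalising means a plain dot product between the normalised vectors already equals the cosine similarity computed from the raw vectors:

```python
import numpy as np

def normalize(v):
    # Scale a vector to unit length, so a dot product between
    # two normalized vectors is their cosine similarity.
    return v / np.linalg.norm(v)

# Hypothetical embedding vectors standing in for two token embeddings.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

# Dot product of the normalized vectors.
cos_dot = np.dot(normalize(a), normalize(b))

# Cosine similarity computed directly from the raw vectors.
cos_ref = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(cos_dot, cos_ref))  # → True
```

So normalising does not change a properly computed cosine similarity; it matters when the downstream comparison is a bare dot product, which only equals cosine similarity for unit-length vectors.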

And then I get the last layer by using:

    torch_vector = hidden_states_1[-1][0][index]