BERT embeddings are the same for all layers

I am using a pretrained BERT model from Huggingface to get word embeddings from the hidden states.

However, the embedding vectors I get seem to be the same no matter which layer I choose, while the Flair implementation, for example, gives different results when computing the cosine similarity between words.

What am I doing wrong? Is there any processing I have to do on the hidden states to convert them to vectors, maybe normalising or something like that?

    tokens_tensor_1 = torch.tensor([indexed_tokens_1])
    segments_tensors_1 = torch.tensor([segments_ids_1])

    # move the inputs to the GPU
    tokens_tensor_1 ='cuda')
    segments_tensors_1 ='cuda')

    with torch.no_grad():
        hidden_states_1, _ = model(tokens_tensor_1)#, segments_tensors_1)

    print("tokenized_text_1:", tokenized_text_1)
    vectors = []
    for index in range(len(tokenized_text_1)):
        # hidden_states_1[0] is one layer's output; the second [0] selects the batch item
        torch_vector = hidden_states_1[0][0][index]
        torch_vector ='cpu')
        numpy_vector = torch_vector.numpy()
        vectors.append(numpy_vector)

I was lucky and found that normalising was the problem. So I did vector / np.linalg.norm(vector), and now the layers give different cosine similarities between words.
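As a minimal numpy sketch (the two vectors here are made up, standing in for token embeddings), normalising means a plain dot product between the normalised vectors already equals the cosine similarity computed from the raw vectors:

```python
import numpy as np

def normalize(v):
    # Scale a vector to unit length, so a dot product between
    # two normalized vectors is their cosine similarity.
    return v / np.linalg.norm(v)

# Hypothetical embedding vectors standing in for two token embeddings.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

# Dot product of the normalized vectors.
cos_dot = np.dot(normalize(a), normalize(b))

# Cosine similarity computed directly from the raw vectors.
cos_ref = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(cos_dot, cos_ref))  # → True
```

So normalising does not change a properly computed cosine similarity; it matters when the downstream comparison is a bare dot product, which only equals cosine similarity for unit-length vectors.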

And then I get the last layer by using:

    torch_vector = hidden_states_1[-1][0][index]