I have a word2vec model whose pretrained weights I loaded into an embedding layer.
However, I'm currently stuck trying to align the indexes of the torchtext vocab field with the indexes of my pretrained weights.
I loaded the pretrained vectors successfully:
import gensim
import torch
import torch.nn as nn
model = gensim.models.Word2Vec.load('path to word2vec model')
word_vecs = torch.FloatTensor(model.wv.syn0)         # pretrained weight matrix, shape (vocab_size, embed_dim)
embedding = nn.Embedding.from_pretrained(word_vecs)  # from_pretrained is a classmethod, no need to construct an Embedding first
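As a quick sanity check (a sketch only, assuming gensim 3.x, which matches the .syn0 attribute above, and that 'hello' is in the model's vocab as in my example below), row i of the embedding should be exactly the vector gensim stores for word i:
idx = model.wv.vocab['hello'].index                      # gensim's own index for 'hello'
print(torch.allclose(embedding.weight[idx],
                     torch.FloatTensor(model.wv['hello'])))  # True -- rows follow gensim's indexing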
However, I'm stuck on getting torchtext's build_vocab to produce (or be mapped to) the same indexes as my word2vec model.
i.e. if I do text.build_vocab(training_data)
I could get an stoi like the following:
<unk> : 0
<pad> : 1
hello: 2
world: 3
bye: 4
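(For completeness, this is roughly how that vocab gets built; a sketch assuming `text` is a legacy torchtext Field and `training_data` is a dataset or an iterable of token lists, the toy data here is just a placeholder for my real data:)
from torchtext.data import Field   # torchtext.legacy.data.Field on newer torchtext releases
text = Field(tokenize=str.split)                # adds <unk> and <pad> automatically
training_data = [['hello', 'world', 'bye']]     # toy stand-in for my real dataset
text.build_vocab(training_data)
print(text.vocab.stoi)   # word -> index; exact order depends on token frequencies
print(text.vocab.itos)   # index -> word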
But the problem is that in my word2vec embedding the weights sit at different indexes, i.e. each string maps to a different index than the one torchtext assigns.
i.e. in my word2vec index, assuming my embedding dimension is 2:
good: 0 => [0.34, 0.56]
bye: 1 => [0.34, 0.47]
day: 2 => [0.98, 0.67]
morning: 3 => [0.43, 0.67]
all: 4 => [0.96, 0.76]
hello: 68 => [0.12, 0.34]
world: 50 => [0.28, 0.96]
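(That mapping is just gensim's own word-to-index table; a minimal sketch of reading it off the model, again assuming gensim 3.x -- in 4.x it would be model.wv.key_to_index / model.wv.index_to_key instead:)
w2v_stoi = {word: v.index for word, v in model.wv.vocab.items()}   # gensim word -> row index
print(w2v_stoi['hello'])           # e.g. 68 in my example above
print(model.wv.index2word[68])     # 'hello' -- row 68 of word_vecs holds its vector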
So the problem is that when torchtext numericalizes the tokens, its indexes do not align with the indexes of the word2vec model, and the wrong embeddings are looked up.
i.e.
sample input = "hello world bye"
torchtext output index => [2, 3, 4]
embedding output => [[0.98, 0.67], [0.43, 0.67], [0.96, 0.76]]
but it should be:
torchtext output index => [68,50,1]
embedding output => [[0.12, 0.34],[0.28, 0.96], [0.34, 0.47]]
I would be grateful for any solutions or suggestions to get this working properly. I wanted to avoid having to do the word2index conversion myself and instead leverage torchtext's build_vocab, because it takes care of the padding and unknown tokens along with many other conveniences.
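For context, the manual realignment I'm hoping to avoid looks roughly like this (a sketch only, assuming gensim 3.x and the `text` Field from above; rows for <unk>, <pad> and out-of-vocabulary words are simply left at zero):
aligned = torch.zeros(len(text.vocab), model.wv.vector_size)  # rows ordered by torchtext's stoi
for word, i in text.vocab.stoi.items():
    if word in model.wv.vocab:                                # skip <unk>, <pad>, OOV tokens
        aligned[i] = torch.FloatTensor(model.wv[word])
embedding = nn.Embedding.from_pretrained(aligned)             # now torchtext index 2 really fetches 'hello' (per the example stoi above)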
Cheers!