Hi,
I am following a seq2seq tutorial:
https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
I want to use pre-trained vectors (Word2Vec) instead of the word2index mapping used in the tutorial. I have edited the code so that it stores the vector of each word rather than its index:
from gensim.models import KeyedVectors

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS
        # Load the pre-trained vectors once here; calling get_word2vec()
        # inside addWord would re-read the whole file for every word
        self.word2vec = self.get_word2vec()

    def get_word2vec(self):
        return KeyedVectors.load_word2vec_format('Models/Word2Vec/wiki.he.vec')

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            # Store the word's 300-dim vector instead of an integer index
            self.word2index[word] = self.word2vec[word]
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
Each word vector in this Word2Vec model has 300 dimensions.
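One alternative I was also thinking about: maybe I should keep word2index as integer indices like in the tutorial, and instead copy the pre-trained 300-dim vectors into the weight matrix of the encoder/decoder embedding layers. A rough, untested sketch of what I mean (all names here are just placeholders I made up, and `lang` is assumed to be a Lang instance built from the whole corpus):

import torch
import torch.nn as nn

hidden_size = 300  # must match the Word2Vec dimensionality

# Build a weight matrix with one row per vocabulary word
weights = torch.zeros(lang.n_words, hidden_size)
for index, word in lang.index2word.items():
    if word in lang.word2vec:          # SOS, EOS and unknown words stay zero vectors
        weights[index] = torch.tensor(lang.word2vec[word])

embedding = nn.Embedding.from_pretrained(weights, freeze=False)
# then use this embedding inside EncoderRNN / DecoderRNN instead of
# nn.Embedding(input_size, hidden_size)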
- Is this the right way to use the pre-trained vectors?
- Do I need to change anything else in my encoder/decoder network?
Thank you!