Hello,
On my current project I'm using the Google word2vec embeddings GoogleNews-vectors-negative300.bin.
However, I was surprised that a lot of words in my text are not present in the embedding (like xenophobia, submissive, etc.).
Firstly, I would like to know how I can extend an nn.Embedding
with new words. I guess I should then enable backpropagation on this part of the embedding so that the new vectors get learned.
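Here is a rough, untested sketch of what I have in mind (num_new_words and the random initialisation are just placeholders, and I'm not sure this is the right approach):

import torch
import torch.nn as nn

# pretrained matrix built from the word2vec vectors attached to the vocab below
pretrained_matrix = torch.FloatTensor(text_field.vocab.vectors)
num_new_words = 2  # e.g. xenophobia, submissive

# append randomly initialised rows for the new words
new_rows = torch.randn(num_new_words, pretrained_matrix.size(1)) * 0.01
full_matrix = torch.cat([pretrained_matrix, new_rows], dim=0)

# freeze=False so the weights can be updated by backpropagation
embedding = nn.Embedding.from_pretrained(full_matrix, freeze=False)

# if only the new rows should be trained, zero the gradient on the pretrained part
pretrained_count = pretrained_matrix.size(0)
def keep_pretrained_fixed(grad):
    grad = grad.clone()
    grad[:pretrained_count] = 0
    return grad
embedding.weight.register_hook(keep_pretrained_fixed)

Is this the right way to go about it?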
Secondly, I don't know why, but I need to go through gensim to load the embedding. Indeed, the following:
from torchtext import data, vocab
from torchtext.data import TabularDataset
import torch
import torch.nn as nn

text_field = data.Field(sequential=True, tokenize=_tokenize_str)
dataset = TabularDataset(
    path='mydata.csv',
    format='csv',
    fields=[('id', None), ('content', text_field)],
    skip_header=False)
text_field.build_vocab(dataset)

vectors = vocab.Vectors('/data/GoogleNews-vectors-negative300.bin.gz')
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))
does not work; instead I first have to do:
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    'data/GoogleNews-vectors-negative300.bin.gz', binary=True)
# re-save the vectors (in the default text format) so torchtext can read them
model.wv.save_word2vec_format('data/myGoogleEmbedding.bin')

vectors = vocab.Vectors('/content/drive/My Drive/ActNews/data/myGoogleEmbedding.bin')
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))
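I also wonder whether I could skip the intermediate file entirely and build the matrix straight from the gensim KeyedVectors. A rough, untested sketch of what I mean (the random initialisation for missing words is just my guess):

import gensim
import torch
import torch.nn as nn

kv = gensim.models.KeyedVectors.load_word2vec_format(
    'data/GoogleNews-vectors-negative300.bin.gz', binary=True)

dim = kv.vector_size
matrix = torch.zeros(len(text_field.vocab.itos), dim)
for i, token in enumerate(text_field.vocab.itos):
    if token in kv:
        matrix[i] = torch.from_numpy(kv[token].copy())
    else:
        matrix[i] = torch.randn(dim) * 0.01  # random init for words missing from word2vec

embedding = nn.Embedding.from_pretrained(matrix, freeze=False)

I haven't tested this, so I may be missing something about how vocab.Vectors is supposed to be used. Why does the direct loading of the .bin.gz file fail, and is gensim really required here?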
Best regards,
Barthélémy