Use a FastText embedding model with PyTorch

As explained in the documentation of the Embedding layer:

Embedding Layer is “a simple lookup table”.

I take this to mean it can't handle out-of-vocabulary (OOV) words. But since I already have a FastText model for my language, what's the best way to use it together with PyTorch?


I know this answer comes “a bit” late, but…

An embedding layer turns known words into embedding vectors, and yes: it is just a lookup table that maps words to their vectors (although those vectors are trained, not just assigned).
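To make the "lookup table" point concrete, here is a minimal sketch (the vocabulary size and dimensions are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# A vocabulary of 10 words, each mapped to a 4-dimensional vector.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Look up the vectors for word indices 1, 5, and 7.
indices = torch.tensor([1, 5, 7])
vectors = embedding(indices)
print(vectors.shape)  # torch.Size([3, 4])

# An index outside [0, 9] raises an error -- there is no row to look up,
# which is why nn.Embedding by itself cannot handle OOV words.
```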

IMHO, the best way, if you are worried about OOV words, is to train your embedding layer on texts from your own domain, using a corpus as large and varied as you can get. This minimizes the impact of OOV words, since the words that carry the most meaning in your domain's sentences will be well covered.
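A minimal sketch of what that looks like in practice: the embedding layer is just another trainable module inside your model, so it picks up your domain's vocabulary during training. (The model shape, sizes, and mean-pooling here are placeholders, not a recommendation.)

```python
import torch
import torch.nn as nn

class DomainClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, num_classes=2):
        super().__init__()
        # Trained end-to-end together with the rest of the model.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, seq_len, embed_dim)
        embedded = self.embedding(token_ids)
        # Average the word vectors over the sequence (bag-of-words style).
        pooled = embedded.mean(dim=1)
        return self.fc(pooled)

model = DomainClassifier()
dummy_batch = torch.randint(0, 5000, (8, 20))  # 8 sentences, 20 tokens each
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([8, 2])
```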

As suggested by @Julio_Marco_A_Silva, the best way would be to train on a custom dataset. If we still face OOV words, one way to handle them is to pass unk_init = torch.Tensor.normal_ while loading the pre-trained vectors, so that PyTorch initializes unknown words from a Gaussian distribution instead of zeros; this applies to both the train and test sets.
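A minimal sketch of that, assuming you load the vectors through torchtext (the exact module path and download behavior vary across torchtext versions):

```python
import torch
from torchtext.vocab import FastText  # path may differ in your torchtext version

# Load pre-trained FastText vectors; tokens missing from the vocabulary
# are initialized from a standard Gaussian instead of the default zeros.
vectors = FastText(language='en', unk_init=torch.Tensor.normal_)

known = vectors['hello']       # looked up from the pre-trained table
oov = vectors['xyzzyqwerty']   # not in the vocabulary: random normal init
print(known.shape, oov.shape)  # both torch.Size([300])
```

The resulting matrix can then be handed to an embedding layer with nn.Embedding.from_pretrained(vectors.vectors), so known words keep their FastText vectors and OOV words get the random initialization.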