[NOT SOLVED] Training character level embeddings in FastText

Is it possible to load/train FastText weights and incorporate them into a PyTorch model, so that the character-level weights keep updating and embeddings can therefore also be trained for unknown tokens?

I had a look at this, but it seems that all the script does is load the binary weights file into a word-embedding matrix.

Are you training your fastText model with character data instead of word level data? If you did that then I think you’d get the desired result. But you’d probably have to preprocess the training data into a format that’s suitable for fastText.
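Something like this is what I mean (an untested sketch; the corpus path, dimensions and n-gram settings are just placeholders):

import fasttext

# untested sketch: train on a plain-text file, one sentence per line.
# minn/maxn set the character n-gram lengths, so subword vectors get learned too.
model = fasttext.train_unsupervised("corpus.txt", model="skipgram",
                                    minn=3, maxn=6, dim=100)
model.save_model("model.bin")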

The problem is that I do not have access to a PyTorch version of fastText that I could add to my model as an embedding layer and train. I have only trained fastText embeddings as described here. Is it possible to use the .bin file to continue fine-tuning the whole fastText model?

This might be easier if you explained with an example what you’ve done and what you haven’t done. fastText basically creates an nn.EmbeddingBag weight matrix. All the other stuff is basically lookups for the various components. Have you compiled the fastText bindings for python? If so, you can put the weight matrix into a normal nn.EmbeddingBag.

Below is an example of putting a fastText model into a dataset / dataloader.
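Roughly like this (untested; the model path and the sentences are placeholders):

import fasttext
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

ft_model = fasttext.load_model("model.bin")   # placeholder path

class SentenceDataset(Dataset):
    def __init__(self, sentences, ft):
        self.sentences = sentences
        self.ft = ft

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        tokens = self.sentences[idx].split()
        # get_word_vector falls back to character n-grams for OOV tokens
        vecs = np.stack([self.ft.get_word_vector(t) for t in tokens])
        return torch.from_numpy(vecs)

sentences = ["the cat sat on the mat", "a totally unseeable word"]
loader = DataLoader(SentenceDataset(sentences, ft_model), batch_size=1)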

But what does it actually do? Isn’t it just creating a lookup table for words? What about OOV words? Am I able to fine-tune the character-level embeddings with nn.EmbeddingBag?

If you train your fastText model with subword (character n-gram) information, then your lookup will also get partial words. So if your n-grams have length 3, the word “there” returns [“there”, “the”, “her”, “ere”]… For unknown words you won’t get the whole word, but you will get the components… Now, if your word shares no components with any training word then you’d get nothing, but that’s probably not going to happen if you train on enough data and the input is reasonable (i.e. the same language as your training set).
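You can check that with get_subwords (untested sketch; “model.bin” is a placeholder for a model trained with minn=3, maxn=3):

import fasttext

model = fasttext.load_model("model.bin")   # placeholder path

# in-vocabulary word: the word itself plus its character 3-grams
subwords, ids = model.get_subwords("there")
print(subwords)   # e.g. ['there', '<th', 'the', 'her', 'ere', 're>']
print(ids)        # row indices into model.get_input_matrix()

# OOV word: no whole-word entry, but the character n-grams are still there
oov_subwords, oov_ids = model.get_subwords("thereabouts-ish")
print(oov_subwords)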

Makes sense. However, nn.EmbeddingBag is just a regular lookup table for word vectors, isn’t it? How does it generate embeddings for words that are not in the matrix?

So fastText doesn’t return anything for OOV. But a word that is truly OOV, i.e. one that shares no partial words at all, would be super rare. Are you saying that the word you’re looking up has no 3-letter sequence in common with any of the words in your training set?

I will be more specific. Thanks for your help so far btw :slight_smile:

input_matrix = fasttext_model.get_input_matrix()  # numpy
num_emb, emb_dim = input_matrix.shape
self.embbag = nn.EmbeddingBag(num_emb, emb_dim)

What is inside this input_matrix?

Word vectors for EVERY word and 3-letter sequence seen in training, or word vectors for words only? If the former, the matrix will be too big to fit in memory; if the latter, OOV words that would have a fastText representation won’t get one, since it won’t be saved in the self.embbag matrix. Correct?
I hope I made myself clear

I think the former, though you can also prune these matrices with fastText. But having said that…

It’d be way easier for you to create a small dummy set, train a fastText model, then test it out yourself. I’ve used fastText only a handful of times, so I wouldn’t really trust my own answers. :slight_smile:
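For instance (untested; the path is a placeholder), comparing the vocabulary size with the input matrix should settle it:

import fasttext

model = fasttext.load_model("model.bin")   # placeholder path

mat = model.get_input_matrix()
print(len(model.words))   # number of in-vocabulary words
print(mat.shape[0])       # words + hashed character n-gram buckets ("bucket" parameter, 2,000,000 by default)

As far as I understand, the n-grams are hashed into a fixed number of buckets rather than stored one row per distinct n-gram, which is why the matrix stays a manageable size.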

I am almost certain that fastText does not create a “not found” token. You could do that yourself by creating an EmbeddingBag with num_emb + 1 rows and then copying the weight matrix from fastText over appropriately.

input_matrix = fasttext_model.get_input_matrix()  # numpy array, shape (num_emb, emb_dim)
num_emb, emb_dim = input_matrix.shape
# the extra row at index num_emb serves as the trainable "not found" embedding
self.embbag = nn.EmbeddingBag(num_emb + 1, emb_dim)
self.embbag.weight.data[:num_emb].copy_(torch.from_numpy(input_matrix))

I haven’t tried it but something like that might work.
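And looking words up would go roughly like this (again untested; “hello” and the fallback are just to illustrate):

# get the subword indices for a word; fall back to the extra row if nothing matches
subwords, ids = fasttext_model.get_subwords("hello")
if len(ids) == 0:
    ids = [num_emb]                       # the "not found" row added above
word_ids = torch.tensor(ids, dtype=torch.long)
offsets = torch.tensor([0], dtype=torch.long)
vec = self.embbag(word_ids, offsets)      # one pooled (mean) embedding for the word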

The point is that you are able to get the OOV embeddings using, for example, gensim. Check this issue. I’d just like to fine-tune these embeddings as well.

Ok, I think I get what you mean. But I think you should be asking the gensim people how they compute out-of-vocabulary vectors. Having said that, I still think that for Latin-based languages it would be almost impossible to find a word that isn’t composed of any smaller sub-sequences from other words. And to get these subwords, you use the get_subwords function. You’ll be able to get a representation of an OOV word, as long as it still contains sub-sequences from other words.

Thanks for your help. Let’s see if someone knows how to fine-tune the subword tokens.

Hello, I haven’t tested this yet but I believe it should fine-tune the subword embeddings: fastText/FastTextEmbeddingBag.py at master · facebookresearch/fastText · GitHub
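For reference, the idea in that file is roughly the following (an untested sketch of my own, not the file itself; the class name and model path are placeholders):

import fasttext
import torch
import torch.nn as nn

class SubwordEmbeddingBag(nn.EmbeddingBag):
    def __init__(self, model_path):
        ft = fasttext.load_model(model_path)
        input_matrix = ft.get_input_matrix()        # words + n-gram buckets
        num_emb, emb_dim = input_matrix.shape
        super().__init__(num_emb, emb_dim)
        self.weight.data.copy_(torch.from_numpy(input_matrix))
        self.ft = ft

    def forward(self, words):
        # build a flat list of subword indices plus per-word offsets
        ids, offsets = [], []
        for word in words:
            _, subword_ids = self.ft.get_subwords(word)
            offsets.append(len(ids))
            ids.extend(subword_ids.tolist())
        return super().forward(
            torch.tensor(ids, dtype=torch.long),
            torch.tensor(offsets, dtype=torch.long),
        )

# usage sketch: gradients flow into self.weight, i.e. into the subword rows
# bag = SubwordEmbeddingBag("model.bin")
# vecs = bag(["there", "somemadeupword"])   # shape (2, emb_dim)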