Are you training your fastText model on character-level data instead of word-level data? If so, I think you'd get the desired result, but you'd probably have to preprocess the training data into a format that fastText accepts.
The problem is that I don't have access to a PyTorch version of fastText, so I can't add the whole model as an embedding layer and train it. I have only trained fastText embeddings as described here. Is it possible to use the .bin file to continue fine-tuning the whole fastText model?
This might be easier if you explained with an example what you've done and what you haven't. fastText basically creates an nn.EmbeddingBag weight matrix; all the other stuff is basically lookups into its various components. Have you compiled the fastText bindings for Python? If so, you can put the weight matrix into a normal nn.EmbeddingBag.
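To make that concrete, here's a rough sketch of moving the fastText input matrix into an `nn.EmbeddingBag`. The Python bindings expose `get_input_matrix()` on a loaded model; since I can't ship a trained `.bin` here, a random matrix stands in for the real weights, and the lookup indices below are made up:

```python
import numpy as np
import torch
import torch.nn as nn

# With the official fasttext bindings you would do:
#   import fasttext
#   model = fasttext.load_model("model.bin")
#   mat = model.get_input_matrix()  # (n_words + n_buckets, dim) numpy array
# A random matrix stands in for the trained weights here.
mat = np.random.rand(1000, 100).astype(np.float32)

# EmbeddingBag averages the looked-up rows, which matches fastText's
# mean over the word vector and its subword vectors.
embbag = nn.EmbeddingBag.from_pretrained(
    torch.from_numpy(mat), mode="mean", freeze=False
)

# Look up one "word" whose (made-up) word/subword row indices are 3, 17, 42.
ids = torch.tensor([[3, 17, 42]])
out = embbag(ids)
print(out.shape)  # torch.Size([1, 100])
```

With `freeze=False` the matrix is a trainable parameter, so you can keep fine-tuning it inside a larger PyTorch model.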
fastText also trains vectors for character n-grams (subwords), so your lookup will pick up partial words too. With an n-gram size of 3, the word "there" returns ["there", "the", "her", "ere"]… For unknown words you won't get the whole word, but you will get the components… Now if your word shares no components with any training word, you'd get nothing, but that's probably not going to happen if you train on enough data and the input is reasonable (i.e. the same language as your train set).
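One detail: real fastText wraps the word in `<` and `>` boundary markers before slicing, so the actual 3-grams differ slightly from the list above. A minimal sketch of the extraction:

```python
def char_ngrams(word, n=3):
    """Character n-grams the way fastText extracts them: the word is
    wrapped in boundary markers '<' and '>' before slicing."""
    w = f"<{word}>"
    return [w[i : i + n] for i in range(len(w) - n + 1)]

print(char_ngrams("there"))  # ['<th', 'the', 'her', 'ere', 're>']
```

An OOV word like "theres" still shares most of these 3-grams with "there", which is why its fastText vector ends up close by.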
So fastText doesn't return anything for OOV. But an OOV word with no partial matches would be super rare. Are you saying that the word you're looking up shares no 3-letter sequence with any of the words in your training set?
Word vectors for EVERY possible word and 3-letter sequence seen in training, or word vectors for words only? If the former, the matrix will be too big to fit in memory; if the latter, OOV words that would have a fastText representation won't get one, since they won't be saved in the self.embbag matrix. Correct?
I hope I made myself clear
I think the former. But you can also prune these matrices with fastText. Having said that…
It’d be way easier for you to create a small dummy set, train a fastText model, then test it out yourself. I’ve used fastText only a handful of times, so I wouldn’t really trust my own answers.
I am almost certain that fastText does not create a "not found" token. You could add one yourself by creating an EmbeddingBag with num_emb + 1 rows, then copying the weight matrix from fastText over appropriately.
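Something like this sketch, where a random matrix again stands in for the trained fastText weights and the zero initialisation of the extra row is an arbitrary choice:

```python
import torch
import torch.nn as nn

num_emb, dim = 1000, 100
# Pretend this is the input matrix pulled out of a trained fastText model.
ft_weights = torch.randn(num_emb, dim)

# One extra row at index num_emb serves as the "not found" token.
weights = torch.cat([ft_weights, torch.zeros(1, dim)], dim=0)
embbag = nn.EmbeddingBag.from_pretrained(weights, mode="mean", freeze=False)

UNK = num_emb  # map any word with no usable lookup to this index
out = embbag(torch.tensor([[UNK]]))
print(out)  # a zero vector, until training moves it
```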
Ok, I think I get what you mean. But I think you should ask the gensim people how they derive out-of-vocabulary vectors. Having said that, I still think that for Latin-script languages it would be almost impossible to find a word that isn't composed of any smaller sub-sequences from other words. And to get these subwords, you use the function get_subwords. You'll be able to get a representation for a word that is OOV but still contains sub-sequences from other words.