If it helps, you can have a look at my code for that. You only need the create_embedding_matrix method; generate_embedding_matrix was my initial solution, but there’s no need to load and store all word embeddings, since you only need those that match your vocabulary.
max_index reflects the information from your vocabulary, with word_to_index mapping each word to a unique index from 0..max_index (now that I’ve written it, you probably don’t need max_index as an extra parameter). I use my own implementation of a vectorizer, but torchtext should give you similar information.
A full example of how it works can be seen in this notebook.
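Roughly, the idea looks like this (a minimal sketch, not the exact code from the notebook; glove_path and embedding_dim are just placeholder names):

```python
import numpy as np

def create_embedding_matrix(glove_path, word_to_index, embedding_dim):
    # One row per vocabulary index; words without a GloVe vector keep
    # all-zero rows (random initialization would also work).
    num_rows = max(word_to_index.values()) + 1
    embedding_matrix = np.zeros((num_rows, embedding_dim), dtype=np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if word in word_to_index:  # keep only words in your vocabulary
                embedding_matrix[word_to_index[word]] = np.asarray(values, dtype=np.float32)
    return embedding_matrix
```

The resulting matrix can then go straight into an embedding layer, e.g., nn.Embedding.from_pretrained(torch.from_numpy(embedding_matrix)).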
To your other questions:
- There’s hardly ever one best solution out there, and new types of embeddings are proposed on practically a weekly basis. My tip would be: just get something running, see how it works, and then try different alternatives to compare.
- Of course you can get the embedding for a specific word. That’s essentially the content of the GloVe files. Each line contains first the word and then the n values of the embedding vector (with n being the vector size, e.g., 50, 100, 300); see the sketch after this list.
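For example, reading a single word’s vector straight from the file could look like this (a minimal sketch; get_glove_vector is just an illustration, and the file name assumes the standard glove.6B.50d.txt download):

```python
def get_glove_vector(glove_path, target_word):
    # Each line is "<word> v1 v2 ... vn"; scan until the word matches.
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] == target_word:
                return [float(v) for v in parts[1:]]
    return None  # word is not in this GloVe vocabulary

vector = get_glove_vector("glove.6B.50d.txt", "coffee")
print(len(vector))  # 50, matching the 50-dimensional file
```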