When dealing with text, you usually first build your own vocabulary to map between words and indexes, and vice versa. This could be a utility class like the following (this is a class I wrote myself):
class Vocabulary:

    def __init__(self, default_indexes={}):
        # Copy so the mutable default argument is never shared between instances
        self.default_indexes = {**default_indexes}
        self.init()

    def init(self):
        self.index_to_word = {**self.default_indexes}
        self.word_to_index = {}
        self.word_counts = {}
        self.num_words = len(self.default_indexes)
        for idx, word in self.index_to_word.items():
            self.word_to_index[word] = idx

    def index_words(self, word_list):
        for word in word_list:
            self.index_word(word)

    def index_word(self, word, cnt=None):
        if word not in self.word_to_index:
            self.index_to_word[len(self.index_to_word)] = word
            self.word_to_index[word] = len(self.word_to_index)
            self.word_counts[word] = 1 if cnt is None else cnt
            # num_words counts distinct words, so increment by 1 either way
            self.num_words += 1
        else:
            if cnt is None:
                self.word_counts[word] += 1
            else:
                self.word_counts[word] += cnt

    def get_words(self, indices):
        return [self.index_to_word[i] if i in self.index_to_word else None for i in indices]
# Testing
vocabulary = Vocabulary(default_indexes={0: '<pad>', 1: '<unk>'})
print(vocabulary.index_to_word)   # {0: '<pad>', 1: '<unk>'}
vocabulary.index_word('test')
print(vocabulary.index_to_word)   # {0: '<pad>', 1: '<unk>', 2: 'test'}
Essentially, you now have the dictionary self.word_to_index
that maps each word in your dataset to an index, e.g.:
self.word_to_index = {'<pad>': 0, '<unk>': 1, 'and': 2, 'I': 3, 'the': 4, 'be': 5, ...}
Given a sentence “I will be tired and exhausted”, you can use this dictionary to convert the sentence into a tensor, e.g., input = [3, 73, 5, 310, 2, 511]
(maybe with padding in case of batches). Now, input
is what you give to ‘self.embeddings’ – you do not give the embedding layer words!
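To sketch that conversion step, here is a minimal helper I'm adding for illustration (encode is my own name, not part of the Vocabulary class above; it assumes '<pad>' is index 0 and '<unk>' is index 1, as in the example vocabulary):

```python
def encode(sentence, word_to_index, max_len=None, pad_idx=0, unk_idx=1):
    # Map each word to its index; out-of-vocabulary words fall back to <unk>
    indices = [word_to_index.get(word, unk_idx) for word in sentence.split()]
    if max_len is not None:
        # Truncate, then pad to a fixed length so sentences can be batched
        indices = indices[:max_len]
        indices = indices + [pad_idx] * (max_len - len(indices))
    return indices

# Toy vocabulary for demonstration
word_to_index = {'<pad>': 0, '<unk>': 1, 'and': 2, 'I': 3, 'the': 4, 'be': 5}
print(encode('I will be tired', word_to_index, max_len=6))  # [3, 1, 5, 1, 0, 0]
```

Note that “will” and “tired” are not in the toy vocabulary, so both map to the <unk> index.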
In case you use pre-trained word embeddings, yes, you have to make sure that the embedding at position, say, 5 does indeed represent the word “be” with respect to your vocabulary. I do this using the following method; note that this method has word_to_index
(i.e., your vocabulary) as an input parameter:
def create_embedding_matrix(self, embeddings_file_name, word_to_index, max_idx, sep=' ', init='zeros', print_each=10000, verbatim=False):
    # Initialize the embedding matrix to handle unknown words (requires numpy imported as np)
    if init == 'zeros':
        embed_mat = np.zeros((max_idx + 1, self.embed_dim))
    elif init == 'random':
        embed_mat = np.random.rand(max_idx + 1, self.embed_dim)
    else:
        raise ValueError('Unknown method to initialize embeddings matrix')
    with open(embeddings_file_name) as infile:
        # Run through each line in the embedding file
        # Usual layout: word coef1 coef2 coef3 ... coefN
        for idx, line in enumerate(infile):
            elem = line.split(sep)
            word = elem[0]
            # If the word is not in the vocabulary, we can skip it
            if word not in word_to_index:
                continue
            # Get the index of the current word given the vocabulary
            word_idx = word_to_index[word]
            # Put the pre-trained word embedding into the "correct" position of the embedding matrix
            if word_idx <= max_idx:
                embed_mat[word_idx] = np.asarray(elem[1:], dtype='float32')
    # Return the embedding matrix
    return embed_mat
Note: max_idx
is either the largest index in your vocabulary, or a value between 0 and the largest index in case you want to restrict your vocabulary. Example usage:
embed_mat = create_embedding_matrix('glove.840B.300d.txt', word_to_index, max_idx)
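To make this runnable without the real GloVe file, here is a self-contained sketch of the same idea as a free function (embed_dim is passed explicitly instead of coming from self, and a tiny toy embeddings file stands in for glove.840B.300d.txt):

```python
import numpy as np

def create_embedding_matrix(embeddings_file_name, word_to_index, max_idx, embed_dim, sep=' '):
    # Rows default to zeros, which covers words without a pre-trained vector
    embed_mat = np.zeros((max_idx + 1, embed_dim))
    with open(embeddings_file_name) as infile:
        for line in infile:
            elem = line.rstrip().split(sep)
            word = elem[0]
            if word in word_to_index and word_to_index[word] <= max_idx:
                embed_mat[word_to_index[word]] = np.asarray(elem[1:], dtype='float32')
    return embed_mat

# Toy embeddings file in the usual "word coef1 coef2 ..." layout, with 3-dimensional vectors
with open('toy_vectors.txt', 'w') as f:
    f.write('be 0.1 0.2 0.3\nand 0.4 0.5 0.6\n')

word_to_index = {'<pad>': 0, '<unk>': 1, 'and': 2, 'be': 3}
mat = create_embedding_matrix('toy_vectors.txt', word_to_index, max_idx=3, embed_dim=3)
```

Here row 2 of mat holds the vector for “and”, row 3 the vector for “be”, and the <pad>/<unk> rows stay zero.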
Finally, I use embed_mat
to set the weights of the embedding layer of my model:
model.embedding.weight.data.copy_(torch.from_numpy(embed_mat))
if fix_embeddings:
    model.embedding.weight.requires_grad = False
else:
    model.embedding.weight.requires_grad = True
I know that nn.Embedding
now has a method from_pretrained
and there’s also torchtext
that probably makes life easier, but I prefer handling these steps “on my own”. Firstly, it’s pretty straightforward, and secondly, it makes it easier to tweak and customize those steps.
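For completeness, the from_pretrained route collapses the copy-and-freeze steps above into one call (a sketch with a random toy matrix standing in for a real embed_mat; freeze=True corresponds to fix_embeddings above and is the default):

```python
import numpy as np
import torch
import torch.nn as nn

# Toy pre-built embedding matrix: 6 words, 300-dimensional vectors
embed_mat = np.random.rand(6, 300).astype('float32')

# freeze=True sets requires_grad = False on the resulting weight
embedding = nn.Embedding.from_pretrained(torch.from_numpy(embed_mat), freeze=True)
print(embedding.weight.requires_grad)  # False
```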
I hope that helps and gets you at least on the right track.