Hi All!
I’m working on an NLP problem and was wondering what the best practice is for using GloVe or other pre-trained embeddings. The main goal is to avoid data leakage.
Approach 1:
Load all GloVe vectors temporarily and keep only the ones for words in your train+test set. Add a randomly initialized “UNK” vector for words in train+test that don’t appear in the GloVe set.
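To make Approach 1 concrete, here is a minimal sketch. It assumes the GloVe file has already been parsed into a plain dict of word to vector; the toy `glove` dict, the token lists, and the `build_table` helper are all made up for illustration, and plain Python lists stand in for real arrays.

```python
import random

UNK = "UNK"

def build_table(vocab, glove, dim, seed=0):
    """Hypothetical helper: row 0 is a random UNK vector, GloVe hits follow."""
    rng = random.Random(seed)
    index = {UNK: 0}
    rows = [[rng.gauss(0.0, 0.1) for _ in range(dim)]]
    for word in sorted(vocab):
        # Keep only vocabulary words that actually have a GloVe vector.
        if word in glove and word not in index:
            index[word] = len(rows)
            rows.append(glove[word])
    return index, rows

# Toy stand-in for the parsed GloVe file (made-up values).
glove = {"cat": [1.0, 1.0], "dog": [2.0, 2.0], "fish": [3.0, 3.0]}
train_tokens = ["cat", "dog", "blorp"]  # "blorp" has no GloVe vector
test_tokens = ["fish", "zzz"]           # "zzz" has no GloVe vector

# Approach 1: the vocabulary is the union of train and test words.
index, rows = build_table(set(train_tokens) | set(test_tokens), glove, dim=2)
test_ids = [index.get(w, index[UNK]) for w in test_tokens]
```

Here “fish” keeps its own GloVe row even though it only shows up in test, while “zzz” falls back to the “UNK” row.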
Approach 2:
Load all GloVe vectors temporarily and keep only the ones for words in your train set. Add a randomly initialized “UNK” vector, which is the default for any test-set word that doesn’t appear in train. The downside is that test words which do have a GloVe vector still get mapped to “UNK”, so those existing vectors go unused.
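A minimal sketch of Approach 2, under the same assumptions: the toy `glove` dict and token lists are made up, and GloVe is assumed to be already parsed into a dict. The only real point is that the lookup table is built from train words alone.

```python
import random

UNK = "UNK"
DIM = 2

# Toy stand-in for the parsed GloVe file (made-up values).
glove = {"cat": [1.0, 1.0], "dog": [2.0, 2.0], "fish": [3.0, 3.0]}
train_tokens = ["cat", "dog", "blorp"]
test_tokens = ["fish", "zzz"]

rng = random.Random(0)
index = {UNK: 0}
rows = [[rng.gauss(0.0, 0.1) for _ in range(DIM)]]  # random UNK vector

# Approach 2: only train words are allowed into the table.
for word in sorted(set(train_tokens)):
    if word in glove and word not in index:
        index[word] = len(rows)
        rows.append(glove[word])

# Every test word outside the train vocabulary falls back to UNK.
test_ids = [index.get(w, index[UNK]) for w in test_tokens]

# "fish" has a GloVe vector, but it was never copied into the table.
fish_vector_wasted = "fish" in glove and "fish" not in index
```

In this toy run both test words end up on the “UNK” row, including “fish”, which is exactly the case that bothers me about this approach.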
Approach 3:
Load all GloVe vectors permanently and add an “UNK” vector like above for any word that isn’t in the GloVe set.
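And a minimal sketch of Approach 3, again assuming GloVe is already parsed into a dict (the toy `glove` values, the token lists, and the `lookup` helper are hypothetical). The table is built once from the whole GloVe vocabulary and never looks at train or test.

```python
import random

UNK = "UNK"
DIM = 2

# Toy stand-in for the parsed GloVe file (made-up values).
glove = {"cat": [1.0, 1.0], "dog": [2.0, 2.0], "fish": [3.0, 3.0]}

rng = random.Random(0)
index = {UNK: 0}
rows = [[rng.gauss(0.0, 0.1) for _ in range(DIM)]]  # random UNK vector

# Approach 3: keep every GloVe word, independent of the dataset split.
for word, vector in glove.items():
    if word not in index:
        index[word] = len(rows)
        rows.append(vector)

def lookup(tokens):
    """Map tokens to row ids; anything missing from GloVe becomes UNK."""
    return [index.get(t, index[UNK]) for t in tokens]

train_ids = lookup(["cat", "dog", "blorp"])
test_ids = lookup(["fish", "zzz"])
```

The trade-off in this sketch is memory: the table has one row per GloVe word rather than one per dataset word, which is large for the full GloVe files.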
Really looking for some insights!