Best Practice for GloVe Embeddings to Avoid Data Leakage?


(Daniel Dsouza) #1

Hi All!
I’m working on an NLP problem and was wondering what the best practice is w.r.t. GloVe/pre-trained embeddings. Obviously, the goal is to eliminate data leakage.

Approach 1:
Load all GloVe vectors temporarily and select the ones for the words in your train+test set. Add a randomly initialized “UNK” vector for any word in train+test that doesn’t appear in the GloVe vocabulary.

Approach 2:
Load all GloVe vectors temporarily and select the ones for the words in your train set, plus a randomly initialized “UNK” vector that is the default for test-set words that don’t appear in train. The downside is that test words which do have GloVe vectors still get mapped to “UNK”.

Approach 3:
Load all GloVe vectors permanently, plus a “UNK” vector like above for any word that doesn’t appear in the GloVe vocabulary.
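For concreteness, here is a minimal sketch of Approach 2 (train-only vocabulary); Approaches 1 and 3 would differ only in which vocabulary you pass in. All names here (`load_glove`, `build_embedding_matrix`, `EMBED_DIM`) are made up for illustration, and it assumes the standard plain-text GloVe file format:

```python
import numpy as np

EMBED_DIM = 300  # hypothetical; must match the GloVe file you load

def load_glove(path):
    """Parse a plain-text GloVe file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(train_vocab, glove):
    """Build an embedding matrix from training words only (Approach 2).

    Words missing from GloVe share a single randomly initialized UNK
    row, which is also used at test time for out-of-vocabulary words.
    """
    rng = np.random.default_rng(0)
    unk = rng.normal(scale=0.1, size=EMBED_DIM).astype(np.float32)
    matrix = np.zeros((len(train_vocab) + 1, EMBED_DIM), dtype=np.float32)
    matrix[0] = unk  # row 0 reserved for UNK / out-of-vocabulary words
    word_to_idx = {}
    for i, word in enumerate(sorted(train_vocab), start=1):
        word_to_idx[word] = i
        matrix[i] = glove.get(word, unk)
    return matrix, word_to_idx
```

At test time, any word missing from `word_to_idx` falls back to index 0, so unseen test words share the UNK row regardless of whether GloVe happens to cover them.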

Really looking for some insights!


#2

Hi Daniel,

I think tagging people is not the best way to get good answers, as it might discourage other users from posting an answer, or, as in this case, the tagged person might have no idea about your problem. :wink:
Could you please remove the tags?

Unfortunately, I’m not really that familiar with NLP use cases, so let’s wait for some NLP experts.


(Daniel Dsouza) #3

Absolutely. Sorry! I’m new to the forums! :slight_smile: