I read quite a lot about the importance of word embeddings in the context of NLP, but I've never seen the following issue being addressed: Do pre-trained embeddings (word2vec, GloVe, etc.) perform better or worse than an embedding layer trained along with the model?
Intuitively, I would think that an embedding layer trained along with the model should perform better, since it's trained task-specifically by the network during back-propagation (though maybe it generalizes less well?), but I didn't really find much info on this.
I'm aware that there is probably no clear answer to this, and that it's definitely case-dependent, but I'd enjoy hearing your feedback on this matter.
I'm not an expert on NLP, but I would assume pretrained embeddings can be treated similarly to pretrained kernels in a CNN.
If you are dealing with data from the same or a similar domain (e.g. English texts / "natural" images), the pretrained parameters might just work, and probably even better than training from scratch, e.g. if you are working with a small dataset.
That being said, if your data comes from another domain (e.g. source code / medical images), the pretrained layers might not work very well, and you would need to fine-tune them or train from scratch.
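To make the fine-tune vs. freeze distinction concrete, here is a minimal PyTorch sketch. The vectors here are random stand-ins for real pretrained weights (e.g. a GloVe matrix you would load from disk); only the `freeze` flag is the point:

```python
import torch
import torch.nn as nn

# Hypothetical pretrained matrix: 5 words, 4 dimensions.
# In practice this would be loaded from a GloVe/word2vec file.
pretrained = torch.randn(5, 4)

# freeze=True keeps the vectors fixed during training;
# freeze=False lets back-propagation fine-tune them for your task.
emb_frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
emb_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(emb_frozen.weight.requires_grad)  # False
print(emb_tuned.weight.requires_grad)   # True
```

Training an embedding layer from scratch is just `nn.Embedding(num_embeddings, embedding_dim)` with no pretrained weights at all.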
In the case of NLP and embeddings, I would assume this cross-domain use case fails pretty badly, since an embedding is used as a lookup table indexed by word.
If the words you are dealing with are not in the pretrained model's dictionary, a lot of them will map to an "unknown token", which might make the embedding useless.
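The out-of-vocabulary problem is just a dictionary lookup with a fallback. A toy sketch (the vocabulary and tokens are made up for illustration):

```python
# Toy vocabulary, standing in for a pretrained model's dictionary.
# Index 0 is conventionally reserved for the unknown token.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def encode(tokens, vocab):
    """Map tokens to indices, falling back to <unk> for OOV words."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

# A domain-specific token like "torch.nn" is missing from a
# general-purpose vocabulary, so it collapses to <unk>:
print(encode(["the", "cat", "torch.nn", "sat"], vocab))  # [1, 2, 0, 3]
```

If most of your domain's tokens collapse to index 0 like this, the pretrained lookup table carries almost no information for your task.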
To add a bit more detail to @ptrblck's answer from an NLP point of view: as you've rightly pointed out, there's no clear answer, and the decision is generally task-dependent. It essentially boils down to the semantics of the word embeddings, and whether they are meaningful in the context of your task.
The most straightforward example is sentiment analysis. In Word2Vec and GloVe, two word vectors are similar if the respective words are often used in similar contexts. That means words like ugly and beautiful may have similar word vectors, since both words are used to describe appearances. These word embeddings generally do not capture the "polarity" of words. That's not just a guess; I noticed this first-hand when I trained a Word2Vec model myself and checked the most similar words of adjectives.
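You can see the polarity issue with a toy cosine-similarity check. The vectors below are made up to mimic what context-based training tends to produce: opposite-sentiment adjectives that share contexts end up close together, while an unrelated noun ends up far away:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (illustrative, not real Word2Vec output):
# "beautiful" and "ugly" both describe appearance, so context-based
# training pushes their vectors together despite opposite sentiment.
beautiful = np.array([0.9, 0.8, 0.1])
ugly = np.array([0.8, 0.9, 0.2])
table = np.array([0.1, 0.0, 0.9])

print(cosine(beautiful, ugly) > cosine(beautiful, table))  # True
```

For a sentiment task, that high similarity between opposite-polarity words is exactly what you don't want, which is one argument for fine-tuning or training embeddings on the task itself.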
I worked on a project where we compared GloVe embeddings vs. an Embedding layer for one specific NLP classification problem. We got inconclusive results. Briefly, both approaches gave similar error and accuracy results. In the end, we used an Embedding layer because it was simpler to work with (and the extra training time was manageable).