There are no golden rules. It always depends on the task and the data.
But yes, initializing out-of-vocabulary (OOV) words randomly is a common approach, though each OOV word should get its own random embedding vector. Using the same vector for all OOV words seems counter-intuitive, since the model presumably still wants to distinguish between those words.
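For example, a minimal sketch of that initialization could look like this (the `vocab`, `pretrained`, and dimensions here are toy stand-ins for your own data):

```python
import numpy as np
import torch
import torch.nn as nn

# Toy stand-ins: in practice, vocab comes from your corpus and
# pretrained from, e.g., GloVe/word2vec files.
emb_dim = 4
vocab = {"<pad>": 0, "the": 1, "cat": 2, "hahahahaha": 3}
pretrained = {
    "the": np.ones(emb_dim, dtype=np.float32),
    "cat": np.full(emb_dim, 0.5, dtype=np.float32),
}

# Give every word its *own* random vector first ...
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(len(vocab), emb_dim)).astype(np.float32)

# ... then overwrite the rows of in-vocabulary words with pretrained vectors,
# so only OOV words ("hahahahaha" here) keep their random initialization.
for word, idx in vocab.items():
    if word in pretrained:
        weights[idx] = pretrained[word]

embedding = nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
```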
Whether you set requires_grad=True or requires_grad=False is also up to you; just try both approaches. If you want to distinguish between OOV words and non-OOV words regarding requires_grad, you might have to define two embedding layers, one for each, where you can set requires_grad separately (a rough sketch follows below). I'm pretty sure there was a corresponding thread about this not too long ago.
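Here's one way that two-layer idea could look. It assumes you arrange your word indices so that all pretrained words come before all OOV words; `SplitEmbedding` and the index layout are just illustrative, not an existing API:

```python
import torch
import torch.nn as nn

class SplitEmbedding(nn.Module):
    """Frozen embedding for pretrained words, trainable embedding for OOV words.

    Assumes indices [0, num_pretrained) are pretrained words and
    [num_pretrained, num_pretrained + num_oov) are OOV words.
    """
    def __init__(self, pretrained_weights, num_oov):
        super().__init__()
        self.num_pretrained, emb_dim = pretrained_weights.shape
        self.fixed = nn.Embedding.from_pretrained(pretrained_weights, freeze=True)
        self.oov = nn.Embedding(num_oov, emb_dim)  # trainable by default

    def forward(self, indices):
        is_oov = indices >= self.num_pretrained
        # Clamp both lookups into range; rows masked out by torch.where are discarded.
        fixed_out = self.fixed(indices.clamp(max=self.num_pretrained - 1))
        oov_out = self.oov((indices - self.num_pretrained).clamp(min=0))
        return torch.where(is_oov.unsqueeze(-1), oov_out, fixed_out)

# Toy usage: 100 pretrained words (dim 8) plus 5 trainable OOV slots.
emb = SplitEmbedding(torch.randn(100, 8), num_oov=5)
out = emb(torch.tensor([[3, 101, 0]]))  # mixes pretrained and OOV indices
```

This way the pretrained vectors stay fixed while gradients only flow into the OOV rows.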
You might want to check how many OOV words you have (e.g., 2%) and what they look like. If there are not many and most are of no relevance (e.g., "hahahahaha" for document classification), don't bother much and go with random vectors, trainable or not. Of course, if you have many topic-specific words, say chemical compounds or biological names, then you might need to be a bit more careful.
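A quick count like this can tell you both things (`tokens` and `pretrained_vocab` are placeholders for your corpus and your embeddings' vocabulary):

```python
from collections import Counter

# Placeholders: tokens is your tokenized corpus,
# pretrained_vocab the set of words covered by your embeddings.
tokens = ["the", "cat", "sat", "hahahahaha", "the", "zzz"]
pretrained_vocab = {"the", "cat", "sat"}

counts = Counter(tokens)
oov = Counter({w: c for w, c in counts.items() if w not in pretrained_vocab})

print(f"OOV token rate: {sum(oov.values()) / len(tokens):.1%}")
print(oov.most_common(10))  # eyeball the most frequent OOV words
```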
From my experience, I would not put too much importance on pretrained word embeddings, at least not in the beginning. For some tasks, they might even be counter-productive. At some point, you simply have to try different settings and see what works best for your scenario.