I have been working on implementing an encoder-decoder NLP model and recently I switched from using learnable embeddings with a vocabulary of ~20,000 (all the words in my dataset) to using GloVe pre-trained embeddings with vocabulary size ~400,000 (I freeze the embedding layer).
This change has been problematic as with the larger vocabulary size now the following tensors become much larger, especially the ones in bold:
- Embedding Layer (vocab_size, embed_size)
- Decoder fully connected layer weights (hiddenstate_dim, vocab_size)
This layer transforms the hidden state of the LSTM to logits over the whole vocabulary, then they are softmaxed to get a probability distribution.
- Decoder output after fully connected layer (batch_size, max_seq_length, vocab_size)
- Decoder output after softmax (batch_size, max_seq_length, vocab_size)
- Tensors created during the backward pass
I checked the cuda memory allocated and the memory change before and after loss.backward() is very large.
This means that during the forward and backward passes, memory allocations are too high for large batch sizes (100-300). I can only go up to 15-20 and can’t train efficiently anymore.
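To put rough numbers on the tensors listed above, here is a back-of-the-envelope sketch. The values for `batch_size`, `max_seq_length`, `hidden_dim` and `embed_dim`, as well as the float32 (4 bytes per element) assumption, are illustrative guesses and not taken from the actual model; only the two vocabulary sizes come from the post.

```python
def tensor_megabytes(*shape, bytes_per_element=4):
    """Size in MB (10^6 bytes) of a dense tensor with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 1e6

# Assumed values for illustration only; the post does not state them.
batch_size, max_seq_length = 100, 30
hidden_dim, embed_dim = 512, 300

sizes = {}
for vocab_size in (20_000, 400_000):
    sizes[vocab_size] = {
        "embedding": tensor_megabytes(vocab_size, embed_dim),
        "fc_weights": tensor_megabytes(hidden_dim, vocab_size),
        # The same shape is allocated again for the softmax output.
        "logits": tensor_megabytes(batch_size, max_seq_length, vocab_size),
    }

for vocab_size, entry in sizes.items():
    print(vocab_size, entry)
```

Under these assumptions, a single logits tensor goes from about 240 MB to about 4.8 GB when the vocabulary grows from 20,000 to 400,000, before counting the softmax output or anything kept around for the backward pass.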
My questions are:
- Is this something that makes sense and that is to be expected with large embeddings?
- Am I thinking of using pre-trained embeddings right? As in, my training dataset corpus has vocab size 20,000 but the GloVe pre-trained embeddings I intend to use have 400,000 tokens. What is the standard way when using pre-trained embeddings? To use the entire weight matrix or to still only use the 20,000 embedding layer with only the words in my dataset corpus but initialise these with the relevant pre-trained GloVe-embeddings?
By the way, when using the pre-trained embeddings, I freeze the embedding layer so sparse=True does not solve anything.
Maybe I misunderstand something, but I think nothing should change when using pretrained word embeddings. When your dataset has a vocabulary of size 20k, you only need the 20k respective word embeddings.
An embedding layer is essentially just a look-up layer, so there is no need for embeddings that are never looked up :). Just create an embedding matrix of size (vocab_size, embed_dim), e.g., (20000, 300), and fill the 20k embeddings using the pretrained ones.
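A minimal sketch of that idea, assuming the pretrained vectors have already been loaded into a plain dict mapping word to vector. The helper `build_embedding_matrix`, the toy dict `toy_glove` and the 3-dimensional vectors are hypothetical stand-ins for illustration; real GloVe vectors would be parsed from the downloaded text file. Words without a pretrained vector are given a small random vector here, which is one possible choice rather than the only one.

```python
import torch
import torch.nn as nn

def build_embedding_matrix(vocab, pretrained, embed_dim, seed=0):
    """Return a (len(vocab), embed_dim) matrix and the number of words found.

    vocab:      list of words; a word's position is its index
    pretrained: dict mapping word -> list of floats (assumed already loaded)
    """
    gen = torch.Generator().manual_seed(seed)
    # Start from small random vectors so that every word without a
    # pretrained vector still gets its own distinct embedding.
    weights = torch.randn(len(vocab), embed_dim, generator=gen) * 0.1
    found = 0
    for idx, word in enumerate(vocab):
        vec = pretrained.get(word)
        if vec is not None:
            weights[idx] = torch.tensor(vec, dtype=torch.float32)
            found += 1
    return weights, found

# Toy stand-in for the pretrained vectors (hypothetical values).
toy_glove = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
    "dog": [0.7, 0.8, 0.9],  # not in our vocab, so never copied
}
vocab = ["<pad>", "<unk>", "the", "cat", "sat"]

weights, found = build_embedding_matrix(vocab, toy_glove, embed_dim=3)
# freeze=True keeps the whole layer fixed during training.
embedding = nn.Embedding.from_pretrained(weights, freeze=True)
```

The resulting layer has one row per word in the dataset vocabulary, so its size no longer depends on how many words the pretrained file contains.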
Thanks, this is what I was confused about and makes sense now!
So in the case of words which are present in my dataset, but not in the GloVe embeddings, what is standard?
1. fill all of these as <unk> (randomly initialised, but all the same)
2. fill each of these with a different random vector
3. fill each of these with a different random vector and somehow allow for them to be trained
If (3), would setting requires_grad=True for all the rows in the embedding layer weights which correspond to these work? Or is there another way you’d recommend?
There are no golden rules. It always depends on the task and the data.
But yeah, initializing out-of-vocabulary (OOV) words randomly is a common approach, with a different random embedding vector for each. Using the same vector for all OOV words seems counter-intuitive, since the model still needs to distinguish between the words.
Whether you set requires_grad=False is also up to you. Just try both approaches. If you want to treat OOV words and non-OOV words differently regarding requires_grad, you might have to define 2 embedding layers, one for each, where you can set requires_grad separately. I’m pretty sure there was a corresponding thread not too long ago.
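One way the two-layer idea could look, as a rough sketch rather than a recommended implementation. It assumes the vocabulary is ordered so that words with a pretrained vector get indices 0 to num_pretrained - 1 and OOV words get the indices after that. The class name `SplitEmbedding` and this index layout are made up for illustration.

```python
import torch
import torch.nn as nn

class SplitEmbedding(nn.Module):
    """Frozen pretrained rows plus a separate trainable table for OOV words."""

    def __init__(self, pretrained_weights, num_oov):
        super().__init__()
        self.num_pretrained = pretrained_weights.size(0)
        embed_dim = pretrained_weights.size(1)
        self.frozen = nn.Embedding.from_pretrained(pretrained_weights, freeze=True)
        self.trainable = nn.Embedding(num_oov, embed_dim)  # random init, trainable

    def forward(self, token_ids):
        is_oov = token_ids >= self.num_pretrained
        # Clamp so both lookups always receive valid indices; the mask
        # below decides which of the two results is actually used.
        frozen_ids = token_ids.clamp(max=self.num_pretrained - 1)
        oov_ids = (token_ids - self.num_pretrained).clamp(min=0)
        frozen_out = self.frozen(frozen_ids)
        oov_out = self.trainable(oov_ids)
        return torch.where(is_oov.unsqueeze(-1), oov_out, frozen_out)

# Toy usage: 4 "pretrained" rows (all zeros here) and 2 OOV words.
layer = SplitEmbedding(torch.zeros(4, 3), num_oov=2)
token_ids = torch.tensor([[0, 3, 4, 5]])  # the last two ids are OOV
out = layer(token_ids)
out.sum().backward()  # only layer.trainable.weight receives a gradient
```

Only the parameters of `layer.trainable` are updated by the optimizer, while the pretrained rows stay fixed.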
You might want to check how many OOV words you have (e.g., 2%) and what they look like. If there are not many and most are of no relevance (e.g., “hahahahaha” for document classification), don’t bother much and go with random, trainable or not. Of course, if you have many topic-specific words, say, chemical compounds or biological names, then you might need to be a bit more careful.
From my experience, I would not put too much importance on pretrained word embeddings, at least not in the beginning. For some tasks, they might even be counter-productive. At some point, you simply have to try different settings and see what works best for your scenario.
These were very helpful insights Chris, thanks for taking the time! Marked as solved.
Hi, using pre-trained embeddings might solve the problem with the input layer. But what about the final or output layer (the fully connected layer), which you mentioned also has many more parameters?