I have created a neural network for sentiment analysis using bidirectional LSTM layers and pre-trained GloVe embeddings.
During training I noticed that the
nn.Embedding layer with the frozen embedding weights uses the whole vocabulary of GloVe:
(output of the instantiated model object)
(embedding): Embedding(400000, 50, padding_idx=0)
The embedding layer is set up like this:
self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True, padding_idx=self.padding_idx)
where embedding_matrix = glove_vectors.vectors and glove_vectors = torchtext.vocab.GloVe(name='6B', dim=50) (source).
The 400,000 is the vocabulary size of the glove_vectors object (meaning 400,000 pre-trained words in total).
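As far as I understand, freezing the embedding should mean the big matrix adds no trainable parameters at all; a minimal sketch checking this (the small random 1000×50 matrix here just stands in for the real 400,000×50 GloVe matrix):

```python
import torch
import torch.nn as nn

# Toy stand-in for the frozen GloVe embedding (real one is 400000 x 50)
emb = nn.Embedding.from_pretrained(torch.randn(1000, 50), freeze=True, padding_idx=0)

# Count trainable vs. total parameters
trainable = sum(p.numel() for p in emb.parameters() if p.requires_grad)
total = sum(p.numel() for p in emb.parameters())
print(trainable, total)  # -> 0 50000
```

So the frozen matrix contributes to memory and the lookup, but not to the gradient computation.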
Then I noticed that training the LSTM network takes approximately 3 to 5 minutes per epoch, which seems quite long for only 150,000 trainable parameters. I was wondering whether this is caused by using the whole embedding matrix with 400,000 words, or whether it is normal for a bidirectional LSTM.
Is it worth creating a minimized version of the GloVe embedding matrix containing only the words that appear in my sentences, or does using the whole GloVe embedding matrix not affect training performance?
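In case it helps frame the question, this is roughly how I imagine the minimized matrix would be built (the tiny hand-made vocabulary and random vectors below stand in for the real glove_vectors.stoi / glove_vectors.vectors from torchtext):

```python
import torch
import torch.nn as nn

# Stand-ins for glove_vectors.stoi and glove_vectors.vectors
full_stoi = {'<pad>': 0, 'the': 1, 'movie': 2, 'was': 3, 'great': 4, 'terrible': 5}
full_vectors = torch.randn(len(full_stoi), 50)

# Only the words that actually occur in my sentences
corpus_vocab = ['<pad>', 'movie', 'great']

# New word -> index mapping and the reduced embedding matrix
reduced_stoi = {w: i for i, w in enumerate(corpus_vocab)}
reduced_vectors = torch.stack([full_vectors[full_stoi[w]] for w in corpus_vocab])

embedding = nn.Embedding.from_pretrained(reduced_vectors, freeze=True, padding_idx=0)
print(embedding)  # -> Embedding(3, 50, padding_idx=0)
```

The sentences would of course have to be re-indexed with reduced_stoi instead of the full GloVe vocabulary.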