Training an LSTM with a very large vocabulary

I’m training an LSTM on a very large corpus. The LSTM uses a pre-trained embedding matrix of 120,000 words, obtained with word2vec. The training set consists of 900,000 phrases of length 30 each, and the validation set has 270,000 samples. The problem is that when I run this on Google Colab with the maximum RAM and the most powerful GPU, I get a CUDA out-of-memory error. I reduced the batch size to 128, but I think most of the memory is occupied by the network and the vocabulary, so I’m asking whether there are any tricks for working with very large datasets and networks.
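For reference, this is roughly how the model is set up (the embedding dimension and hidden size below are placeholders, not my exact values):

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HIDDEN = 120_000, 300, 512   # EMB_DIM / HIDDEN are placeholders

# Pre-trained word2vec matrix; random here only so the sketch runs on its own.
pretrained = torch.randn(VOCAB_SIZE, EMB_DIM)

class PhraseModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.lstm = nn.LSTM(EMB_DIM, HIDDEN, batch_first=True)

    def forward(self, token_ids):                  # token_ids: (batch, 30)
        output, (h_n, c_n) = self.lstm(self.emb(token_ids))
        return h_n[-1]                             # last hidden state per phrase

model = PhraseModel().cuda()
batch = torch.randint(0, VOCAB_SIZE, (128, 30)).cuda()   # batch size 128, phrase length 30
features = model(batch)                                  # the OOM happens while training this
```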

Can you use sparse tensors for the embeddings (https://pytorch.org/docs/stable/sparse.html), or keep the embeddings in host memory?
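Untested sketch of both ideas (the dimensions are made up, adapt to your setup):

```python
import torch
import torch.nn as nn

pretrained = torch.randn(120_000, 300)            # your word2vec matrix
device = torch.device("cuda")

# Option 1: sparse gradients for the embedding, so only the rows seen in a batch
# get gradient updates. Note the weight matrix itself still lives on the GPU;
# you also need an optimizer that supports sparse grads (e.g. torch.optim.SparseAdam)
# for this parameter.
embedding_sparse = nn.Embedding.from_pretrained(pretrained, freeze=False, sparse=True).to(device)

# Option 2: keep the large embedding table on the CPU and only move the
# looked-up vectors for the current batch onto the GPU.
embedding_cpu = nn.Embedding.from_pretrained(pretrained, freeze=True)   # stays on host
lstm = nn.LSTM(input_size=300, hidden_size=512, batch_first=True).to(device)

token_ids = torch.randint(0, 120_000, (128, 30))  # batch of 128 phrases, length 30
vectors = embedding_cpu(token_ids)                # lookup happens on the CPU
output, (h_n, c_n) = lstm(vectors.to(device))     # only the batch is copied to the GPU
```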

Sorry, not a solution, but may I ask what your task is? Classification? Translation? 120k is quite a large vocabulary. Training an RNN is tricky enough, so always try to minimize the vocabulary as much as possible to make it easier for the network. Some common considerations/steps:

  • What’s the distribution of words? I would hazard a guess that the top 10k words might already cover 95% of your content. For classification, that’s probably already enough. (A quick way to check this is sketched after this list.)
  • What do you do for preprocessing? I often work with informal text. While I’m fine with “lol”, I don’t want “loool” or “loooool” to count as different words. Such noise can blow up a vocabulary quickly. (See the normalization sketch below.)
  • Do you have many named entities (e.g., names of persons, locations, organizations) or numbers in your text? In the case of translation, such “special words” won’t get translated anyway, just copied. So a common approach is to first replace them with placeholder tokens (e.g., “I met Alice in London” becomes “I met <person> in <location>”), do the translation, and then put the names back in. (A rough masking sketch is below as well.)
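
To make the first bullet concrete, here is a rough sketch (assuming your phrases are already tokenized into lists of words) for checking how much of the corpus the top-k words cover:

```python
from collections import Counter

def coverage(tokenized_phrases, k=10_000):
    """Fraction of all tokens covered by the k most frequent words."""
    counts = Counter(tok for phrase in tokenized_phrases for tok in phrase)
    total = sum(counts.values())
    top_k = sum(c for _, c in counts.most_common(k))
    return top_k / total

# coverage(train_phrases, 10_000) -> 0.95 would mean the top 10k words already
# account for 95% of the tokens; everything else can be mapped to <unk>.
```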
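For the preprocessing point, one common trick is to collapse elongated character runs before building the vocabulary, along these lines:

```python
import re

def normalize_elongations(text, max_repeat=2):
    # Collapse runs of the same character ("loooool" -> "lool") so spelling
    # variants of the same informal word don't each become a vocabulary entry.
    return re.sub(r"(.)\1{%d,}" % max_repeat, r"\1" * max_repeat, text)

print(normalize_elongations("loooool"))  # -> "lool"
```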
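And for the named-entity point, the masking step could look roughly like this (I’m using spaCy’s NER purely as an example; any entity tagger would do, and the placeholder token names are made up):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, assumed to be installed

def mask_entities(text):
    # Replace person/location/organization names with placeholder tokens so they
    # don't inflate the vocabulary; keep the originals to copy back in afterwards.
    doc = nlp(text)
    masked, originals = text, []
    for ent in reversed(doc.ents):               # reversed so character offsets stay valid
        if ent.label_ in {"PERSON", "GPE", "LOC", "ORG"}:
            originals.append((ent.label_, ent.text))
            masked = masked[:ent.start_char] + f"<{ent.label_.lower()}>" + masked[ent.end_char:]
    return masked, originals

# mask_entities("I met Alice in London")
# -> roughly ("I met <person> in <gpe>", [("GPE", "London"), ("PERSON", "Alice")])
```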

Just some food for thought.
