How to create a bilingual LSTM

I am trying to create a bilingual LSTM trained on a bilingual corpus in English and Spanish. For monolingual datasets in English, we only need a pre-trained English embedding like fastText to get the word vectors for the IDs of the tokenized text. For example, for the sentence “I like dog”, I could tokenize it into [“i”, “like”, “dog”] and use the corresponding IDs to retrieve the vectors.
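Concretely, my monolingual pipeline looks roughly like this (a sketch; the vocabulary and the vector matrix are placeholders for what I actually load from a fastText file):

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary built from the tokenized corpus
vocab = {"<unk>": 0, "i": 1, "like": 2, "dog": 3}

# Pre-trained fastText vectors loaded into a (vocab_size, 300) matrix.
# In practice these come from a file such as wiki.en.vec; random values
# here just to show the shapes.
pretrained = torch.randn(len(vocab), 300)
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

tokens = ["i", "like", "dog"]
ids = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])
vectors = embedding(ids)  # shape: (3, 300)
```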

However, I don’t think we can do that for bilingual datasets, so I need a bilingual embedding in English and Spanish. For example, for the sentence “I like dog, me gusta el perro.”, I could tokenize it into [“i”, “like”, “dog”, “me”, “gusta”, “el”, “perro”], but I don’t know what to do next.

Another problem arises: I have already looked at commonly used multilingual embedding techniques like MUSE, but I am not quite sure how to incorporate them into my model. An alternative approach I have in mind is to create a joint vocabulary in English and Spanish and train an embedding from scratch on it.

So, my questions are: (1) How can I use a multilingual embedding in English and Spanish from MUSE to process the example sentence “I like dog, me gusta el perro.”? (2) Does creating a joint-vocabulary embedding work for this problem?

Work through the tutorial examples:

https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

They will likely answer most of your questions.
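On your MUSE question specifically: MUSE publishes fastText vectors for English and Spanish that have already been aligned into one shared vector space, so you can merge both vocabularies into a single embedding matrix and use it exactly like a monolingual embedding. A sketch, assuming the aligned files wiki.multi.en.vec and wiki.multi.es.vec from the MUSE repository (the loader and the word cutoff are illustrative):

```python
import torch
import torch.nn as nn

def load_vec(path, max_words=50000):
    """Read a MUSE-style .vec file: first line is 'count dim',
    then one 'word v1 ... v300' per line."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line
        for i, line in enumerate(f):
            if i >= max_words:
                break
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append([float(x) for x in parts[1:]])
    return words, torch.tensor(vecs)

# Aligned embeddings from the MUSE repo; both live in the same space.
en_words, en_vecs = load_vec("wiki.multi.en.vec")
es_words, es_vecs = load_vec("wiki.multi.es.vec")

# Joint vocabulary: for homographs present in both files (e.g. "me"),
# whichever vector comes first (English here) is kept.
vocab = {"<unk>": 0, "<pad>": 1}
rows = [torch.zeros(300), torch.zeros(300)]  # vectors for <unk>/<pad>
for word, vec in zip(en_words + es_words, torch.cat([en_vecs, es_vecs])):
    if word not in vocab:
        vocab[word] = len(vocab)
        rows.append(vec)

embedding = nn.Embedding.from_pretrained(torch.stack(rows), freeze=True)

# A mixed sentence can now be embedded like any monolingual one:
tokens = ["i", "like", "dog", "me", "gusta", "el", "perro"]
ids = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])
lstm = nn.LSTM(input_size=300, hidden_size=128, batch_first=True)
output, (h, c) = lstm(embedding(ids).unsqueeze(0))  # output: (1, 7, 128)
```

This is essentially your second idea as well, except the joint-vocabulary vectors come pre-trained and pre-aligned. Training a joint embedding from scratch can also work, but it usually needs a fairly large bilingual corpus to be competitive.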

Additionally, have a look at the paper “Attention Is All You Need” (2017). It might persuade you to consider Transformers instead of LSTMs. PyTorch has tutorials for those, too.
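If you do go that route, the embedding side stays exactly the same; only the sequence model changes. A minimal sketch (hyperparameters are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Same embedded input as before, encoded with a Transformer instead of an LSTM.
encoder_layer = nn.TransformerEncoderLayer(d_model=300, nhead=6, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

embedded = torch.randn(1, 7, 300)  # (batch, seq_len, embedding_dim)
encoded = encoder(embedded)        # (1, 7, 300)
```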