I am trying to create a bilingual LSTM trained on an English–Spanish corpus. For monolingual datasets in English, we only need a pre-trained English embedding such as fastText to map the IDs from the tokenized text to word vectors. For example, for the sentence “I like dog”, I could tokenize it into [“i”, “like”, “dog”], look up the corresponding IDs, and retrieve the vectors.
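To make the monolingual pipeline concrete, here is a minimal sketch of what I mean. The tiny embedding table is a hypothetical stand-in for real fastText vectors (which would normally be loaded from a .vec/.bin file); the tokenizer is deliberately naive.

```python
# Hypothetical 3-dimensional vectors standing in for fastText.
toy_fasttext = {
    "i":    [0.1, 0.2, 0.3],
    "like": [0.4, 0.5, 0.6],
    "dog":  [0.7, 0.8, 0.9],
}

def tokenize(sentence):
    """Lowercase whitespace tokenizer, stripping basic punctuation."""
    return sentence.lower().replace(",", "").replace(".", "").split()

# Build a vocabulary (token -> id) and an id-indexed vector table.
vocab = {tok: i for i, tok in enumerate(toy_fasttext)}
vectors = [toy_fasttext[tok] for tok in vocab]

tokens = tokenize("I like dog")
ids = [vocab[t] for t in tokens]          # [0, 1, 2]
embedded = [vectors[i] for i in ids]      # what the LSTM would consume
```

So for English alone, the chain is sentence → tokens → IDs → vectors, and the pre-trained table does all the work.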
However, I don’t think that works for bilingual datasets, so I need a bilingual embedding covering both English and Spanish. For example, for the sentence “I like dog, me gusta el perro.”, I could tokenize it into [“i”, “like”, “dog”, “me”, “gusta”, “el”, “perro”]. But I don’t know what to do next.
Another problem arises here: I have already looked at commonly used multilingual embedding techniques like MUSE, but I am not quite sure how to incorporate them into my model. Another approach I have in mind is to build a joint English–Spanish vocabulary and train an embedding from scratch on top of it.
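For what it's worth, here is how I currently picture using MUSE: since MUSE publishes aligned .vec text files per language (a header line with count and dimension, then one word and its vector per line), I imagine merging the two files into a single lookup table. The two inline strings below are stand-ins for the real downloaded files, and the `en:`/`es:` prefixing is my own idea to avoid collisions between identical surface forms (e.g. “me” exists in both languages).

```python
import io

# Stand-ins for e.g. wiki.multi.en.vec / wiki.multi.es.vec (.vec format:
# header "count dim", then "word v1 v2 ..." per line).
fake_en_vec = "2 3\nlike 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n"
fake_es_vec = "2 3\ngusta 0.1 0.2 0.3\nperro 0.4 0.5 0.6\n"

def load_vec(fileobj, prefix):
    """Read a .vec file, tagging each word with its language prefix."""
    fileobj.readline()                     # skip the "count dim" header
    table = {}
    for line in fileobj:
        word, *values = line.rstrip().split(" ")
        table[prefix + word] = [float(v) for v in values]
    return table

# Because MUSE aligns both languages into one vector space, the merged
# table can then be used like a single monolingual embedding.
joint = {}
joint.update(load_vec(io.StringIO(fake_en_vec), "en:"))
joint.update(load_vec(io.StringIO(fake_es_vec), "es:"))
```

One catch with the prefixing is that it requires knowing each token's language at lookup time; the alternative would be to drop the prefixes and let one language's vector overwrite the other's for colliding words. I am not sure which is the right call.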
So, my questions are: (1) how can I use English–Spanish multilingual embeddings from MUSE to process the example sentence “I like dog, me gusta el perro.”, and (2) does creating a joint-vocab embedding work for this problem?
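To clarify what I mean in (2), this is the joint-vocab approach I have in mind: build one shared vocabulary over both languages and initialize an embedding matrix randomly, to be trained together with the LSTM. A sketch (in the real model the matrix would be a trainable layer such as `torch.nn.Embedding` updated by backprop):

```python
import random

random.seed(0)
corpus = ["i like dog , me gusta el perro ."]

# One shared vocabulary over both languages.
vocab = {}
for sentence in corpus:
    for tok in sentence.lower().split():
        vocab.setdefault(tok, len(vocab))

dim = 4
# Randomly initialized embedding matrix, one row per token;
# these vectors would be learned from scratch during training.
embedding = [[random.uniform(-0.1, 0.1) for _ in range(dim)]
             for _ in vocab]

ids = [vocab[t] for t in "me gusta el perro".split()]
vectors = [embedding[i] for i in ids]
```

My worry with this approach is that, unlike MUSE, nothing forces the English and Spanish vectors into a shared semantic space; they would only become comparable through whatever signal the bilingual training data provides.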