Embedding vs. one-hot representation

The PyTorch tutorial for sequence-to-sequence networks with attention uses word embeddings instead of one-hot representations as inputs to the LSTM network.

My question is: doesn't using embeddings affect the training or performance of the model? Two words with completely different semantics may end up with embeddings that have high cosine similarity, whereas their one-hot representations always have zero cosine similarity. How does the network deal with this?
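
For concreteness, here is a minimal sketch of the situation I mean (the 5-word vocabulary and embedding size are made up):

```python
import torch
import torch.nn.functional as F

# Two different words as one-hot vectors over a toy 5-word vocabulary:
# their cosine similarity is exactly 0 because they share no nonzero entry.
hot_a = F.one_hot(torch.tensor(1), num_classes=5).float()
hot_b = F.one_hot(torch.tensor(3), num_classes=5).float()
print(F.cosine_similarity(hot_a, hot_b, dim=0))  # tensor(0.)

# The same two words looked up in a (randomly initialized) embedding table
# can land anywhere, so their cosine similarity is generally nonzero.
emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)
vec_a = emb(torch.tensor(1))
vec_b = emb(torch.tensor(3))
print(F.cosine_similarity(vec_a, vec_b, dim=0))  # some value in (-1, 1)
```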

The embeddings are trainable, so if two dissimilar words start out with similar embeddings, the embedding layer should learn to push them apart during training.
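
A quick sketch of what "trainable" means here (toy sizes, and the loss is just a stand-in): the embedding table is an ordinary parameter, so the same backward pass that updates the LSTM also moves the embedding vectors.

```python
import torch
import torch.nn as nn

# Toy model: embedding lookup followed by an LSTM (sizes are arbitrary).
emb = nn.Embedding(num_embeddings=100, embedding_dim=16)
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
opt = torch.optim.SGD(list(emb.parameters()) + list(lstm.parameters()), lr=0.1)

tokens = torch.randint(0, 100, (4, 7))  # batch of 4 sequences, length 7
out, _ = lstm(emb(tokens))              # embeddings flow into the LSTM
loss = out.pow(2).mean()                # stand-in loss, just to backprop
loss.backward()

# The embedding rows for the tokens we used received gradients, so an
# optimizer step moves those vectors -- dissimilar words can drift apart.
print(emb.weight.grad.abs().sum(dim=1).nonzero().numel() > 0)  # True
opt.step()
```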

Thanks. Won't it prolong training, though, since the number of learnable parameters is much larger in this case?

Maybe, but then again, better embeddings may pay off by giving the rest of the model a more useful input representation.
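
To put rough numbers on the extra parameters (the vocabulary size and embedding dimension here are purely illustrative):

```python
# Extra trainable parameters introduced by the embedding layer:
# one vector of size embedding_dim per vocabulary word.
vocab_size, embedding_dim = 10_000, 256
embedding_params = vocab_size * embedding_dim
print(embedding_params)  # 2560000 extra weights to learn
```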
