Which will be faster: retrieving each word vector from a word-embedding dictionary, or from nn.Embedding()?

To get the embeddings of a sequence, if I use a Python dictionary, I have to hit the dictionary once for each word. But if I copy the vectors into nn.Embedding(), I only have to send the word indices for each sequence, and I can even get the word embeddings of several sequences in one call, for example:

import torch, torch.nn as nn
embedding = nn.Embedding(10, 3)  # e.g. a 10-word vocabulary with 3-dim vectors
input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])
embedding(input)  # one call returns a (2, 4, 3) tensor of word vectors

But PyTorch's embedding uses a numpy matrix as its lookup table.
So my question is: to execute the previous code, won't there be 8 hits on the numpy matrix? Or will they be retrieved in parallel? Even if it runs in parallel, there has to be another dictionary to convert words to indices, and that also needs 8 hits on the word2indices dictionary.
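Concretely, the word-to-index step I mean would look something like this (a toy sketch; the words and the word2idx mapping are made up for illustration):

import torch

word2idx = {"the": 1, "cat": 2, "sat": 4, "on": 5, "a": 3, "mat": 9}  # hypothetical vocabulary
sentence = ["the", "cat", "sat", "on"]
indices = torch.LongTensor([word2idx[w] for w in sentence])  # one dictionary hit per word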

I am asking these questions because of my original question. Which will be the faster process for a sequence (see the timing sketch below)?

  1. Getting each word vector from the word-vector dictionary, one word at a time.

  2. Getting the indices from the word2indices dictionary and then running code like the snippet above.
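To make the comparison concrete, here is a minimal timing sketch of both options (all sizes, names, and data are made up for illustration):

import time
import torch
import torch.nn as nn

vocab_size, dim, seq_len = 10000, 300, 512
words = [f"word{i}" for i in range(vocab_size)]
word2idx = {w: i for i, w in enumerate(words)}       # word -> index
vec_dict = {w: torch.randn(dim) for w in words}      # option 1: word -> vector
embedding = nn.Embedding(vocab_size, dim)            # option 2: index -> vector
sequence = [words[i % vocab_size] for i in range(seq_len)]

# Option 1: one dictionary hit per word, concatenated by hand
start = time.perf_counter()
out1 = torch.stack([vec_dict[w] for w in sequence])
t1 = time.perf_counter() - start

# Option 2: word2indices hits to build the index tensor, then one embedding call
start = time.perf_counter()
idx = torch.LongTensor([word2idx[w] for w in sequence])
out2 = embedding(idx)
t2 = time.perf_counter() - start

print(f"dict lookup + stack: {t1:.6f}s  |  word2indices + nn.Embedding: {t2:.6f}s")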

Hi,

We don’t use numpy objects to store things, only PyTorch Tensors.
The good thing about indexing is that a single indexing operation returns all the results already concatenated, whereas with a Python dictionary you would have to do that by hand.
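For illustration, here is a minimal sketch of that difference (toy sizes, made up): one indexing operation on the weight Tensor versus assembling the same result by hand from a dictionary:

import torch

weight = torch.randn(10, 3)                    # embedding weights stored as a Tensor
vec_dict = {i: weight[i] for i in range(10)}   # the same vectors in a Python dictionary
input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])

fast = weight[input]                           # single indexing op, shape (2, 4, 3)
slow = torch.stack([torch.stack([vec_dict[int(i)] for i in row]) for row in input])
assert torch.equal(fast, slow)                 # same result, assembled by hand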
