Yes, I know, and I mentioned in my first reply that it would be slower. Of course EmbeddingBag is more efficient, otherwise it wouldn't exist. I'm just saying that using Embedding for this is likely not "too slow" and probably not a huge performance bottleneck for your particular use case.
If you are really concerned, you can run some tests and measure the time difference. If this feature is crucial to you, you can open a GitHub feature request and see if anyone wants to implement it, or you can do it yourself.
If I want to replicate the behavior of nn.EmbeddingBag with nn.Embedding, is padding the only way to do it?
Assuming the following are the inputs to nn.EmbeddingBag:
idx = [0, 3, 2, 4, 5, 9, 23]
offset = [0, 2, 4]
In this case, the equivalent input to nn.Embedding would be [[0, 3], [2, 4], [5, 9, 23]].
So to make a batch from this input, do I need to pad?
I think I can increase the indices by 1 and pad with 0 up to the max length of the inputs, which is 3 in this case. Then I can use the padding_idx option in nn.Embedding.
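For what it's worth, here is a minimal sketch of that shift-and-pad idea, assuming mode="sum" and a toy 24-entry embedding table (the sizes and data are made up for illustration). Shifting all indices by 1 frees index 0 to serve as padding_idx, so the pad positions contribute zero vectors to the sum:

```python
import torch
import torch.nn as nn

idx = torch.tensor([0, 3, 2, 4, 5, 9, 23])
offsets = torch.tensor([0, 2, 4])

# Reference: EmbeddingBag with mode="sum"
bag = nn.EmbeddingBag(24, 8, mode="sum")

# Emulation: an Embedding with one extra row; index 0 is the pad row,
# which padding_idx=0 initializes to zeros, and all real indices shift up by 1.
emb = nn.Embedding(25, 8, padding_idx=0)
with torch.no_grad():
    emb.weight[1:] = bag.weight  # share the same table, offset by 1

# The bags [[0, 3], [2, 4], [5, 9, 23]], shifted by 1 and padded with 0
padded = torch.tensor([
    [1, 4, 0],
    [3, 5, 0],
    [6, 10, 24],
])

# Summing over the sequence dimension reproduces EmbeddingBag's output,
# since the pad rows are all-zero vectors.
out_emulated = emb(padded).sum(dim=1)
out_reference = bag(idx, offsets)
print(torch.allclose(out_emulated, out_reference))  # True
```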
However, my concern is that if the max length is large, I would have to store a lot of 0s, which is very memory inefficient.
Is there any way that I can achieve this efficiently?
I doubt it would be too bad with respect to storing 0s. Unlike TensorFlow, you only have to pad up to the length of the longest example in each batch, not the longest example in the dataset. If it were really important not to pad too much per batch, you could sort your examples by length and then create your batches; this is a common technique for making RNN models more efficient.
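The sort-then-batch idea might look something like this sketch (the toy data and batch size are made up; in the shifted-index scheme above, pad_value=0 would line up with padding_idx=0):

```python
# Hypothetical variable-length index lists, as in the question.
examples = [[5, 9, 23], [0, 3], [2, 4], [7], [1, 2, 3, 4]]

# Sort by length so that examples of similar length land in the same batch.
examples_sorted = sorted(examples, key=len)
batch_size = 2
batches = [examples_sorted[i:i + batch_size]
           for i in range(0, len(examples_sorted), batch_size)]

def pad_batch(batch, pad_value=0):
    # Pad only up to the longest example *within this batch*,
    # not the longest example in the whole dataset.
    max_len = max(len(ex) for ex in batch)
    return [ex + [pad_value] * (max_len - len(ex)) for ex in batch]

padded = [pad_batch(b) for b in batches]
# Each batch now wastes at most (batch max length - example length) pad slots.
```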
That being said, in at least the NLP models where I have used embeddings/embedding bags, that part is quite fast compared to the rest of the model. I would recommend implementing the model first, and optimizing only if it turns out to be too slow, rather than worrying about it right away.