Variable length indexing of embeddings without padding

Hi there,

I am training word embeddings with our new model.
The problem I am facing is that my input indices have different lengths.
For example, the input might be [LongTensor([1,2,3]), LongTensor([5,8])].
Because the Embedding class only accepts a LongTensor, I cannot pass in a list.
After some searching, I found two possible solutions:

  1. Use a padding index. However, this method requires a large amount of CUDA memory because my indices are actually very sparse. For example, the total number of possible indices might be around 100000, while each instance only has about 100 active indices. So, in this case, padding is not memory efficient.
  2. Go through the list one by one and index the embedding row by row. This works, but it is too slow: at least 10 times slower than a single batched lookup.

Is there a third method that could resolve my situation?
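
For concreteness, here is a minimal sketch of the two options above (the sizes and rows are just placeholders, not my real data):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Toy setup; the sizes are placeholders.
emb = nn.Embedding(100000, 64, padding_idx=0)
rows = [torch.LongTensor([1, 2, 3]), torch.LongTensor([5, 8])]

# Option 1: pad every row to the longest one, then do a single lookup.
# Wastes memory when rows are much shorter than the maximum length.
padded = pad_sequence(rows, batch_first=True, padding_value=0)  # shape (2, 3)
out_padded = emb(padded)                                        # shape (2, 3, 64)

# Option 2: loop over the rows one by one. No padding, but the Python
# loop makes it much slower than one batched lookup.
out_list = [emb(r) for r in rows]   # list of (len_i, 64) tensors
```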


Have you looked into pad_packed_sequence and pack_padded_sequence? This blog post https://medium.com/huggingface/understanding-emotions-from-keras-to-pytorch-3ccb61d5a983 has a section describing packed sequences, which might be what you are looking for, unless I am misunderstanding your question.
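
Roughly what I have in mind, in case it helps (a sketch that assumes an LSTM downstream, so it may not fit your setup):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

emb = nn.Embedding(100, 8, padding_idx=0)
lstm = nn.LSTM(8, 16, batch_first=True)

rows = [torch.LongTensor([1, 2, 3]), torch.LongTensor([5, 8])]
lengths = torch.tensor([len(r) for r in rows])

padded = pad_sequence(rows, batch_first=True)              # (2, 3), zero-padded
packed = pack_padded_sequence(emb(padded), lengths,
                              batch_first=True, enforce_sorted=False)
output, (h, c) = lstm(packed)                              # the LSTM skips the padding
output, _ = pad_packed_sequence(output, batch_first=True)  # back to a padded tensor
```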

A packed sequence cannot be used to index an Embedding; it can only be fed to an RNN or LSTM.
An Embedding can only be indexed with a LongTensor, I think.

You are correct that packed sequences are for LSTMs.

If you have 100000 indices, how will you backpropagate through time? I would use some kind of sequence window (say 10) and pass at most 10 inputs at a time. You could also do it like in the AWD-LSTM paper, where the BPTT window is random (see https://arxiv.org/pdf/1708.02182.pdf for more details).

Not sure if this answers your question though.
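
For what it's worth, here is a rough sketch of the random window idea (the sampling scheme described in the AWD-LSTM paper; the defaults are just examples):

```python
import numpy as np

def random_bptt_len(bptt=70, p=0.95, std=5, min_len=5):
    # With probability p use the base window, otherwise half of it,
    # then jitter the result with Gaussian noise.
    base = bptt if np.random.random() < p else bptt / 2.
    return max(min_len, int(np.random.normal(base, std)))

# At each step, take the next random_bptt_len() tokens from the training stream.
```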

Thanks for the reply.
Perhaps I have not described my problem clearly.
Let me try again.
I have a list of index lists, like A = [[1,2,3], [2,3,4,5,7], …], where each entry has a different length.
Now I want to use these 2-D indices to index an Embedding.
For example, Embedding(A[0]) means I want three embeddings at positions 1, 2, 3; Embedding(A[1]) means I want five embeddings at positions 2, 3, 4, 5, 7.
But I cannot call Embedding(A) directly, because each A[i] has a different length.

If we use padding, padding(A) = B = [[1,2,3,0,0], [2,3,4,5,7]], and then we can use Embedding(B) to look up all the embeddings I need (0 is the padding index).
However, the max length of A[i] can be huge, e.g. 1000000.
If we still use padding, most of the memory is wasted.

And in my model, len(A) = 20000, and I need to compute Embedding(A) in every batch.
With a small dataset, padding is totally fine.
But with a large dataset, most of the CUDA memory is wasted.
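
A back-of-the-envelope estimate with the numbers above, just to show the scale of the waste (rough arithmetic, not measured):

```python
# Size of the index tensor alone (int64 = 8 bytes per entry).
batch_rows = 20000      # len(A)
max_len    = 1000000    # worst-case row length
avg_len    = 100        # typical number of active indices per row

padded_bytes = batch_rows * max_len * 8   # ~160 GB just for the padded indices
flat_bytes   = batch_rows * avg_len * 8   # ~16 MB if the rows were stored flat
print(padded_bytes / 1e9, "GB vs", flat_bytes / 1e6, "MB")
```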

My model does not use an LSTM or RNN, so I do not need BPTT here…

I hope that explains it clearly.

Hmm, I see I completely missed the point (probably because I am playing with LSTMs at the moment, haha).

I don’t see any easy solution, unfortunately. One way could be to first sort your data by length and set shuffle=False in the DataLoader. Say the first 100 samples only contain sequences of length at most 4; then you can adjust your Dataset output accordingly, so each batch is only padded to its own maximum length. Not sure if there are any easier or better approaches available.
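
A rough sketch of what I mean (the class and names are made up), so each batch is only padded to its own maximum length:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class SortedIndexDataset(Dataset):
    """Rows sorted by length, so neighbouring samples have similar lengths."""
    def __init__(self, rows):
        self.rows = sorted(rows, key=len)   # rows: list of 1-D LongTensors

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        return self.rows[i]

def collate(batch):
    # Pad only up to the longest row in *this* batch.
    return pad_sequence(batch, batch_first=True, padding_value=0)

# Fake data just for the sketch: 1000 rows of length 50–149.
rows = [torch.randint(1, 100000, (int(n),))
        for n in torch.randint(50, 150, (1000,))]
loader = DataLoader(SortedIndexDataset(rows), batch_size=32,
                    shuffle=False, collate_fn=collate)
```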

Thanks for the help.
I will try your method.

Hello @flyaway, any luck solving this problem? I’m interested in doing exactly the same thing.

In case you are interested in the mean/max/sum of the embeddings in each row (where rows can have varying lengths), an EmbeddingBag may give you what you want:
https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html
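
A minimal sketch of the flat-indices-plus-offsets call, so no padding is needed (the mode and sizes are just examples):

```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(100000, 64, mode='mean')

rows = [torch.LongTensor([1, 2, 3]), torch.LongTensor([2, 3, 4, 5, 7])]
flat = torch.cat(rows)                       # tensor([1, 2, 3, 2, 3, 4, 5, 7])
offsets = torch.tensor([0] + [len(r) for r in rows[:-1]]).cumsum(0)   # tensor([0, 3])

out = bag(flat, offsets)   # shape (2, 64): one mean-pooled vector per row
```

Note that the output is one pooled vector per row rather than the individual embeddings, so this only fits if a sum/mean/max per row is what you need.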
