Embedding 3-D data

Hi. I want to implement character-level embedding.

This is the usual word embedding:

  1. Word Embedding
    Input: [ ['who', 'is', 'this'] ]
    -> [ [3, 8, 2] ]        # (batch_size, sentence_len)
    -> Embedding(Input)     # (batch_size, sentence_len, embedding_dim)
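A minimal runnable sketch of this step (vocabulary and embedding sizes are made up, and it assumes a recent PyTorch where no Variable wrapper is needed):

    import torch
    import torch.nn as nn

    embedding = nn.Embedding(10, 5)    # vocab_size=10, embedding_dim=5 (made up)
    x = torch.LongTensor([[3, 8, 2]])  # (batch_size=1, sentence_len=3)
    out = embedding(x)                 # (1, 3, 5)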

This is what I want to do:
  2. Character Embedding
    Input: [ [ ['w', 'h', 'o', 0], ['i', 's', 0, 0], ['t', 'h', 'i', 's'] ] ]
    -> [ [ [2, 3, 9, 0], [11, 4, 0, 0], [21, 10, 8, 9] ] ]   # (batch_size, sentence_len, word_len)
    -> Embedding(Input)                   # (batch_size, sentence_len, word_len, embedding_dim)
    -> sum over the character embeddings  # (batch_size, sentence_len, embedding_dim)
The final output shape is the same as for word embedding, because I want to concatenate the two later.
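The last reduction step is just a sum over the word_len dimension; a tiny illustration with made-up sizes:

    import torch

    chars_embd = torch.rand(1, 3, 4, 5)   # (batch, sentence_len, word_len, embd_size)
    word_like = torch.sum(chars_embd, 2)  # (1, 3, 5): one vector per word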

I tried, but I am not sure how to implement a 3-D embedding. Does anyone know how to implement this?

    def forward(self, x):
        print('x', x.size())  # (N, seq_len, word_len)
        bs = x.size(0)
        seq_len = x.size(1)
        word_len = x.size(2)
        embd_list = []
        for i, elm in enumerate(x):
            tmp = torch.zeros(1, word_len, self.embd_size)
            for chars in elm:
                # fails here: tmp is a plain FloatTensor but self.embedding returns a Variable
                tmp = torch.add(tmp, 1.0, self.embedding(chars.unsqueeze(0)))

The code above raises an error because the output of self.embedding is a Variable while tmp is a plain tensor:

TypeError: torch.add received an invalid combination of arguments - got (torch.FloatTensor, float, Variable), but expected one of:
 * (torch.FloatTensor source, float value)
 * (torch.FloatTensor source, torch.FloatTensor other)
 * (torch.FloatTensor source, torch.SparseFloatTensor other)
 * (torch.FloatTensor source, float value, torch.FloatTensor other)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor, float, Variable)
 * (torch.FloatTensor source, float value, torch.SparseFloatTensor other)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor, float, Variable)

Update

I can do this with for loops, but it may be slow.

    def forward(self, x):
        print('x', x.size())  # (N, seq_len, word_len)
        bs = x.size(0)
        seq_len = x.size(1)
        word_len = x.size(2)
        embd = Variable(torch.zeros(bs, seq_len, self.embd_size))
        for i, elm in enumerate(x):  # every sample (one sentence)
            for j, chars in enumerate(elm):  # every word, e.g. ['w', 'h', 'o', 0]
                chars_embd = self.embedding(chars.unsqueeze(0))  # (1, word_len, embd_size)
                chars_embd = torch.sum(chars_embd, 1)  # (1, embd_size), sum each char's embedding
                embd[i, j] = chars_embd[0]  # use the summed char embeddings as a word-like embedding

        x = embd  # (N, seq_len, embd_size)
        # (rest of forward as before)
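For reference, a vectorized sketch that should be equivalent to the loops above: flatten the word dimension so that self.embedding only ever sees a 2-D index tensor, embed, reshape back, and sum over characters (a sketch, not tested against the rest of the model):

    def forward(self, x):
        # x: (N, seq_len, word_len) LongTensor of character ids
        bs, seq_len, word_len = x.size(0), x.size(1), x.size(2)
        flat = x.view(bs, seq_len * word_len)  # (N, seq_len*word_len), 2-D for nn.Embedding
        embd = self.embedding(flat)            # (N, seq_len*word_len, embd_size)
        # reshape back and sum over each word's characters; the final view keeps
        # the shape (N, seq_len, embd_size) regardless of sum's keepdim default
        embd = torch.sum(embd.view(bs, seq_len, word_len, -1), 2)
        x = embd.view(bs, seq_len, -1)         # (N, seq_len, embd_size)
        # (rest of forward as before)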

First, avoid for loops in your forward. You want to use tensor operations that call into C code.

If I understand embedding correctly, with an embedding of N dimensions and a dictionary of size S, you want to create N maps, each associating every integer from the input with a unique float.

Then, torch.gather() is your best friend :wink:

# create an example input with 30 possible characters,
# batch_size = 3, sentence_len = 7, word_len = 5
x = torch.floor(torch.rand(3,7,5)*30).long()

# create embedding weights: 4 dictionaries over 30 characters,
# mapping to the [-1,1] segment
w = 1 - 2*torch.rand(30,4,1,1,1)

# get indices ready to pick values in w:
idx = torch.cat([x.unsqueeze(0)]*4, 0).unsqueeze(0)  # (1,4,3,7,5)
# align w with the indices, pick values from w,
# and squeeze the fake dimension used for alignment:
x_emb = w.expand_as(30,4,3,7,5).gather(0,idx).squeeze()  # (4,3,7,5)

I think x_emb is what you are looking for.

Thank you @alexis-jacq, but your code raised an error, and I updated my post with a for-loop version.

TypeError: expand_as() takes 2 positional arguments but 6 were given

I am also not sure about torch.rand(30,4,1,1,1). Does the 4 mean the embedding size? And what are the last three 1s?

Try using expand instead of expand_as.

It is still not clear to me… and there is still an error when using expand instead of expand_as, this time raised by gather.
Is there any way to implement this with an embedding layer? I would like to use the nn.Embedding layer.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-187-593f47b27cfe> in <module>()
     15 # align w with indices and pick values from w
     16 # and squeeze the fake dimention used for aligment:
---> 17 x_emb = w.expand(30,4,3,7,5).gather(0,idx).squeeze() # (4,3,7,5)
     18 print('x_embd', x_emb.size())

RuntimeError: Expected tensor [1 x 4 x 1 x 7 x 5], src [30 x 4 x 3 x 7 x 5] and index [1 x 4 x 1 x 7 x 5] to have the same size in dimension 0 at /pytorch/torch/lib/TH/generic/THTensorMath.c:445

Ok, I wrote too fast, it's expand and not expand_as.

I tested this on my machine and it runs without any bug:

    x = torch.floor(torch.rand(3,7,5)*30).long()
    w = 1 - 2*torch.rand(30,4,1,1,1)
    idx = torch.cat([x.unsqueeze(0)]*4, 0).unsqueeze(0)
    x_emb = w.expand(30,4,3,7,5).gather(0, idx).squeeze()

From the error you got, it looks like you tried with batch_size = 1 while not changing the batch size in the expand and gather line.
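If that is the case, a hedged guess at the fix: keep the sizes consistent and squeeze only the fake first dimension, e.g.

    x = torch.floor(torch.rand(1,7,5)*30).long()            # batch_size = 1
    w = 1 - 2*torch.rand(30,4,1,1,1)
    idx = torch.cat([x.unsqueeze(0)]*4, 0).unsqueeze(0)     # (1,4,1,7,5)
    x_emb = w.expand(30,4,1,7,5).gather(0, idx).squeeze(0)  # (4,1,7,5)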

Did you figure it out? I'm also curious how to use nn.Embedding for a batch of multiple variable-length sentences.
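One common approach is padding (a sketch, assuming a recent PyTorch; the token ids and sizes are made up): pad every sentence in the batch to the same length, and give nn.Embedding a padding_idx so the pad id maps to a zero vector.

    import torch
    import torch.nn as nn
    from torch.nn.utils.rnn import pad_sequence

    # two sentences of different lengths; token ids are made up
    sents = [torch.LongTensor([3, 8, 2]), torch.LongTensor([5, 1])]
    batch = pad_sequence(sents, batch_first=True, padding_value=0)  # (2, 3)

    embedding = nn.Embedding(10, 5, padding_idx=0)  # id 0 is the pad token
    out = embedding(batch)                          # (2, 3, 5)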