Merge two 2D tensors, into a 3D tensor

Elidor · December 1, 2020, 12:52pm

Hi everyone,

I have two 2D tensors of shape:

y = torch.Size([50, 61])   # (batch_size, max_len)
x = torch.Size([50, 800])  # (batch_size, n_lstm_units)

what I would like to get is a 3D tensor made like this:

z = (50, 61, 800)  # (batch_size, max_len, n_lstm_units)

how can I do?

What I tried to do was to increase the size of x and y via unsqueeze(), resulting in (50, 61, 1) and (50, 1, 800). The problem is that afterwards I don’t know how to join the tensors without concatenating the values but simply “transferring” a dimension to obtain (50, 61, 800) (the idea is not to concatenate the values, but only to “move” a dimension from one tensor to another … I don’t know if I get the idea).

KFrank · December 1, 2020, 6:48pm

Hello Elidor!

Elidor:

I have two 2D tensors of shape:
y = torch.Size([50, 61])   # (batch_size, max_len)
x = torch.Size([50, 800])  # (batch_size, n_lstm_units)
what I would like to get is a 3D tensor made like this:
z = (50, 61, 800)  # (batch_size, max_len, n_lstm_units)
how can I do?

I’m not sure what you’re asking here.

Please note that your x and y together have 50 * (61 + 800)
elements, while your desired z will have 50 * 61 * 800 elements,
a much larger number. So you can’t simply populate z with elements
of x and y.

I have no idea if this would be what you want, but you can make a
tensor of the desired shape by taking the batch outer product:

z = torch.bmm (x.unsqueeze (2), y.unsqueeze (1))

(Pytorch does have an outer-product function, torch.ger, torch.outer,
but it only works on vectors, hence the unsqueeze(), above.)

Best.

K. Frank

Elidor · December 2, 2020, 10:39am

Hello @KFrank !
Thank you for your answer. I try to explain a little better what my problem is because maybe I haven’t been very detailed.

At the beginning of my code, I have a tensor with shape x = torch.Size([50, 61, 140]) (batch_size, seq_len, embedding_dim) and a ndarray x_len = (50,). These two vectors I give them as input to a bilstm whose code is the following (I’ll save you all the forward function, I don’t think it’s important):

x = nn.utils.rnn.pack_padded_sequence(x, x_len, batch_first=True)
out1 = self.lstm1(x)
x, lengths = nn.utils.rnn.pad_packed_sequence(out1, batch_first=True)

After this code I have two tensors with shape x = ([50, 61, 800]) and x_len = ([50]).
I give these two vectors as input to a module to calculate the attention in the following way (x is the parameter corresponding to inputs and x_len is the one corresponding to lengths):

def forward(self, inputs, lengths):
    batch_size, max_len = inputs.size()[:2]  # batch_size = 50, max_len = 61

    # matrix mult
    # apply attention layer
    weights = torch.bmm(inputs,
                        self.att_weights  # (1, hidden_size)
                        .permute(1, 0)  # (hidden_size, 1)
                        .unsqueeze(0)  # (1, hidden_size, 1)
                        .repeat(batch_size, 1, 1)  # (batch_size, hidden_size, 1)
                        )

    # weights.shape = torch.Size([50, 61, 1])
    attentions = torch.softmax(F.relu(weights.squeeze()), dim=-1)   # torch.Size([50, 61])

    # create mask based on the sentence lengths
    mask = torch.ones(attentions.size(), requires_grad=True).cuda()  # torch.Size([50, 61])
    for i, l in enumerate(lengths):  # skip the first sentence
        if l < max_len:
            mask[i, l:] = 0

    # apply mask and renormalize attention scores (weights)
    masked = attentions * mask
    _sums = masked.sum(-1).unsqueeze(-1)  # sums per row    # torch.Size([50, 1])

    attentions = masked.div(_sums)   # torch.Size([50, 61])

    # apply attention weights
    weighted = torch.mul(inputs, attentions.unsqueeze(-1).expand_as(inputs))   # torch.Size([50, 61, 800])

    # get the final fixed vector representations of the sentences
    representations = weighted.sum(1).squeeze()   # ([50, 800])

    return representations, attentions   # representations =  ([50, 800]) , attentions = ([50, 61])

What I would like to obtain as the output of this forward function is a tensor with shape ([50, 61, 800]) instead of two tensors made in that way.

Do you think there is a way to do it without “affecting” the attention calculation?

Thank you so much,

best regards.

KFrank · December 3, 2020, 3:41am

Hello Elidor!

I don’t have any advice for you, as I don’t have any intuition about what
these values might mean.

Perhaps if you could give a very high level description of your use case,
some expert in that area might have suggestions. Are you working on
some sort of (semi-) standard model? Is this a research direction that
others might have experience with?

Let me also reiterate that, at a lower level, a core issue is the amount
of “information” you have. If you want to turn [50, 61] and [50, 800]
into [50, 61, 800], you either have to get the additional information
from somewhere else, or your [50, 61, 800] has to encode the limited
information you have in some redundant way (such as using the outer
product I mentioned earlier).

Good luck.

K. Frank

Elidor · December 7, 2020, 4:07pm

Perhaps if you could give a very high level description of your use case,
some expert in that area might have suggestions. Are you working on
some sort of (semi-) standard model? Is this a research direction that
others might have experience with?

Ok, it seems like a good idea, so i try.
I started studying this model for dependency parsing. Its results are good:

Model(
  (dropout): Dropout(p=0.6, inplace=False)
  (word_embedding): Embedding(25413, 100, padding_idx=0)
  (tag_embedding): Embedding(20, 40, padding_idx=0)
  (bilstm): LSTM(908, 600, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
  (bilstm_to_hidden1): Linear(in_features=1200, out_features=500, bias=True)
  (hidden1_to_hidden2): Linear(in_features=500, out_features=150, bias=True)
  (hidden2_to_pos): Linear(in_features=150, out_features=101, bias=True)
  (hidden2_to_dep): Linear(in_features=300, out_features=47, bias=True)
)

The bilstm input is formed by:

# INPUT: x = torch.Size([50, 61, 140]) and x_lengths = (50,)
x = torch.nn.utils.rnn.pack_padded_sequence(x, x_lengths, batch_first=True)
x, _ = self.bilstm(x)
x, _ = torch.nn.utils.rnn.pad_packed_sequence(x, batch_first=True)
# OUTPUT: x = torch.Size([50, 61, 800]) --> (batch_size, seq_len, n_lstm_units)
x = x.contiguous()
x = x.view(-1, x.shape[2])
# OUTPUT: x = torch.Size([3050, 800]) --> (batch_size * seq_len, n_lstm_units)  
etc.

What i would like to do is add attention to the lstm (bilstm), so I defined a new model.

NewModel(
   THE PREVIOUS PART IS THE SAME AS THE ORIGINAL MODEL
  (bilstm): MyLSTM(
    (dropout): Dropout(p=0.3, inplace=False)
    (lstm1): LSTM(908, 600, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
    (atten1): Attention()
    (lstm2): LSTM(1200, 600, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
    (atten2): Attention()
  )
  THE NEXT PART IS THE SAME AS THE ORIGINAL MODEL

at this point my problem is to be able to report the output dimensions of the bilstm, after applying the attention, as those of the original model, so that the remaining part of the model can work.
The bilstm input is formed by:

# INPUT: x = torch.Size([50, 61, 140]) and x_len = (50,)
x = nn.utils.rnn.pack_padded_sequence(x, x_len, batch_first=True)
out1, (h_n, c_n) = self.lstm1(x)
x, lengths = nn.utils.rnn.pad_packed_sequence(out1, batch_first=True)
# OUTPUT: x = torch.Size([50, 61, 800]) --> (batch_size, seq_len, n_lstm_units)
# OUTPUT: lenghts: x = torch.Size([61])
x, att1 = self.atten1(x, lengths)  # skip connect
# OUTPUT: x = Torch.Size(50, 800) and att1 = Torch.Size(50, 61)

out2, (h_n, c_n) = self.lstm2(out1)
y, lengths = nn.utils.rnn.pad_packed_sequence(out2, batch_first=True)
y, att2 = self.atten2(y, lengths)
# OUTPUT: y = Torch.Size(50, 800) and att2 = Torch.Size(50, 61)

z = torch.cat([x, y], dim=1)
return z  # torch.Size([64, 1600])

So at the same point, after the forward() of the bilstm of the original model and after the forward() of the bilstm with attention I get two different results, respectively: torch.Size([3050, 800]) and torch.Size([64, 1600]). The difference comes from the fact that in the first model it is done (batch_size * seq_len, n_lstm_units) = (50 * 61, 800) while in the second model the results of x, att1 are the same dimension of y, att2 that is (50, 800) and (50, 61), respectively.

Anyone have any idea what i might try to do?
Is there a “standard” way to implement a self-attention module for a bilstm? I’ll put below the code I used if maybe it can be useful to better understand the problem …

class Attention(nn.Module):    

    def __init__(self, hidden_size, batch_first=False):
        super(Attention, self).__init__()

        self.hidden_size = hidden_size
        self.batch_first = batch_first

        self.att_weights = nn.Parameter(torch.Tensor(1, hidden_size), requires_grad=True)

        stdv = 1.0 / np.sqrt(self.hidden_size)

        for weight in self.att_weights:
            nn.init.uniform_(weight, -stdv, stdv)

    def forward(self, inputs, lengths):
        if self.batch_first: 
            batch_size, max_len = inputs.size()[:2]
        else:
            max_len, batch_size = inputs.size()[:2]

        # matrix mult
        # apply attention layer
        weights = torch.bmm(inputs,
                            self.att_weights  # (1, hidden_size)
                            .permute(1, 0)  # (hidden_size, 1)
                            .unsqueeze(0)  # (1, hidden_size, 1)
                            .repeat(batch_size, 1, 1)  # (batch_size, hidden_size, 1)
                            )

        attentions = torch.softmax(F.relu(weights.squeeze()), dim=-1)

        # create mask based on the sentence lengths
        mask = torch.ones(attentions.size(), requires_grad=True).cuda()
        for i, l in enumerate(lengths):  # skip the first sentence
            if l < max_len:
                mask[i, l:] = 0

        # apply mask and renormalize attention scores (weights)
        masked = attentions * mask
        _sums = masked.sum(-1).unsqueeze(-1)  # sums per row

        attentions = masked.div(_sums)

        # apply attention weights
        weighted = torch.mul(inputs, attentions.unsqueeze(-1).expand_as(inputs))

        # get the final fixed vector representations of the sentences
        representations = weighted.sum(1).squeeze()

        return representations, attentions

class MyLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, batch_first, bidirectional, dropout):
        super(MyLSTM, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.lstm1 = nn.LSTM(input_size=input_size,
                             hidden_size=hidden_size,
                             num_layers=num_layers,
                             batch_first=batch_first,
                             bidirectional=bidirectional,
                             dropout=dropout)
        self.atten1 = Attention(hidden_size * 2, batch_first=batch_first)  # 2 is bidrectional
        self.lstm2 = nn.LSTM(input_size=hidden_size * 2,
                             hidden_size=hidden_size,
                             num_layers=num_layers,
                             batch_first=batch_first,
                             bidirectional=bidirectional,
                             dropout=dropout)
        self.atten2 = Attention(hidden_size * 2, batch_first=batch_first)

    def forward(self, x, x_len):
        x = nn.utils.rnn.pack_padded_sequence(x, x_len, batch_first=True)
        out1, (h_n, c_n) = self.lstm1(x)
        x, lengths = nn.utils.rnn.pad_packed_sequence(out1, batch_first=True)
        x, att1 = self.atten1(x, lengths)  # skip connect

        out2, (h_n, c_n) = self.lstm2(out1)
        y, lengths = nn.utils.rnn.pad_packed_sequence(out2, batch_first=True)
        y, att2 = self.atten2(y, lengths)

        z = torch.cat([x, y], dim=1)
        return z