# Merge two 2D tensors into a 3D tensor

Hi everyone,

I have two 2D tensors of shape:

``````
y = torch.Size([50, 61])   # (batch_size, max_len)
x = torch.Size([50, 800])  # (batch_size, n_lstm_units)
``````

What I would like to get is a 3D tensor shaped like this:

``````
z = (50, 61, 800)  # (batch_size, max_len, n_lstm_units)
``````

How can I do this?

I tried adding a dimension to `y` and `x` via `unsqueeze()`, which gives `(50, 61, 1)` and `(50, 1, 800)`. The problem is that I don’t know how to combine the two tensors afterwards. I don’t want to concatenate the values; I only want to “move” a dimension from one tensor to the other to obtain `(50, 61, 800)`. I hope the idea is clear.

Hello Elidor!

I’m not sure what you’re asking here.

Please note that your `x` and `y` together have `50 * (61 + 800)`
elements, while your desired `z` will have `50 * 61 * 800` elements,
a much larger number. So you can’t simply populate `z` with elements
of `x` and `y`.
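
To make the counting argument concrete, here is a quick check of the element counts. The tensors are random stand-ins with the shapes from the question; only the shapes matter here.

```python
import torch

# Stand-in tensors with the shapes from the question; the values are arbitrary.
y = torch.randn(50, 61)    # (batch_size, max_len)
x = torch.randn(50, 800)   # (batch_size, n_lstm_units)

available = x.numel() + y.numel()   # 50 * (61 + 800)
needed = 50 * 61 * 800              # number of elements in the desired z

print(available)  # 43050
print(needed)     # 2440000
```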

I have no idea if this would be what you want, but you can make a
tensor of the desired shape by taking the batch outer product:

``````
# (50, 61, 1) @ (50, 1, 800) -> (50, 61, 800)
z = torch.bmm(y.unsqueeze(2), x.unsqueeze(1))
``````

(PyTorch does have an outer-product function, `torch.outer`, formerly
`torch.ger`, but it only works on 1D vectors, hence the `unsqueeze()` above.)
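
As a sketch of what the batched outer product computes, the snippet below builds `z[b, i, j] = y[b, i] * x[b, j]` in three equivalent ways: `torch.bmm`, plain broadcasting, and `torch.einsum`. The random `x` and `y` are placeholders with the shapes from the question.

```python
import torch

# Placeholder tensors with the shapes from the question.
y = torch.randn(50, 61)    # (batch_size, max_len)
x = torch.randn(50, 800)   # (batch_size, n_lstm_units)

# Batched matrix product of a column vector and a row vector per batch item.
z_bmm = torch.bmm(y.unsqueeze(2), x.unsqueeze(1))   # (50, 61, 1) @ (50, 1, 800)

# The same values via broadcasting: (50, 61, 1) * (50, 1, 800).
z_bcast = y.unsqueeze(2) * x.unsqueeze(1)

# The same values via einsum, with the index pattern spelled out.
z_einsum = torch.einsum("bi,bj->bij", y, x)

print(z_bmm.shape)  # torch.Size([50, 61, 800])
print(torch.allclose(z_bmm, z_bcast), torch.allclose(z_bmm, z_einsum))
```

Whether an outer product is meaningful for the model is a separate question; this only shows how to get the shape.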

Best.

K. Frank

Hello @KFrank!
Thank you for your answer. Let me try to explain my problem a little better, because I may not have given enough detail.

At the beginning of my code, I have a tensor `x` with shape `torch.Size([50, 61, 140])`, i.e. `(batch_size, seq_len, embedding_dim)`, and an ndarray `x_len` with shape `(50,)`. I pass these two as input to a bilstm with the following code (I’ll spare you the full forward function, since I don’t think it’s important):

``````
x = nn.utils.rnn.pack_padded_sequence(x, x_len, batch_first=True)
out1 = self.lstm1(x)
``````
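
For reference, here is a self-contained sketch of the pack, LSTM, unpack round trip with the shapes from this thread. The hidden size of 400 (so that the bidirectional output has 800 features), the single layer, and the random lengths are assumptions for illustration, not the poster's actual configuration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Assumed sizes: hidden=400 so that 2 * hidden = 800 for a bidirectional LSTM.
batch_size, max_len, emb_dim, hidden = 50, 61, 140, 400

x = torch.randn(batch_size, max_len, emb_dim)
# Hypothetical sequence lengths (CPU int64); the first one is forced to max_len.
x_len = torch.randint(1, max_len + 1, (batch_size,))
x_len[0] = max_len

lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

# enforce_sorted=False lets the lengths come in any order.
packed = pack_padded_sequence(x, x_len, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor; total_length pins the time dimension to max_len.
out, out_len = pad_packed_sequence(packed_out, batch_first=True, total_length=max_len)

print(out.shape)      # torch.Size([50, 61, 800])
print(out_len.shape)  # torch.Size([50])
```

Note that `nn.LSTM` returns a tuple `(output, (h_n, c_n))`, and that the output stays a `PackedSequence` until `pad_packed_sequence` is called.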

After the bilstm, I have two tensors with shapes `x = ([50, 61, 800])` and `x_len = ([50])`.
I pass these two tensors to a module that computes the attention as follows (`x` corresponds to the `inputs` parameter and `x_len` to `lengths`):

``````
def forward(self, inputs, lengths):
    batch_size, max_len = inputs.size()[:2]  # batch_size = 50, max_len = 61

    # matrix mult
    # apply attention layer
    weights = torch.bmm(inputs,
                        self.att_weights  # (1, hidden_size)
                        .permute(1, 0)  # (hidden_size, 1)
                        .unsqueeze(0)  # (1, hidden_size, 1)
                        .repeat(batch_size, 1, 1)  # (batch_size, hidden_size, 1)
                        )

    # weights.shape = torch.Size([50, 61, 1])
    attentions = torch.softmax(F.relu(weights.squeeze()), dim=-1)   # torch.Size([50, 61])

    # create mask based on the sentence lengths
    # (mask lines assumed: 1 for real tokens, 0 for padding past each length)
    mask = torch.ones_like(attentions)
    for i, l in enumerate(lengths):  # only sentences shorter than max_len need masking
        if l < max_len:
            mask[i, l:] = 0

    # apply mask and renormalize attention scores (weights)
    masked = attentions * mask
    _sums = masked.sum(-1).unsqueeze(-1)  # sums per row    # torch.Size([50, 1])

    attentions = masked.div(_sums)   # torch.Size([50, 61])

    # apply attention weights
    weighted = torch.mul(inputs, attentions.unsqueeze(-1).expand_as(inputs))   # torch.Size([50, 61, 800])

    # get the final fixed vector representations of the sentences
    representations = weighted.sum(1).squeeze()   # ([50, 800])

    return representations, attentions   # representations = ([50, 800]), attentions = ([50, 61])
``````

What I would like to obtain as the output of this forward function is a single tensor with shape `([50, 61, 800])`, rather than these two separate tensors.

Do you think there is a way to do it without “affecting” the attention calculation?
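
One possible reading of this request, sketched below under assumptions: the intermediate `weighted` tensor in the forward function already has shape `(50, 61, 800)`, so the module could return it instead of summing over the sequence dimension. The function name `attend_keep_sequence`, the vectorized mask, and the random test data are hypothetical and only illustrate the idea; whether per-token weighted outputs suit the downstream model is not something this thread settles.

```python
import torch
import torch.nn.functional as F


def attend_keep_sequence(inputs, att_weights, lengths):
    """Hypothetical variant that keeps the sequence dimension.

    inputs:      (B, T, H) padded LSTM outputs
    att_weights: (1, H) attention parameter
    lengths:     (B,) number of valid time steps per sequence (each >= 1)
    """
    max_len = inputs.size(1)

    # Score each time step, as in the original: ReLU then softmax over T.
    scores = F.relu(inputs @ att_weights.t()).squeeze(-1)   # (B, T)
    attentions = torch.softmax(scores, dim=-1)              # (B, T)

    # Mask: 1 for real tokens, 0 for padding (no Python loop needed).
    positions = torch.arange(max_len, device=inputs.device).unsqueeze(0)  # (1, T)
    mask = (positions < lengths.to(inputs.device).unsqueeze(1)).to(inputs.dtype)

    # Apply the mask and renormalize so each row sums to 1 again.
    masked = attentions * mask
    attentions = masked / masked.sum(-1, keepdim=True)

    # Weight every time step but do NOT sum over T.
    weighted = inputs * attentions.unsqueeze(-1)            # (B, T, H)
    return weighted, attentions


# Placeholder data with the shapes from this thread.
inputs = torch.randn(50, 61, 800)
att_weights = torch.randn(1, 800) * 0.01
lengths = torch.randint(1, 62, (50,))

weighted, attentions = attend_keep_sequence(inputs, att_weights, lengths)
print(weighted.shape)    # torch.Size([50, 61, 800])
print(attentions.shape)  # torch.Size([50, 61])
```

The attention weights are computed exactly as before; `weighted.sum(1)` would still give the pooled `(50, 800)` representation.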

Thank you so much,

best regards.

Hello Elidor!

I don’t have any advice for you, as I don’t have any intuition about what
these values might mean.

Perhaps if you could give a very high level description of your use case,
some expert in that area might have suggestions. Are you working on
some sort of (semi-) standard model? Is this a research direction that
others might have experience with?

Let me also reiterate that, at a lower level, a core issue is the amount
of “information” you have. If you want to turn `[50, 61]` and `[50, 800]`
into `[50, 61, 800]`, you either have to get the additional information
from somewhere else, or your `[50, 61, 800]` has to encode the limited
information you have in some redundant way (such as using the outer
product I mentioned earlier).
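
To illustrate what "redundant" means here, the sketch below checks that every `61 x 800` slice of the batched outer product has matrix rank 1: each row is just a scalar multiple of the same `x[b]`. The random tensors are placeholders, and `float64` is used only to keep the numerical rank check robust.

```python
import torch

# Placeholder tensors with the shapes from the question.
y = torch.randn(50, 61, dtype=torch.float64)
x = torch.randn(50, 800, dtype=torch.float64)

# Batched outer product via broadcasting: z[b, i, j] = y[b, i] * x[b, j].
z = y.unsqueeze(2) * x.unsqueeze(1)   # (50, 61, 800)

# matrix_rank works on batches: one rank per (61, 800) slice.
ranks = torch.linalg.matrix_rank(z)   # (50,)

print(z.shape)         # torch.Size([50, 61, 800])
print(ranks.unique())  # tensor([1])
```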

Good luck.

K. Frank

> Perhaps if you could give a very high level description of your use case,
> some expert in that area might have suggestions. Are you working on
> some sort of (semi-) standard model? Is this a research direction that
> others might have experience with?

OK, that seems like a good idea, so I’ll try.
I started studying this model for dependency parsing. Its results are good:

``````
Model(
  (dropout): Dropout(p=0.6, inplace=False)
  (bilstm): LSTM(908, 600, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
  (bilstm_to_hidden1): Linear(in_features=1200, out_features=500, bias=True)
  (hidden1_to_hidden2): Linear(in_features=500, out_features=150, bias=True)
  (hidden2_to_pos): Linear(in_features=150, out_features=101, bias=True)
  (hidden2_to_dep): Linear(in_features=300, out_features=47, bias=True)
)
``````

The bilstm is applied as follows:

``````
# INPUT: x = torch.Size([50, 61, 140]) and x_lengths = (50,)
x, _ = self.bilstm(x)
# OUTPUT: x = torch.Size([50, 61, 800]) --> (batch_size, seq_len, n_lstm_units)
x = x.contiguous()
x = x.view(-1, x.shape[2])
# OUTPUT: x = torch.Size([3050, 800]) --> (batch_size * seq_len, n_lstm_units)
# etc.
``````

What I would like to do is add attention to the LSTM (bilstm), so I defined a new model.

``````
NewModel(
  THE PREVIOUS PART IS THE SAME AS THE ORIGINAL MODEL
  (bilstm): MyLSTM(
    (dropout): Dropout(p=0.3, inplace=False)
    (lstm1): LSTM(908, 600, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
    (atten1): Attention()
    (lstm2): LSTM(1200, 600, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
    (atten2): Attention()
  )
  THE NEXT PART IS THE SAME AS THE ORIGINAL MODEL
``````

At this point, my problem is to bring the output dimensions of the bilstm, after applying the attention, back to those of the original model, so that the remaining part of the model still works.
The forward pass of the new bilstm looks like this:

``````
# INPUT: x = torch.Size([50, 61, 140]) and x_len = (50,)
out1, (h_n, c_n) = self.lstm1(x)
# unpack the LSTM output (assumed step, mirroring the pack_padded_sequence call above)
x, lengths = nn.utils.rnn.pad_packed_sequence(out1, batch_first=True)
# OUTPUT: x = torch.Size([50, 61, 800]) --> (batch_size, seq_len, n_lstm_units)
# OUTPUT: lengths = torch.Size([61])
x, att1 = self.atten1(x, lengths)  # skip connect
# OUTPUT: x = torch.Size([50, 800]) and att1 = torch.Size([50, 61])

out2, (h_n, c_n) = self.lstm2(out1)
y, lengths = nn.utils.rnn.pad_packed_sequence(out2, batch_first=True)  # assumed step, as above
y, att2 = self.atten2(y, lengths)
# OUTPUT: y = torch.Size([50, 800]) and att2 = torch.Size([50, 61])

z = torch.cat([x, y], dim=1)
return z  # torch.Size([64, 1600])
``````

So at the same point, after the `forward()` of the original bilstm and after the `forward()` of the bilstm with attention, I get two different shapes: `torch.Size([3050, 800])` and `torch.Size([64, 1600])`, respectively. The difference arises because the first model flattens its output to `(batch_size * seq_len, n_lstm_units) = (50 * 61, 800)`, while in the second model the attention outputs `x, att1` and `y, att2` have shapes `(50, 800)` and `(50, 61)`, respectively.
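
The mismatch can be reproduced with shapes alone. The sketch below uses random placeholder tensors and a batch size of 50 throughout; the `[64, 1600]` reported above presumably came from a run with a different batch size, but that is a guess.

```python
import torch

batch_size, seq_len, n_units = 50, 61, 800

# Original model: one 800-dim vector per token, then flatten batch and time.
per_token = torch.randn(batch_size, seq_len, n_units)
flat = per_token.contiguous().view(-1, per_token.shape[2])
print(flat.shape)  # torch.Size([3050, 800])

# Attention model: each attention layer pools over time, giving one vector
# per sentence; concatenating two of them doubles the feature dimension.
x = torch.randn(batch_size, n_units)   # pooled output of the first attention
y = torch.randn(batch_size, n_units)   # pooled output of the second attention
z = torch.cat([x, y], dim=1)
print(z.shape)     # torch.Size([50, 1600])
```

The pooled path has lost the `seq_len` dimension, which is why the per-token layers that follow in the original model no longer fit.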

Does anyone have any idea what I might try?
Is there a “standard” way to implement a self-attention module for a bilstm? I’ll put the code I used below, in case it helps to better understand the problem.

``````
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):

    def __init__(self, hidden_size, batch_first=False):
        super(Attention, self).__init__()

        self.hidden_size = hidden_size
        self.batch_first = batch_first

        # attention parameter (assumed definition; shape taken from the comments in forward)
        self.att_weights = nn.Parameter(torch.Tensor(1, hidden_size), requires_grad=True)

        stdv = 1.0 / np.sqrt(self.hidden_size)

        for weight in self.att_weights:
            nn.init.uniform_(weight, -stdv, stdv)

    def forward(self, inputs, lengths):
        if self.batch_first:
            batch_size, max_len = inputs.size()[:2]
        else:
            max_len, batch_size = inputs.size()[:2]

        # matrix mult
        # apply attention layer
        weights = torch.bmm(inputs,
                            self.att_weights  # (1, hidden_size)
                            .permute(1, 0)  # (hidden_size, 1)
                            .unsqueeze(0)  # (1, hidden_size, 1)
                            .repeat(batch_size, 1, 1)  # (batch_size, hidden_size, 1)
                            )

        attentions = torch.softmax(F.relu(weights.squeeze()), dim=-1)

        # create mask based on the sentence lengths
        # (mask lines assumed: 1 for real tokens, 0 for padding past each length)
        mask = torch.ones_like(attentions)
        for i, l in enumerate(lengths):  # only sentences shorter than max_len need masking
            if l < max_len:
                mask[i, l:] = 0

        # apply mask and renormalize attention scores (weights)
        masked = attentions * mask
        _sums = masked.sum(-1).unsqueeze(-1)  # sums per row

        attentions = masked.div(_sums)

        # apply attention weights
        weighted = torch.mul(inputs, attentions.unsqueeze(-1).expand_as(inputs))

        # get the final fixed vector representations of the sentences
        representations = weighted.sum(1).squeeze()

        return representations, attentions


class MyLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, batch_first, bidirectional, dropout):
        super(MyLSTM, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.lstm1 = nn.LSTM(input_size=input_size,
                             hidden_size=hidden_size,
                             num_layers=num_layers,
                             batch_first=batch_first,
                             bidirectional=bidirectional,
                             dropout=dropout)
        self.atten1 = Attention(hidden_size * 2, batch_first=batch_first)  # 2 is bidirectional
        self.lstm2 = nn.LSTM(input_size=hidden_size * 2,
                             hidden_size=hidden_size,
                             num_layers=num_layers,
                             batch_first=batch_first,
                             bidirectional=bidirectional,
                             dropout=dropout)
        self.atten2 = Attention(hidden_size * 2, batch_first=batch_first)

    def forward(self, x, x_len):