Cross-attention problem

Hello everyone,
Sorry to disturb you. When I used cross-attention, I found that all of the scores are 1. To be honest, I suspect the cross-attention layer I constructed is wrong. Below are the code, the model summary, and the scores. Could you please give me some advice on how to address this problem?
Thanks,
best wishes

import math
import torch
from torch import nn

class DotProductAttention(nn.Module):
    def __init__(self, key_size, num_hiddens, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.key_size = key_size
        self.num_hiddens = num_hiddens
        self.dropout = nn.Dropout(dropout)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)

    def forward(self, queries, keys, values):
        d = queries.shape[-1]
        # Project the queries so their last dimension matches the keys
        queries = self.W_k(queries)

        # Swap the last two dimensions of `keys` for the batched matmul
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        scores = nn.functional.softmax(scores, dim=-1)
        print('scores', scores.shape, scores)

        return torch.bmm(self.dropout(scores), values)

if __name__ == '__main__':

    X = torch.normal(0, 1, (14, 1, 384))
    Y = torch.normal(0, 1, (14, 1, 1024))
    print('X', X)
    print('Y', Y)

    # hyperparameters
    key_size = 384
    num_hiddens = 1024
    dropout = 0.5
    model = DotProductAttention(key_size, num_hiddens, dropout)
    print(model)
    output = model(X, Y, Y)
    print('output shape: ', output.shape)

Model summary and scores:

DotProductAttention(
(dropout): Dropout(p=0.5, inplace=False)
(W_k): Linear(in_features=384, out_features=1024, bias=False)
)
scores torch.Size([14, 1, 1]) tensor([[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]],
[[1.]]], grad_fn=)
output shape: torch.Size([14, 1, 1024])

By the way: in X = torch.normal(0, 1, (14, 1, 384)), I mean the shape as (14: seq_length, 1: batch_size, 384: feature dimension). How can I fix this? Thanks, best wishes

The usual order is Batch x Sequence Length x Feature Dimension.

The way you are defining X and Y, it looks like you have 14 batches, each with only one element.

So the attention of one element with respect to only that same element will always return 1, since there are no other elements to compare it to (because of the softmax normalization; without it you would get the squared values instead, but still the same score for every position).
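You can check this in isolation; a tiny standalone snippet (plain PyTorch, nothing from your model) shows that a softmax over a size-1 dimension always returns 1:

import torch

# One query attending to a single key: the softmax is over a dimension of size 1,
# so every score normalizes to exactly 1.0 no matter what the logits are.
scores = torch.randn(14, 1, 1)            # same shape as the scores you printed
print(torch.softmax(scores, dim=-1))      # all entries are 1.0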

I have not tried it but I think this should work:

#                       B, Seq, FeatDim
X = torch.normal(0, 1, (1,  14,      384))
Y = torch.normal(0, 1, (1,  14,    1024))

Hope this helps :smile:

Thanks. To be honest, my batch size is 1, my sequence length is 14, and my feature dimension is 384.
However, I tried your way of swapping the dimensions, and miraculously it does print different scores. So the fundamental question is this:
if the batch size is 1, the sequence length is 14, and the feature dimension is 384, how should I fix it? I am grateful for your help. Thanks, best wishes

B, Seq, FeatDim

X = torch.normal(0, 1, (1, 14, 384))
Y = torch.normal(0, 1, (1, 14, 1024))

Sorry I do not understand the question.

Batch = 1
Sequence Length = 14
Feature Dimension = 384

Then the correct order is like the one you posted here.

What do you need to fix?

Sorry, I am a complete beginner. I am grateful for your detailed guidance.

You said the input has 14 batches, so I think there may be a misunderstanding. In my data, the input of size (14, 1, 384) is (seq_length, batch_size, feature_dimension).
I just used one sample to check whether the cross-attention works.

By the way, the standard definition of cross-attention requires three different weight layers (query, key, and value). However, in my model I only define the query weight and am missing the key and value weight layers. Is that true?
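To make my question concrete, here is a rough sketch of what I think the standard version would look like (my own untested attempt; the class name and the W_q / W_k / W_v layers are just my guesses, and I assume batch-first inputs):

import math
import torch
from torch import nn

class CrossAttention(nn.Module):
    """Sketch of cross-attention with separate query/key/value projections."""
    def __init__(self, query_size, key_size, num_hiddens, dropout):
        super().__init__()
        self.W_q = nn.Linear(query_size, num_hiddens, bias=False)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)
        self.W_v = nn.Linear(key_size, num_hiddens, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values):
        # Inputs assumed batch-first: (batch, seq, feature)
        q = self.W_q(queries)
        k = self.W_k(keys)
        v = self.W_v(values)
        d = q.shape[-1]
        scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(d)
        attn = torch.softmax(scores, dim=-1)
        return torch.bmm(self.dropout(attn), v)

With X of shape (1, 14, 384) and Y of shape (1, 14, 1024), CrossAttention(384, 1024, 1024, 0.5)(X, Y, Y) should return a (1, 14, 1024) tensor, if I have understood the shapes correctly.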

Thanks
best wishes
jiachen

What I mean is that when you define the tensors like in your original post,

:x: X = torch.normal(0, 1, (14, 1, 384))
:x: Y = torch.normal(0, 1, (14, 1, 1024))

you do NOT have 1 batch; you are saying you have 14 batches.

The first number is the batch dimension, so this is wrong.

If you change them to

:heavy_check_mark: X = torch.normal(0, 1, (1, 14, 384))
:heavy_check_mark: Y = torch.normal(0, 1, (1, 14, 1024))

then you have 1 batch. This is correct: 1 batch, 14 sequence steps, 384 feature dimensions.

Keeping the batch size in the middle, as in (14, 1, 384), is not correct for this code.

It should be

  1. Batch
  2. Sequence length
  3. Feature dimension
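
If your data really arrives as (seq_length, batch_size, feature_dim), I think you can simply permute it before calling the attention; something like this (I have not run it against your model, but permute is standard PyTorch):

import torch

# Data stored as (seq_length, batch_size, feature_dim), like in your original post
X = torch.normal(0, 1, (14, 1, 384))
Y = torch.normal(0, 1, (14, 1, 1024))

# Reorder to (batch_size, seq_length, feature_dim) so torch.bmm treats dim 0 as the batch
X = X.permute(1, 0, 2)   # -> (1, 14, 384)
Y = Y.permute(1, 0, 2)   # -> (1, 14, 1024)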

I am really grateful for your detailed guidance. Thanks, best wishes.

Thank you to the PyTorch forum. It is such a nice platform.
best wishes
