I am reviewing the API definition and documentation for "torch.nn.MultiheadAttention". It appears the forward method needs query, key, and value. Does this mean query, key, and value need to be learnt outside "torch.nn.MultiheadAttention"?
Assuming x is the input, does this mean the following is incorrect? (We are directly passing the same input x for query, key, and value instead of obtaining these embeddings separately.)
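For reference, here is a minimal sketch of the pattern I mean (the module name self_attn, batch_first=True, and the sizes are placeholders I picked, not the real code):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)  # (batch, seq_len, embed_dim)

# Same tensor passed as query, key, and value (self-attention).
attn_output, attn_weights = self_attn(x, x, x)
print(attn_output.shape)  # torch.Size([2, 10, 64])
```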
Code like your example self_attent(x, x, x) is pretty typical; the module projects q/k/v from that single input with its own learned in-projection weights, so you don't need to learn them outside. Passing distinct tensors matters in cases like cross-attention, because the query sequence doesn't have to be the same size as the key/value sequence. If you're building a decoder-only or encoder-only model, passing the same x for all three is exactly what you want.
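To make the cross-attention case concrete, here's a rough sketch (names and sizes are mine, just to illustrate the shapes): the query comes from one sequence and the key/value from another of a different length, and the module still handles all the projections internally.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

decoder_states = torch.randn(2, 7, embed_dim)   # query:     (batch, tgt_len, embed_dim)
encoder_states = torch.randn(2, 15, embed_dim)  # key/value: (batch, src_len, embed_dim)

# Cross-attention: query from one sequence, key/value from another.
out, weights = cross_attn(decoder_states, encoder_states, encoder_states)
print(out.shape)      # torch.Size([2, 7, 64])  -- follows the query length
print(weights.shape)  # torch.Size([2, 7, 15])  -- query positions x key positions
```

If the key/value features have a different dimensionality than the query, nn.MultiheadAttention also accepts kdim and vdim arguments at construction time.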
You can find more detail on how specific cases are implemented in attention.h, e.g.