I can't find the source code of _scaled_dot_product_attention

Hi,

I want to get the gradient of the attention map.
So I tried to register a hook on torch.nn.functional._scaled_dot_product_attention.

However, I cannot find the source code of _scaled_dot_product_attention anywhere on the PyTorch GitHub.

Where can I find it?

Thank you, and Happy New Year!
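
A minimal sketch of an alternative, for reference: instead of hooking the built-in op, the attention weights can be recomputed in Python (the same formulation quoted from the docstring later in this thread) and a tensor hook registered on them. The shapes below are placeholders, not anything from this thread.

import math
import torch

# Toy shapes (batch, heads, seq_len, head_dim); placeholders for illustration only.
q = torch.randn(2, 4, 8, 16, requires_grad=True)
k = torch.randn(2, 4, 8, 16, requires_grad=True)
v = torch.randn(2, 4, 8, 16, requires_grad=True)

# Recompute the attention map explicitly instead of calling the fused op.
attn_weight = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)

grads = {}
def save_grad(g):
    # Called during backward() with the gradient w.r.t. attn_weight.
    grads["attn"] = g

attn_weight.register_hook(save_grad)

out = attn_weight @ v
out.sum().backward()
print(grads["attn"].shape)  # same shape as the attention map: (2, 4, 8, 8)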

You can find the source code here.

I tried to find the source code on the Python side.

But I missed the C++-based code.

Thank you!!!

Hi,
I couldn’t find this path in my environment; the only thing I found is in functional.py:
lib/python3.10/site-packages/torch/nn/functional.py
and the code looks like this:
scaled_dot_product_attention = _add_docstr(
torch._C._nn.scaled_dot_product_attention, r"""
scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False) -> Tensor:

Computes scaled dot product attention on query, key and value tensors, using
an optional attention mask if passed, and applying dropout if a probability
greater than 0.0 is specified.

.. code-block:: python

# Efficient implementation equivalent to the following:
attn_mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=0) if is_causal else attn_mask
attn_mask = attn_mask.masked_fill(not attn_mask, -float('inf')) if attn_mask.dtype==torch.bool else attn_mask
attn_weight = torch.softmax((Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))) + attn_mask, dim=-1)
attn_weight = torch.dropout(attn_weight, dropout_p)
return attn_weight @ V

.. warning:: This function is beta and subject to change.

Note:

There are currently three supported implementations of scaled dot product attention:

    - `FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness`_
    - `Memory-Efficient Attention`_
    - A PyTorch implementation defined in C++ matching the above formulation

The function may call optimized kernels for improved performance when using the CUDA backend.
For all other backends, the PyTorch implementation will be used.

All implementations are enabled by default. Scaled dot product attention attempts to automatically select the
most optimal implementation based on the inputs. In order to provide more fine-grained control over what implementation
is used, the following functions are provided for enabling and disabling implementations.
The context manager is the preferred mechanism:

    - :func:`torch.backends.cuda.sdp_kernel`: A context manager used to enable/disable any of the implementations.
    - :func:`torch.backends.cuda.enable_flash_sdp`: Enables or Disables FlashAttention.
    - :func:`torch.backends.cuda.enable_mem_efficient_sdp`: Enables or Disables Memory-Efficient Attention.
    - :func:`torch.backends.cuda.enable_math_sdp`: Enables or Disables the PyTorch C++ implementation.

Each of the fused kernels has specific input limitations. If the user requires the use of a specific fused implementation,
disable the PyTorch C++ implementation using :func:`torch.backends.cuda.sdp_kernel`.
In the event that a fused implementation is not available, an error will be raised with the
reasons why the fused implementation cannot run.

Due to the nature of fusing floating point operations, the output of this function may be different
depending on what backend kernel is chosen.
The c++ implementation supports torch.float64 and can be used when higher precision is required.
For more information please see :doc:`/notes/numerical_accuracy`

Note:
{cudnn_reproducibility_note}
""".format(**reproducibility_notes)
+ r"""
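
A small usage sketch of the torch.backends.cuda.sdp_kernel context manager mentioned in the docstring above; the shapes are arbitrary, and a CUDA device is assumed since these flags only affect the CUDA backend.

import torch
import torch.nn.functional as F

q = torch.randn(2, 4, 8, 16, device="cuda")
k = torch.randn(2, 4, 8, 16, device="cuda")
v = torch.randn(2, 4, 8, 16, device="cuda")

# Disable FlashAttention and the memory-efficient kernel so the C++ "math"
# implementation (the formulation shown in the docstring above) is used.
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True):
    out = F.scaled_dot_product_attention(q, k, v)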

The binaries do not ship with C++ code as it’s compiled into the libtorch_*.so libs.

Where is the C++ code to customize it? I couldn’t find attention.cpp.
Thanks!

You can find the source code on GitHub by following the link I’ve posted.
To change it, you would need to git clone the repository, modify the corresponding files, and build PyTorch from source.
