Hi,
I couldn’t find this path in my environment. The only related thing I found is in functional.py:
lib/python3.10/site-packages/torch/nn/functional.py
and the code looks like this:
scaled_dot_product_attention = _add_docstr(
    torch._C._nn.scaled_dot_product_attention, r"""
scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False) -> Tensor:
Computes scaled dot product attention on query, key and value tensors, using
an optional attention mask if passed, and applying dropout if a probability
greater than 0.0 is specified.
.. code-block:: python
# Efficient implementation equivalent to the following:
attn_mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=0) if is_causal else attn_mask
attn_mask = attn_mask.masked_fill(not attn_mask, -float('inf')) if attn_mask.dtype==torch.bool else attn_mask
attn_weight = torch.softmax((Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))) + attn_mask, dim=-1)
attn_weight = torch.dropout(attn_weight, dropout_p)
return attn_weight @ V
.. warning:: This function is beta and subject to change.
Note:
There are currently three supported implementations of scaled dot product attention:
- `FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness`_
- `Memory-Efficient Attention`_
- A PyTorch implementation defined in C++ matching the above formulation
The function may call optimized kernels for improved performance when using the CUDA backend.
For all other backends, the PyTorch implementation will be used.
All implementations are enabled by default. Scaled dot product attention attempts to automatically select the
most optimal implementation based on the inputs. In order to provide more fine-grained control over what implementation
is used, the following functions are provided for enabling and disabling implementations.
The context manager is the preferred mechanism:
- :func:`torch.backends.cuda.sdp_kernel`: A context manager used to enable/disable any of the implementations.
- :func:`torch.backends.cuda.enable_flash_sdp`: Enables or Disables FlashAttention.
- :func:`torch.backends.cuda.enable_mem_efficient_sdp`: Enables or Disables Memory-Efficient Attention.
- :func:`torch.backends.cuda.enable_math_sdp`: Enables or Disables the PyTorch C++ implementation.
Each of the fused kernels has specific input limitations. If the user requires the use of a specific fused implementation,
disable the PyTorch C++ implementation using :func:`torch.backends.cuda.sdp_kernel`.
In the event that a fused implementation is not available, an error will be raised with the
reasons why the fused implementation cannot run.
Due to the nature of fusing floating point operations, the output of this function may be different
depending on what backend kernel is chosen.
The c++ implementation supports torch.float64 and can be used when higher precision is required.
For more information please see :doc:`/notes/numerical_accuracy`
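To make the backend-selection part of the docstring concrete, here is a minimal sketch that restricts dispatch to the PyTorch C++ ("math") implementation and runs it in torch.float64. The helper name math_only_attention is hypothetical, and I'm assuming a PyTorch 2.x install where torch.backends.cuda.sdp_kernel is still available; as far as I know, newer releases deprecate it in favor of torch.nn.attention.sdpa_kernel, so a warning may be printed.

```python
import torch
import torch.nn.functional as F


def math_only_attention(query, key, value, is_causal=False):
    """Hypothetical helper: run SDPA with only the math (C++) backend enabled."""
    # Disable both fused kernels so that only the math implementation can be chosen.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=False, enable_math=True, enable_mem_efficient=False
    ):
        return F.scaled_dot_product_attention(query, key, value, is_causal=is_causal)


# float64 inputs, which the docstring says the C++ implementation supports
q = torch.randn(1, 2, 4, 8, dtype=torch.float64)
k = torch.randn(1, 2, 4, 8, dtype=torch.float64)
v = torch.randn(1, 2, 4, 8, dtype=torch.float64)

out = math_only_attention(q, k, v, is_causal=True)
print(out.dtype, out.shape)
```

To force a fused kernel instead, the same context manager can be called with enable_math=False on CUDA tensors; per the docstring, an error is raised if the chosen fused implementation cannot handle the inputs.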
You can find the source code on GitHub by following the link I’ve posted.
To modify it, you would need to clone the repository with git, edit the corresponding files, and build PyTorch from source.
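If the goal is only to experiment with the math rather than change the fused kernels, a pure-Python version of the pseudocode from the docstring may be enough and needs no rebuild. The sketch below is my own adaptation, not the library's code. The function name sdpa_reference is hypothetical. I use ~attn_mask because Python's `not` raises an error on a multi-element tensor, I apply the boolean mask directly to the scores, and I call F.dropout because torch.dropout also requires a train argument.

```python
import math

import torch
import torch.nn.functional as F


def sdpa_reference(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False):
    """Hypothetical, unfused reference for scaled dot product attention."""
    L, S = query.size(-2), key.size(-2)
    # Scaled similarity scores, shape (..., L, S)
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
    if is_causal:
        # Lower-triangular mask: position i may attend to positions <= i
        attn_mask = torch.ones(L, S, dtype=torch.bool, device=query.device).tril(diagonal=0)
    if attn_mask is not None:
        if attn_mask.dtype == torch.bool:
            # True means "keep"; masked-out positions get -inf before the softmax
            scores = scores.masked_fill(~attn_mask, float("-inf"))
        else:
            # Floating-point masks are added to the scores
            scores = scores + attn_mask
    attn_weight = torch.softmax(scores, dim=-1)
    attn_weight = F.dropout(attn_weight, p=dropout_p, training=True)
    return attn_weight @ value


torch.manual_seed(0)
q = torch.randn(2, 4, 8)
k = torch.randn(2, 4, 8)
v = torch.randn(2, 4, 8)
out = sdpa_reference(q, k, v, is_causal=True)
print(out.shape)
```

With dropout_p=0.0 the result should agree with F.scaled_dot_product_attention up to floating-point differences, so it can serve as a starting point for modifications that are later ported to the C++ side.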