Hello. I am trying to implement the following two features and am curious whether they are possible with FlexAttention.
- I would like to use a different value for softmax numerical stabilization. Currently, softmax computes exp(value - max_value) for numerical stability, but this is inefficient because it requires two passes over the scores: one to find the max value and one to compute the sum. If max_value could instead be set to a constant, the implementation would be simpler, and combined with tanh soft clipping there would also be no risk of overflow (see the sketch after this list).
- Would it be possible to implement Softmax1 from "Attention Is Off By One"? I am aware it does not have much effect in practice, but I dislike using an equation I consider incorrect (see the second sketch below for what I mean).
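For reference, the tanh soft clipping part already fits FlexAttention's `score_mod` hook; here is a minimal sketch of how I do it today (the cap value of 20.0 is just a placeholder). The missing piece would be a way to tell the kernel to use a fixed stabilizer instead of the row max.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

SOFT_CAP = 20.0  # placeholder cap; bounds every score to (-SOFT_CAP, SOFT_CAP)

def tanh_soft_cap(score, b, h, q_idx, kv_idx):
    # With scores bounded like this, exp(score - constant) could never
    # overflow, so a constant stabilizer would be safe.
    return SOFT_CAP * torch.tanh(score / SOFT_CAP)

q = torch.randn(1, 8, 128, 64)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flex_attention(q, k, v, score_mod=tanh_soft_cap)
```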
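For concreteness on the second point: Softmax1 only changes the denominator from sum_j exp(x_j) to 1 + sum_j exp(x_j). One way to emulate it (my own sketch, shown with plain SDPA just to illustrate the math, not FlexAttention's API) is to append an all-zero key/value pair, so the extra logit is exactly 0 and contributes the +1 to the denominator without affecting the output. I would prefer first-class support rather than padding K/V, though.

```python
import torch
import torch.nn.functional as F

def softmax1_attention(q, k, v):
    # Append one key whose logit is 0 (zero key vector) and whose value is
    # zero: its exp(0) = 1 lands in the denominator, but its weighted value
    # adds nothing to the output, which matches Softmax1 exactly (no mask
    # handling shown here for simplicity).
    B, H, S, D = k.shape
    zero_k = torch.zeros(B, H, 1, D, dtype=k.dtype, device=k.device)
    zero_v = torch.zeros(B, H, 1, D, dtype=v.dtype, device=v.device)
    k1 = torch.cat([k, zero_k], dim=2)
    v1 = torch.cat([v, zero_v], dim=2)
    return F.scaled_dot_product_attention(q, k1, v1)
```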