What is the best attention calculation formula in NLP?

Has anyone done a similar study or comparison? Pointers to related papers are welcome.