How Spatial Reduction Attention works in Pyramid Vision Transformers

I am trying to understand how spatial-reduction attention (SRA) works, but I could not find any source that explains it. You can see the formulas below.
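For reference, these are the SRA equations as given in Section 3.2 of the paper ($R_i$ is the reduction ratio of stage $i$, $N_i$ the number of heads):

$$\mathrm{SRA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_0, \ldots, \mathrm{head}_{N_i})\,W^O$$
$$\mathrm{head}_j = \mathrm{Attention}\big(Q W_j^Q,\ \mathrm{SR}(K)\,W_j^K,\ \mathrm{SR}(V)\,W_j^V\big)$$
$$\mathrm{SR}(\mathbf{x}) = \mathrm{Norm}\big(\mathrm{Reshape}(\mathbf{x}, R_i)\,W^S\big)$$
$$\mathrm{Attention}(\mathbf{q}, \mathbf{k}, \mathbf{v}) = \mathrm{Softmax}\!\left(\frac{\mathbf{q}\mathbf{k}^\top}{\sqrt{d_{\mathrm{head}}}}\right)\mathbf{v}$$

where $\mathrm{Reshape}(\mathbf{x}, R_i)$ reshapes the input $\mathbf{x} \in \mathbb{R}^{(H_i W_i) \times C_i}$ to shape $\frac{H_i W_i}{R_i^2} \times (R_i^2 C_i)$, $W^S \in \mathbb{R}^{(R_i^2 C_i) \times C_i}$ is a linear projection, and $\mathrm{Norm}$ is layer normalization.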

The related paper is here: https://arxiv.org/pdf/2102.12122.pdf

Can someone describe how it works? An example would be great.

I think it works like Linformer; you can see that explained here: Linformer: Self-Attention with Linear Complexity (Paper Explained) - YouTube. Below is how I currently picture it in code.
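This is only a minimal PyTorch sketch of my current understanding: the queries attend over keys/values whose spatial size has been reduced by a factor of sr_ratio in each dimension. Using a strided Conv2d for the reduction (instead of the paper's reshape + linear projection) and all the names here are my own assumptions, so please correct me if this is not what the paper means:

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Sketch of PVT-style spatial-reduction attention (my reading, not official code)."""
    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Shrink the K/V feature map by sr_ratio in each spatial dimension,
            # so attention is computed against (H*W)/sr_ratio^2 keys instead of H*W.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W tokens
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        if self.sr_ratio > 1:
            # Tokens -> feature map -> downsample -> tokens again.
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)
        else:
            x_ = x

        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N_reduced, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N_reduced)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Toy usage: 16x16 feature map, 64 channels, reduction ratio 4.
x = torch.randn(2, 16 * 16, 64)
sra = SpatialReductionAttention(dim=64, num_heads=4, sr_ratio=4)
print(sra(x, 16, 16).shape)  # torch.Size([2, 256, 64])
```

If this is right, the queries still come from all H*W positions, and only K and V are spatially reduced, which is where the complexity saving comes from. Is that the correct picture, or does it work differently from Linformer?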