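max simply selects the greatest value and ignores the others, so it acts as the identity operation on that one element: the derivative of the output is 1 with respect to the selected input and 0 with respect to every other input. The gradient can therefore flow backwards through it, but only to that one element.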
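
As a rough illustration, here is a small sketch in plain Python that estimates the gradient of max with finite differences. The helper `numerical_grad` and the toy `scores` are made up for this example and are not from any library; a real framework would compute the same thing with autograd.

```python
def numerical_grad(f, xs, eps=1e-6):
    """Central finite-difference estimate of df/dx_i for each input i."""
    grads = []
    for i in range(len(xs)):
        up = list(xs)
        up[i] += eps
        down = list(xs)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

# Toy scores (hypothetical values); the largest one is at index 1.
scores = [0.3, 2.5, -1.0, 0.7]

# Round away floating-point noise from the finite differences.
grad = [round(g, 6) for g in numerical_grad(max, scores)]
print(grad)  # [0.0, 1.0, 0.0, 0.0]
```

Only the winning element gets a gradient of 1; the others get 0, which is what "identity for that one element" amounts to.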

We first select Y frames (i.e. keyframes) based on the prediction scores from the
decoder.

The decoder output has shape [2, 320]: a non-keyframe score and a keyframe score for each of the 320 frames. We want to derive a 0/1 vector from the decoder output, but the step from [2, 320] to a 0/1 vector does not seem to be differentiable...
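
To make the concern concrete, here is a minimal sketch in plain Python. It assumes the 0/1 vector comes from a per-frame argmax over the two scores, which may not be exactly how the paper selects its Y frames. The function `select_keyframes` and the 5-frame toy scores are hypothetical, standing in for the real [2, 320] output.

```python
def select_keyframes(scores):
    """scores[0][t] is the non-keyframe score, scores[1][t] the keyframe score.
    Returns 1 where the keyframe score is larger, else 0 (assumed argmax rule)."""
    non_key, key = scores
    return [1 if k > n else 0 for n, k in zip(non_key, key)]

# Toy [2, 5] scores instead of [2, 320] (hypothetical values).
scores = [
    [0.9, 0.2, 0.6, 0.1, 0.8],   # non-keyframe scores
    [0.1, 0.7, 0.4, 0.95, 0.3],  # keyframe scores
]
mask = select_keyframes(scores)

# Nudge every keyframe score slightly and see how the 0/1 vector responds.
eps = 1e-3
nudged = [list(scores[0]), [k + eps for k in scores[1]]]
nudged_mask = select_keyframes(nudged)
change = [(b - a) / eps for a, b in zip(mask, nudged_mask)]

print(mask)    # [0, 1, 0, 1, 0]
print(change)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The 0/1 vector is piecewise constant in the scores: small changes leave it untouched, so its gradient with respect to the scores is zero almost everywhere and undefined where the two scores tie. That is different from max, whose output value does move with the winning input.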

Also, can you explain it a little more?
Why does the fact that it is the identity operation for the max element change the situation?
And if so, why was softmax invented?