How do I define the category of this task

I have a task. I want to finish a task that automatically extracts a key frame from an ultrasound video sequence of the abdomen.
The feature in the key frame is obvious where the bladder on that frame has a distinct balloon on it because of the injection of water. The video is a real-time image of the ultrasound probe moving around all the time.
I’ve been working on static medical images. I don’t know how to define this task.
I have several ideas:

  1. Frame classification. Label a video into two categories: 0,1. The target frame is marked as one, and the rest is marked as zero. The possible problem is that the sample is extremely unbalanced.
    2.Two stage strategy. At first, I design a Deep network to detect key frames, which can also be seen as a selection of the proposal frames. Then, a Judge network can be added as follows to classify whether it is the target key frame.
    3.End-to-end framework. The input is a video sequence and the output is the key frame. (MANY-TO-ONE). But it seems that this problem is more common in natural language processing which involves RNN.

In summary, I reviewed the video process based on DL. I am confused about how to define my task. Who can help me to analyze?

PS: The network should be designed to input real-time video images, so may the essence of the image processing problem??