Need starter code for an "image sequence" to "number sequence" LSTM

Folks, in our application we have a camera monitoring a scene (e.g., food cooking), and we want to output verdicts in real time (e.g., cooking doneness). I am looking for good starter code that I can modify and experiment with.

More details:
We have videos with N frames, where N varies from video to video, say between 1 and 1000 or so. We wish to train a model that outputs a monotonic (non-decreasing) sequence of length N, one number per video frame. For example, for N=10 it might output 0-0-0-0-0-1-1-1-2-2. WE DO NOT WANT A MODEL THAT NEEDS ALL N FRAMES UP FRONT TO PRODUCE ITS OUTPUT; it has to emit the number for each frame as that frame arrives. I have looked at resources on video tagging, action recognition in video, etc., and they do not seem relevant for that reason.
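
To make the requirement concrete, here is a minimal sketch of the kind of model described above, assuming PyTorch. It is one possible starting point, not a definitive implementation. A small per-frame CNN feeds a unidirectional LSTM, so the output for frame t depends only on frames up to t. The class name `StreamingFrameLabeler`, the helper `stream_labels`, the layer sizes, the 3 output classes, and the 64x64 dummy frames are all illustrative assumptions. The running maximum at inference is one simple way to force a non-decreasing output; it is not something the model learns.

```python
import torch
import torch.nn as nn


class StreamingFrameLabeler(nn.Module):
    """Per-frame CNN encoder + unidirectional LSTM + per-frame classifier."""

    def __init__(self, num_classes=3, feat_dim=64, hidden_dim=128):
        super().__init__()
        # Tiny illustrative encoder (assumption); a pretrained backbone
        # could be swapped in here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        # Unidirectional: the output at frame t only sees frames <= t.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames, state=None):
        # frames: (batch, time, 3, H, W); state: LSTM (h, c) tuple or None
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)
        return self.head(out), state  # logits: (batch, time, num_classes)


def stream_labels(model, frame_iter):
    """Yield one label per incoming frame, carrying the LSTM state forward."""
    model.eval()
    state, last = None, 0
    for frame in frame_iter:  # each frame: (3, H, W)
        with torch.no_grad():
            logits, state = model(frame[None, None], state)
        label = int(logits[0, -1].argmax())
        last = max(last, label)  # running max keeps the output non-decreasing
        yield last


# Training on a whole clip: one cross-entropy term per frame.
model = StreamingFrameLabeler()
clip = torch.randn(1, 10, 3, 64, 64)  # one dummy 10-frame clip
target = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1, 2, 2]])
logits, _ = model(clip)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), target.flatten())
loss.backward()

# Inference: frames arrive one at a time, one label comes out per frame.
labels = list(stream_labels(model, clip[0]))
```

With a batch size of 1, clips of different lengths N need no special handling. For larger batches, padding the clips and masking the loss (or using `torch.nn.utils.rnn.pack_padded_sequence`) is a common option.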