Supervised classification of OpenFace feature vectors with LSTM

I am quite new to PyTorch and even newer to LSTMs. I have the following problem setup:

  1. I have the features output by OpenFace 2.0 (facial landmarks, 3D gaze, gaze direction, AUs, etc.) as a NumPy array for each frame of a short video clip, for around 1500 video clips.

  2. I also have a label between 1 and 8 for each video clip, and correspondingly for each frame mentioned in 1.

  3. I want to use an LSTM with a cross-entropy loss between the class the LSTM predicts for future frames and the ground-truth label I have for each frame.

  4. I am looking for a starter tutorial (perhaps in Google Colab or a JupyterLab/IPython notebook) that can give me a good start on an LSTM used for a similar problem. I see many tutorials using LSTMs for text or other domains that are different from my setting. In my setting, the input to each LSTM cell is a feature vector in the form of a NumPy array.

  5. It is possibly too early to ask this, but if anyone has worked with feeding OpenFace 2.0 extracted features to LSTM (or CNN) cells, what sort of feature normalization do you suggest?
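
To make the normalization question in item 5 concrete, here is a minimal sketch of one common baseline: per-feature standardization (z-scoring) with statistics computed over the training clips only. This is a generic choice, not something specific to OpenFace, and the function names `fit_feature_stats` and `normalize_clip` are hypothetical. It assumes each clip is a NumPy array of shape `(num_frames, num_features)`; random data stands in for the real features.

```python
import numpy as np

def fit_feature_stats(train_clips):
    """Compute per-feature mean and std over all frames of the training clips."""
    frames = np.concatenate(train_clips, axis=0)  # (total_frames, num_features)
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    std[std < 1e-8] = 1.0  # leave (near-)constant features unscaled
    return mean, std

def normalize_clip(clip, mean, std):
    """Standardize one clip with statistics fitted on the training set."""
    return (clip - mean) / std

# Synthetic stand-in for OpenFace output: 3 clips of different lengths,
# 5 features per frame (the real feature count is an assumption here).
rng = np.random.default_rng(0)
train_clips = [rng.normal(loc=10.0, scale=3.0, size=(n, 5)) for n in (20, 35, 50)]

mean, std = fit_feature_stats(train_clips)
normalized = [normalize_clip(c, mean, std) for c in train_clips]
```

The same `mean` and `std` would then be applied to validation and test clips, so that no statistics leak from held-out data.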

Thanks a lot for your patience and guidance.

Here is a high-level schematic of the task at hand. Please let me know whether the overall task makes sense, or whether I should use a different methodology.
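
For reference, the setup in items 1 to 3 could be sketched in PyTorch roughly as follows. This is only an illustration under several assumptions: the class name `ClipLSTMClassifier`, the hidden size, the feature count of 20, and the fixed clip length of 30 frames are all made up, random tensors stand in for the OpenFace arrays, and the model emits one prediction per frame based on the frames seen so far. Labels 1-8 are shifted to 0-7 because `nn.CrossEntropyLoss` expects zero-based class indices.

```python
import torch
import torch.nn as nn

class ClipLSTMClassifier(nn.Module):
    """LSTM over per-frame feature vectors, with a linear head giving per-frame logits."""

    def __init__(self, num_features, hidden_size=64, num_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(num_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, frames, num_features)
        out, _ = self.lstm(x)   # (batch, frames, hidden_size)
        return self.head(out)   # (batch, frames, num_classes)

torch.manual_seed(0)
batch, frames, num_features, num_classes = 4, 30, 20, 8

# Stand-in for real data; with NumPy arrays one would use
# torch.from_numpy(features).float() instead.
x = torch.randn(batch, frames, num_features)

# One label in 1..8 per clip, repeated for every frame of that clip.
clip_labels = torch.randint(1, 9, (batch, 1))
targets = clip_labels.expand(batch, frames) - 1  # shift to 0..7

model = ClipLSTMClassifier(num_features)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step: flatten (batch, frames) so each frame is one sample.
logits = model(x)
loss = criterion(logits.reshape(-1, num_classes), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Real clips will usually differ in length, so batching would additionally need padding or packed sequences (for example `torch.nn.utils.rnn.pad_sequence`), which this sketch leaves out.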