Using keypoints for time-series problems

What I would like to do is predict future actions of pedestrians over time. I am using the Keypoint R-CNN model to generate the keypoints, which I will use to build a skeleton. Based on the evolution of the skeleton over time, I would like to predict what the pedestrian may do after a number of frames.
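For reference, this is roughly the keypoint-extraction step I have in mind per frame (the dummy frames and the 0.9 score threshold are just placeholders):

```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

# Keypoint R-CNN returns, for each image, a dict with 'boxes', 'scores', and
# 'keypoints' of shape [num_people, 17, 3] (x, y, visibility) for the 17 COCO
# keypoints of each detected person.
model = keypointrcnn_resnet50_fpn(pretrained=True).eval()

frames = [torch.rand(3, 480, 640) for _ in range(8)]  # stand-in for video frames

with torch.no_grad():
    outputs = model(frames)

skeletons = []
for out in outputs:
    keep = out["scores"] > 0.9                 # keep confident person detections
    skeletons.append(out["keypoints"][keep])   # [num_people, 17, 3] per frame
```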

Where I am stuck is how to use the images and keypoints together to do this. How would I feed a model both the images and the keypoints in a way that lets it learn the evolution of the skeleton over time? I believe I would be using 3D conv nets, but I am unsure of the best way to feed them the data.
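One arrangement I have been considering looks like the sketch below (layer sizes, branch choices, and the LSTM for the keypoint sequence are all my own assumptions, not something I have validated): a small 3D CNN over the stacked frames, a recurrent branch over the flattened keypoint sequence, and a classifier on the concatenated summaries.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, num_actions=5, num_keypoints=17):
        super().__init__()
        # video branch: 3D conv over (N, C, T, H, W) clips
        self.video_branch = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # -> (N, 16, 1, 1, 1)
            nn.Flatten(),              # -> (N, 16)
        )
        # keypoint branch: sequence of per-frame (x, y, visibility) triples
        self.keypoint_branch = nn.LSTM(
            input_size=num_keypoints * 3, hidden_size=32, batch_first=True
        )
        self.classifier = nn.Linear(16 + 32, num_actions)

    def forward(self, frames, keypoints):
        # frames:    (N, 3, T, H, W) clip of stacked RGB frames
        # keypoints: (N, T, 17, 3) keypoints for one person per frame
        vid = self.video_branch(frames)
        kp = keypoints.flatten(2)                # -> (N, T, 51)
        _, (h, _) = self.keypoint_branch(kp)     # h: (1, N, 32)
        fused = torch.cat([vid, h[-1]], dim=1)   # -> (N, 48)
        return self.classifier(fused)
```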

Do I extract features from a number of concatenated images, fuse those features with the concatenated keypoints for those images, and then feed the result into a classifier that learns the actions? Can I use torchvision.ops.MultiScaleRoIAlign to downsample the image features with respect to the predicted keypoints (I hope that is how pooling with MultiScaleRoIAlign works)?
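From reading the docs, my understanding is that MultiScaleRoIAlign pools a fixed-size patch per *box* (choosing the FPN level from the box scale), not per keypoint, so I would pass it the person boxes predicted alongside the keypoints. Something like the snippet below, where the feature-map names and shapes are made up for illustration:

```python
import torch
from collections import OrderedDict
from torchvision.ops import MultiScaleRoIAlign

# Pools a 7x7 feature patch for every box, selecting the feature level by box size.
roi_pool = MultiScaleRoIAlign(featmap_names=["feat1", "feat2"],
                              output_size=7, sampling_ratio=2)

features = OrderedDict(
    feat1=torch.rand(1, 256, 64, 64),   # higher-resolution feature level
    feat2=torch.rand(1, 256, 32, 32),   # lower-resolution feature level
)
person_boxes = [torch.tensor([[ 50.,  40., 120., 220.],
                              [200., 100., 260., 300.]])]  # boxes per image
image_shapes = [(512, 512)]             # original (H, W) of each image

pooled = roi_pool(features, person_boxes, image_shapes)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> one patch per person box
```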

Thanks in advance.