1D HOG features for video classification

I need help designing a model for video classification (action recognition) based on handcrafted HOG features. I am using the HAR-Up dataset. The idea is to use an LSTM, or a CNN followed by fully connected (FC) layers, that models the temporal dimension and takes a 1D HOG feature vector for each frame as input. After that, I plan to keep only the flattened FC layer output so that it can be connected to another stream.
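To make the idea concrete, here is a minimal NumPy sketch of the LSTM variant: each video is a sequence of per-frame 1D HOG vectors, a single LSTM layer reads the sequence, and the final hidden state is used as the fixed-length vector that gets concatenated with the other stream. Everything in it is an assumption for illustration only: the sizes (16 frames, 1764-dimensional HOG vectors, 64 hidden units, a 32-dimensional second stream), the helper names `init_lstm` and `lstm_embed`, and the random, untrained weights. It shows shapes and data flow, not a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(input_dim, hidden_dim, seed=0):
    """Random (untrained) LSTM weights; hypothetical helper for this sketch."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(hidden_dim)
    return {
        "W": rng.uniform(-scale, scale, (4 * hidden_dim, input_dim)),   # input weights
        "U": rng.uniform(-scale, scale, (4 * hidden_dim, hidden_dim)),  # recurrent weights
        "b": np.zeros(4 * hidden_dim),
    }

def lstm_embed(frames, params):
    """Run an LSTM over frames of shape (T, D) and return the last hidden state (H,)."""
    hidden_dim = params["U"].shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x in frames:  # one 1D HOG vector per frame
        z = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, g, o = np.split(z, 4)  # input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

# Assumed sizes: T frames per clip, D HOG dimensions per frame, H hidden units.
T, D, H = 16, 1764, 64
hog_sequence = np.random.default_rng(1).random((T, D))  # stand-in for real HOG features
params = init_lstm(D, H)

hog_embedding = lstm_embed(hog_sequence, params)  # shape (H,)
other_stream = np.zeros(32)                       # placeholder for the second stream
fused = np.concatenate([hog_embedding, other_stream])
print(hog_embedding.shape, fused.shape)
```

In a deep learning framework the loop would be replaced by the framework's own LSTM layer, and the concatenated vector would feed the shared classification layers.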