Video action recognition with a small, non-human dataset: which model would you recommend?

I’m planning to train a video action recognition network for primate videos, not human ones. My dataset will be small compared to the huge human-centric datasets available for benchmarking and training. The only advantage I see is that I will have far fewer classes than the human datasets do.

My question is: of the available video action recognition models, which one do you think is a good starting point for transfer learning, given my data limitations?

Thanks