Which model do you recommend?

Which pre-trained model works best for binary video classification when the class depends mainly on the position of a moving target? I have 2,000 videos with 60 frames each. Alternatively, given this dataset size, would you recommend building a new CNN-GRU model from scratch instead?
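
For concreteness, here is a minimal sketch of the kind of from-scratch CNN-GRU I mean, assuming PyTorch. The class name `CnnGru`, the layer sizes, the grayscale 64×64 frames, and the single-logit output are all placeholder assumptions for illustration, not a tuned or tested design.

```python
import torch
import torch.nn as nn


class CnnGru(nn.Module):
    """Per-frame CNN encoder followed by a GRU over the frame sequence.

    Hypothetical sketch: all sizes below are illustrative assumptions.
    """

    def __init__(self, in_channels=1, feat_dim=64, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            # Pool to a coarse 4x4 grid rather than 1x1, so some
            # information about where the target is remains.
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim),
            nn.ReLU(),
        )
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # one logit for the binary label

    def forward(self, clips):
        # clips: (batch, frames, channels, height, width)
        b, t, c, h, w = clips.shape
        # Fold time into the batch dimension to encode every frame at once.
        feats = self.encoder(clips.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1)
        # last_hidden: (num_layers, batch, hidden_dim)
        _, last_hidden = self.gru(feats)
        return self.head(last_hidden[-1]).squeeze(-1)  # (batch,)


model = CnnGru()
dummy = torch.randn(2, 60, 1, 64, 64)  # 2 clips of 60 grayscale 64x64 frames
logits = model(dummy)
```

The output is one raw logit per clip, so it would pair with `nn.BCEWithLogitsLoss` during training.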