At this point, I'd try reducing your dataset to a really small subset of videos / frames and using that same data as both train and test. If you can’t (over)fit that data, something is not set up properly.
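To make the sanity check concrete, here is a minimal sketch in plain Python. Everything in it is a hypothetical stand-in: the random `tiny_x` / `tiny_y` data plays the role of your tiny subset of frames, and a small softmax classifier plays the role of your actual network. The point is only the procedure: train on a handful of samples, evaluate on those same samples, and expect the loss to collapse and the accuracy to reach 100%.

```python
import math
import random

random.seed(0)

# Hypothetical stand-in for a tiny subset of frames: 9 samples, 32 features, 3 classes.
NUM_SAMPLES, NUM_FEATURES, NUM_CLASSES = 9, 32, 3
tiny_x = [[random.gauss(0.0, 1.0) for _ in range(NUM_FEATURES)] for _ in range(NUM_SAMPLES)]
tiny_y = [i % NUM_CLASSES for i in range(NUM_SAMPLES)]

# Softmax regression: one weight vector and one bias per class, initialised to zero.
weights = [[0.0] * NUM_FEATURES for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES


def predict_proba(x):
    """Return softmax class probabilities for one sample."""
    logits = [sum(w * v for w, v in zip(weights[c], x)) + biases[c] for c in range(NUM_CLASSES)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evaluate(xs, ys):
    """Return (mean cross-entropy loss, accuracy) on the given samples."""
    loss, correct = 0.0, 0
    for x, y in zip(xs, ys):
        probs = predict_proba(x)
        loss -= math.log(max(probs[y], 1e-12))
        correct += int(probs.index(max(probs)) == y)
    return loss / len(xs), correct / len(xs)


def train_step(xs, ys, lr=0.1):
    """One full-batch gradient-descent step on the cross-entropy loss."""
    grad_w = [[0.0] * NUM_FEATURES for _ in range(NUM_CLASSES)]
    grad_b = [0.0] * NUM_CLASSES
    for x, y in zip(xs, ys):
        probs = predict_proba(x)
        for c in range(NUM_CLASSES):
            err = probs[c] - (1.0 if c == y else 0.0)  # d(loss)/d(logit_c)
            grad_b[c] += err
            for j in range(NUM_FEATURES):
                grad_w[c][j] += err * x[j]
    for c in range(NUM_CLASSES):
        biases[c] -= lr * grad_b[c] / len(xs)
        for j in range(NUM_FEATURES):
            weights[c][j] -= lr * grad_w[c][j] / len(xs)


initial_loss, initial_acc = evaluate(tiny_x, tiny_y)
for _ in range(1000):
    train_step(tiny_x, tiny_y)

# Evaluating on the same samples used for training is deliberate here.
final_loss, final_acc = evaluate(tiny_x, tiny_y)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, accuracy {initial_acc:.2f} -> {final_acc:.2f}")
```

With your real model, the same loop applies: swap in your own forward pass, loss, and optimizer step, and keep the subset small enough that memorizing it should be trivial. If the training loss does not drop toward zero, look at the data pipeline, labels, and loss wiring before touching the architecture.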
Conventionally you do want to normalize the input to match the GoogleNet parameters (like here), and I’m not aware of any issues with cross-entropy loss. Regardless of the normalization, though, you should be able to overfit a small data sample.
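For reference, a rough sketch of what per-channel input normalization looks like. I'm assuming here that the parameters in question are the usual ImageNet channel means and standard deviations that pretrained torchvision models expect; if your weights came from somewhere else, the constants may differ. The helper names are made up for illustration.

```python
# Assumed ImageNet channel statistics (RGB order, for inputs scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to per-channel standardized values."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def normalize_frame(frame):
    """Normalize a frame given as rows of (R, G, B) tuples."""
    return [[normalize_pixel(pixel) for pixel in row] for row in frame]


# Hypothetical 2x2 frame, just to exercise the helpers.
frame = [[(0, 0, 0), (255, 255, 255)], [(124, 116, 104), (12, 200, 77)]]
normalized = normalize_frame(frame)
print(normalized[0][0])
```

In a real pipeline you would apply this per frame as part of the dataset transform, so that train and test inputs go through exactly the same preprocessing.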
If you can overfit the small sample but still can’t get the model to work on the larger dataset, perhaps you just don’t have enough data, the right architecture, or the right hyperparameters.