Video frame recommendation

I'm currently working on video understanding.
My goal is to recommend video frames from a set of videos.
For example, the input consists of 8 videos with 64 frames each.
So my input is [64 frames x 8 videos].
The ground-truth labels look like this, e.g. [0, 0, 0, 0, … 1, 1, 1, 1].
0 means the first video and 1 means the second video.
The neural network recommends video frames. Its output has shape [batch_size, 64, 8], where 64 is the number of frames and 8 is the number of input videos.
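To make the shapes concrete, here is a minimal sketch with dummy tensors. It assumes PyTorch, which the post does not state. It also assumes there is one label per frame position, so the labels have shape [batch_size, 64]. The variable names, the batch size of 2, and the half-zeros/half-ones label pattern are only illustrative.

```python
import torch

# Illustrative sizes; batch_size = 2 is an arbitrary assumption.
batch_size, num_frames, num_videos = 2, 64, 8

# Network output: one score per candidate video for every frame position.
outputs = torch.randn(batch_size, num_frames, num_videos)

# Labels: for every frame position, the index (0..7) of the video to pick.
# Assumed pattern: first half points to video 0, second half to video 1.
labels_one_sample = torch.cat([
    torch.zeros(num_frames // 2, dtype=torch.long),
    torch.ones(num_frames // 2, dtype=torch.long),
])
labels = labels_one_sample.unsqueeze(0).repeat(batch_size, 1)

print(outputs.shape)  # shape [batch_size, 64, 8]
print(labels.shape)   # shape [batch_size, 64]
```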
After that, I want to compute a loss between the outputs and the labels.
But I can't apply cross-entropy loss directly in my setup.
How can I use cross-entropy loss for this task?
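A minimal sketch of one possible way to do this, assuming PyTorch, assuming the outputs are raw (unnormalized) scores, and assuming the labels hold one video index per frame position with shape [batch_size, 64]. `torch.nn.functional.cross_entropy` expects the class dimension at position 1, so the [batch_size, 64, 8] output either has to be flattened to [batch_size * 64, 8] or permuted to [batch_size, 8, 64]. The tensors below are random placeholders, not real data.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes; batch_size = 2 is an arbitrary assumption.
batch_size, num_frames, num_videos = 2, 64, 8

# Placeholder network output (raw scores) and per-frame video indices.
logits = torch.randn(batch_size, num_frames, num_videos)
labels = torch.randint(0, num_videos, (batch_size, num_frames))

# Option A: treat every frame position as an independent sample.
# logits -> [batch_size * 64, 8], labels -> [batch_size * 64]
loss_flat = F.cross_entropy(
    logits.reshape(-1, num_videos),
    labels.reshape(-1),
)

# Option B: move the class dimension (8 videos) to position 1.
# logits -> [batch_size, 8, 64], labels stay [batch_size, 64]
loss_perm = F.cross_entropy(logits.permute(0, 2, 1), labels)

# With the default mean reduction, both options average over the same
# batch_size * 64 terms, so they should agree up to floating-point error.
print(loss_flat.item(), loss_perm.item())
```

Both options compute the same quantity; the flattened form is often easier to combine with masking of individual frame positions, while the permuted form keeps the batch structure intact.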