How to implement action detection with RNN

I tried to implement an action detection model using an LSTM, following this figure:

image

where x is the sequence of features extracted by a CNN (in this case, ResNet50) and y is the class predicted at each time step.
I use a batch size of 1 and a learning rate of 0.00001. The target for each sequence looks like this:

[32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,  
0,  0,  0,  0, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  
0,  0,  0,  0, 7,  7,  7,  7,  7,  0,  0,  0,  0,  0,  0,  0,  0,  0,  
0,  0,  0,  0, 0,  0,  0,  0,  0,  0,  0, 36, 36, 36, 36, 36]

I compute the cross-entropy loss at every time step and average the per-step losses over the sequence.
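For reference, the computation described above (cross-entropy at each time step, then the mean over the sequence) can be sketched without any framework as follows. This is only an illustration: the function names `cross_entropy` and `sequence_loss` and the toy numbers are made up, and it is not the actual training code from this post.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one time step: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def sequence_loss(logits_seq, targets):
    """Mean of the per-time-step cross-entropy losses over one sequence."""
    assert len(logits_seq) == len(targets)
    step_losses = [cross_entropy(l, t) for l, t in zip(logits_seq, targets)]
    return sum(step_losses) / len(step_losses)

# Toy example (hypothetical numbers): 3 time steps, 4 classes.
logits_seq = [
    [2.0, 0.5, 0.1, 0.1],
    [0.2, 0.1, 3.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
]
targets = [0, 2, 1]
loss = sequence_loss(logits_seq, targets)
```

If the model is written in PyTorch (an assumption, the framework is not stated here), `nn.CrossEntropyLoss` with its default `reduction='mean'`, applied to logits of shape (T, C) and targets of shape (T,), should give the same average.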

The resulting loss graph looks unusual.

Do you have any idea what happened?
I'm also not sure that my loss computation is right.
What is the correct way to compute the loss in this case?
Is it possible that the problem comes from the loss computation?

Thank you.

Hi, can you explain more about your training data? Is each sample a single frame or a sequence of frames? What does your RNN model look like: is it many-to-one or many-to-many?

Thank you for your reply.
The training data is a sequence of frames, and the RNN model is many-to-many.
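To make the setup concrete, a many-to-many model of this kind might look roughly like the sketch below. It assumes PyTorch, and everything specific in it is a placeholder rather than the real configuration: the class name `FrameTagger`, the hidden size of 256, and the 40 classes are invented, the 2048-dimensional input assumes pooled ResNet50 features, and the random tensors stand in for real features and labels.

```python
import torch
import torch.nn as nn

class FrameTagger(nn.Module):
    """Many-to-many LSTM: one class prediction per input frame."""
    def __init__(self, feat_dim=2048, hidden_dim=256, num_classes=40):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):        # x: (batch, T, feat_dim)
        out, _ = self.lstm(x)    # out: (batch, T, hidden_dim)
        return self.head(out)    # logits: (batch, T, num_classes)

model = FrameTagger()
criterion = nn.CrossEntropyLoss()  # default reduction averages over all steps

# Stand-in data: batch size 1, 66 frames, as in the example target above.
features = torch.randn(1, 66, 2048)      # placeholder for CNN features
targets = torch.randint(0, 40, (1, 66))  # placeholder per-frame labels

logits = model(features)
# CrossEntropyLoss expects (N, C) logits and (N,) targets,
# so batch and time are flattened into one dimension.
loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
```

Flattening batch and time before the loss makes the result the mean of the per-frame cross-entropy values, which matches the averaging described in the question.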