LSTM video analysis

Suppose I have a video dataset containing videos of human facial expressions. Each video corresponds to one expert-labeled emotion score (a number between 0 and 10 indicating the intensity of the emotion for the whole video). I'd like to train a deep learning model on this dataset using a CNN-LSTM architecture. However, I'm unsure which time step of the LSTM output I should decode when computing the loss to optimise the model's parameters. Normally, we decode the last output of the LSTM, since the network uses past information to predict the current input. In my case, however, the label score applies to all frames of the video, so I don't think decoding only the last LSTM output is the right approach. Could you give me some suggestions for solving this problem?

The number of frames per video ranges from 150 to 700.
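For concreteness, this is roughly the setup I have in mind (just a sketch; the layer sizes, frame resolution, and names below are placeholders rather than my actual model):

import torch
import torch.nn as nn

B, N, F = 4, 300, 512                     # batch size, frames per video, CNN feature size

cnn = nn.Sequential(                      # stand-in for a real frame encoder
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, F),
)
lstm = nn.LSTM(input_size=F, hidden_size=128, batch_first=True)

frames = torch.randn(B, N, 3, 64, 64)     # dummy video batch
feats = cnn(frames.view(B * N, 3, 64, 64)).view(B, N, F)
out, (h_n, c_n) = lstm(feats)             # out has shape (B, N, 128)
# Question: which of the N time steps should be decoded to compare
# against the single per-video label of shape (B, 1)?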

You can just use the entire lstm output as the input to your loss function:

# Evaluate the batch
cnn_output = torch.cat([cnn(sequence) for sequence in images], 0)
lstm_output = lstm(cnn_output)

# Calculate the loss of the batch
loss = loss_function(lstm_output, label)

Where the return of your lstm should look like this:

# Fully connected layers
out = self.fc1(out)
out = self.relufc1(out)
out = self.fc2(out)

Instead of only taking the last output

# Fully connected layers
out = self.fc1(out[:, -1, :]) # just want last time step hidden states! 
out = self.relufc1(out)
out = self.fc2(out)
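For context, here is a minimal sketch of the kind of module those two snippets could sit in (the layer sizes and names are just placeholders, not your model):

import torch.nn as nn

class LSTMHead(nn.Module):
    # Hypothetical wrapper just to show where the two snippets above differ;
    # feature and hidden sizes are placeholders.
    def __init__(self, feat_dim=512, hidden=128, last_step_only=False):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc1 = nn.Linear(hidden, 64)
        self.relufc1 = nn.ReLU()
        self.fc2 = nn.Linear(64, 1)
        self.last_step_only = last_step_only

    def forward(self, x):                  # x: (B, N, feat_dim)
        out, _ = self.lstm(x)              # out: (B, N, hidden)
        if self.last_step_only:
            out = out[:, -1, :]            # keep only the last time step -> (B, hidden)
        out = self.fc1(out)
        out = self.relufc1(out)
        return self.fc2(out)               # (B, N, 1) or (B, 1)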

Hello @VictorownzuA11
Thanks for your suggestion, but in that case the output shape of the LSTM would be (B, N, 1), where N is the number of time steps (equivalent to the number of frames). However, the label only has the shape (B, 1), indicating the emotion score for each video in the batch (B). So we can't compute the loss between these two directly, can we?

Hey, it seems I misunderstood your question in my last reply. You have only 1 label (B, 1) for each video, but you can extend that label to the entire video (since the score is for the entire video, it should be okay to assume that the score is valid for every single frame in that video).
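In code that could look roughly like this (a sketch, assuming the lstm output is one score per time step with shape (B, N, 1)):

import torch
import torch.nn as nn

loss_function = nn.MSELoss()

B, N = 4, 300
lstm_output = torch.randn(B, N, 1)             # one predicted score per frame (placeholder)
label = torch.rand(B, 1) * 10                  # one expert score per video

# Repeat the video-level label so every frame shares the same target
frame_labels = label.unsqueeze(1).expand(B, N, 1)
loss = loss_function(lstm_output, frame_labels)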

Similarly, it should also be okay to only take the last output of the lstm, since it should give you the score for the entire sequence (which is the label for the video you fed into the lstm).

Hi @VictorownzuA11
Because the emotion goes up and down over the course of a video, I think setting the same label for all frames is not really correct, is it? For example, in some frames the face shows no emotion, which should have a score of 0. From my understanding, this expert-labeled emotion score should be the peak score (or the average of the top 10 highest scores), though I'm not completely sure because it was labeled by an expert in emotion analysis.

About using the last frame: I think the characteristic of an LSTM is to use previous information to predict the current input, so it will focus on predicting the score for the current input (which is the last frame). As a result, the last output will be heavily influenced by the last input (please correct me if I'm wrong).

I get that the emotion might change per frame, but if the only label you have is for the entire video, these are the only two options I see, since the frames themselves are unlabeled.

The goal of your lstm should be to determine the emotion score given the sequence/video, so if you provide the entire video (given you can fit it all within your model), the lstm should produce the estimated emotion score as its output.

I might be incorrect, but you aren't using the 'last frame' for the output per se; instead you are taking the last hidden state (over the entire sequence) and using that to determine a final estimate, so if the sequence is long enough (i.e. contains some of the 'highest score' frames) it should generalize to the expected score.
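To make that concrete, here is a rough sketch with a batch_first nn.LSTM, showing that the 'last output' is the hidden state after the whole sequence has been processed (sizes are placeholders):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=128, batch_first=True)
x = torch.randn(4, 300, 512)              # (B, N, feat_dim) placeholder features

out, (h_n, c_n) = lstm(x)                 # out: (B, N, 128), h_n: (num_layers, B, 128)
last_from_out = out[:, -1, :]             # hidden state after the LSTM has seen all N frames
last_from_h_n = h_n[-1]                   # same values, taken from h_n
assert torch.allclose(last_from_out, last_from_h_n)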

Hi @VictorownzuA11

What I mean by using the 'last frame' is exactly the last hidden state; sorry if I confused you. I know that this last hidden state carries information about the whole sequence, and I have tried training the network using the last hidden state before. However, the results were not really good, and the only explanation I can think of is the strong influence of the last input on the last hidden state. So I'm trying to see if there is any alternative solution to this problem.

Can you provide any more info on the dataset/labels? With a sequence of frames and only one score for the entire video, I would guess your best bet is to return all the hidden states (and get an estimated score for each one), then take the average over them to get the result.
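As a rough sketch of what I mean, assuming the lstm head already returns one score per time step:

import torch
import torch.nn as nn

loss_function = nn.MSELoss()

B, N = 4, 300
per_step_scores = torch.randn(B, N, 1)    # placeholder: one estimated score per time step
label = torch.rand(B, 1) * 10             # single video-level score

video_score = per_step_scores.mean(dim=1) # average over time -> (B, 1)
loss = loss_function(video_score, label)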

Hi @VictorownzuA11

The dataset only has one label score per video, ranging from 0 to 10. The number of frames per video ranges from 150 to 700.

Regarding the average score over all time steps: I wonder how it will affect backpropagation through time in the LSTM. Since all time steps point to the same hidden units of the LSTM, the gradients from all time steps will probably accumulate, leading to unstable results, won't they?

Are there different emotions in the videos or just videos of one emotion with a single score?

If it's the former, you can use a one-hot encoding for the emotion and the magnitude of the encoding for the emotion score.
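A tiny sketch of what I mean (a hypothetical class list, just for illustration):

import torch

emotions = ["anger", "happiness", "sadness"]         # hypothetical class list
target = torch.zeros(len(emotions))

# e.g. sadness with an intensity of 7.5: the one-hot position carries the score
target[emotions.index("sadness")] = 7.5              # tensor([0.0, 0.0, 7.5])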

But I agree, taking the average isn't an optimal solution; given the lack of labels, though, I'm not sure how else you could approach this.

Hi @VictorownzuA11

It is just one real number indicating the intensity of one emotion. Therefore, I would try to train the network as a regression model using MSE.
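i.e. something along these lines (just a sketch of the training step I have in mind; the model here is a stand-in for the CNN-LSTM):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))  # stand-in for the CNN-LSTM
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_function = nn.MSELoss()

features = torch.randn(4, 512)            # placeholder per-video features
label = torch.rand(4, 1) * 10             # intensity score in [0, 10]

optimizer.zero_grad()
pred = model(features)
loss = loss_function(pred, label)
loss.backward()
optimizer.step()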

I see this is a difficult task due to the lack of labels. Please let me know if you have any other ideas, and thank you for spending your time on my question.

No problem. Just out of curiosity, can you provide a link to the dataset? I'd love to take a look at it and the labels too.

The name of the database is UNBC-McMaster; you can obtain it by going to this page: https://www.pitt.edu/~emotion/um-spread.htm
What I would like to do is estimate the OPR and/or VAS pain intensity level.