YOLO + ViTPose + LSTM for sequence classification

Hello. I am trying to create a simple skeleton-based posture recognition system for animals. I have fine-tuned a YOLO model. For every frame of a video, I detect a dog using YOLO and then compute 2D skeletal keypoints using ViTPose. Then I compute the following features:

  1. bounding box aspect ratio
  2. all keypoint coordinates, normalized with respect to the bounding box center
  3. some joint angles (unsigned, computed with torch.cosine_similarity; sketched below)
  4. some keypoint distances (signed, normalized with respect to the bounding box)

To account for occlusions and low-confidence predictions, I create a mask array of 1s and 0s with the same shape as the features.
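For reference, the angle and normalization features can be computed roughly like this (a minimal sketch; the helper names joint_angle_deg and normalized_kpts are just for illustration):

import torch

def joint_angle_deg(a, b, c):
    # a, b, c: (2,) tensors of (x, y) coordinates; b is the vertex joint.
    # Unsigned angle at b between the rays b->a and b->c.
    v1 = (a - b).unsqueeze(0)  # (1, 2) so cosine_similarity reduces over dim=1
    v2 = (c - b).unsqueeze(0)
    cos = torch.cosine_similarity(v1, v2, dim=1)
    cos = cos.clamp(-1.0, 1.0)  # guard against rounding slightly outside [-1, 1]
    return torch.rad2deg(torch.acos(cos)).item()

def normalized_kpts(kpts, box):
    # kpts: (K, 2) keypoint coordinates; box: (x1, y1, x2, y2) tuple.
    x1, y1, x2, y2 = box
    center = torch.tensor([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    size = torch.tensor([x2 - x1, y2 - y1])
    return (kpts - center) / size  # roughly in [-0.5, 0.5] inside the box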
In my custom dataset class, I create overlapping sequences (to augment the data somewhat):
def create_overlapping_sequences(self, features, masks, class_idx):
    num_frames = features.shape[0]
    sequences = []
    labels = []
    mask_sequences = []
    # Slide a window of sequence_length frames with the given stride.
    # Note: the stop value must be num_frames - sequence_length + 1; otherwise
    # the last window comes out shorter than sequence_length and batching breaks.
    for frame_index in range(0, num_frames - self.sequence_length + 1, self.stride):
        sequences.append(features[frame_index : frame_index + self.sequence_length])
        mask_sequences.append(masks[frame_index : frame_index + self.sequence_length])
        labels.append(class_idx)
    return sequences, mask_sequences, labels
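
A quick sanity check of the shapes this produces (made-up sizes, assuming sequence_length=30 and stride=5):

import torch

features = torch.randn(100, 64)  # pretend clip: 100 frames, 64 features per frame
starts = range(0, 100 - 30 + 1, 5)  # start indices 0, 5, ..., 70 -> 15 windows
seqs = [features[s : s + 30] for s in starts]
batch = torch.stack(seqs)
print(batch.shape)  # torch.Size([15, 30, 64]) -> (batch, seq_len, num_features)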

Then, in the LSTM classifier, I do:

def forward(self, x, mask=None):
    """
    Forward pass.

    Args:
        x: Input tensor of shape (batch_size, sequence_length, num_features)
        mask: Optional binary tensor of the same shape as x; 1 marks valid
            features, 0 marks occluded / low-confidence ones.

    Returns:
        logits: Output tensor of shape (batch_size, num_classes)
    """
    if mask is not None:
        x = x * mask  # zero out occluded / low-confidence features

    x = self.input_projection(x)

    lstm_out, (hidden, cell) = self.lstm(x)
    last_output = lstm_out[:, -1, :]  # hidden state at the last time step
    logits = self.classifier(last_output)

    return logits

I am trying to handle occluded features with the masking. For example, when calculating the angle between three joints: if all keypoints have visibility 1, I add the value in degrees to my features array and a 1 to the mask array. If even one keypoint has visibility 0, I add a 0 to the mask and -999 (a large sentinel value) as the feature. The same goes for all other features, including the normalized keypoint coordinates (normalized with respect to the bounding box center) and the inter-joint distances.
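In code, the per-feature scheme looks roughly like this (a simplified sketch; joint_angle_deg is the helper sketched above, and the indices are placeholders):

SENTINEL = -999.0  # placeholder value stored for occluded features

def angle_feature(kpts, vis, i, j, k):
    # kpts: (K, 2) keypoints; vis: (K,) visibility flags (0 or 1).
    # Returns (feature_value, mask_bit) for the angle at joint j.
    if vis[i] and vis[j] and vis[k]:
        return joint_angle_deg(kpts[i], kpts[j], kpts[k]), 1.0
    return SENTINEL, 0.0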

No matter what I do, this classifier keeps putting out the same label for all postures. How do I go about solving this problem? I have been at it for two days now and am at my wits' end.