Both training and validation loss are not decreasing

I am trying to fine-tune a transformer-encoder-based pose estimation model (ViTPose) available on Hugging Face.

When I pass a “labels” argument to the model’s forward pass, it raises “Training not enabled”, so I implemented the loss myself. The core logic is as follows: since the model outputs heatmaps, I run a post-processing pipeline (a soft argmax) to recover keypoint predictions in image space and compute an MSE loss between these reconstructed keypoints and the ground-truth keypoints.
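Roughly, the loss looks like this (a simplified sketch, not the exact code from my notebook; the temperature value, tensor shapes, and the assumption that the ground-truth keypoints are already in heatmap coordinates are placeholders):

```python
import torch
import torch.nn.functional as F

def soft_argmax(heatmaps, temperature=1.0):
    """Differentiable argmax over heatmaps.

    heatmaps: (B, K, H, W) raw heatmap logits from the model.
    Returns keypoint coordinates (B, K, 2) in (x, y) heatmap space.
    """
    b, k, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.view(b, k, -1) / temperature, dim=-1).view(b, k, h, w)

    # Expected x and y coordinates under the per-keypoint softmax distribution
    xs = torch.arange(w, dtype=heatmaps.dtype, device=heatmaps.device)
    ys = torch.arange(h, dtype=heatmaps.dtype, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # (B, K)
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # (B, K)
    return torch.stack([x, y], dim=-1)

def keypoint_mse(heatmaps, gt_keypoints, visibility):
    """MSE between soft-argmax keypoints and ground truth, masked by visibility.

    gt_keypoints: (B, K, 2) in the same (heatmap) coordinate space.
    visibility:   (B, K), 1 for labelled keypoints, 0 otherwise.
    """
    pred = soft_argmax(heatmaps)
    err = ((pred - gt_keypoints) ** 2).sum(dim=-1)          # (B, K)
    return (err * visibility).sum() / visibility.sum().clamp(min=1)
```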

Is this a correct way of thinking? Comparing heatmap to heatmap might seem more intuitive, but I didn’t want to write the keypoint-to-heatmap conversion and replicate the specific image processor’s normalization, out of worry that the scales might not match.

Things I have tried:

1. Modified the model’s heatmap head to predict 24 keypoints for dogs instead of the 17 human keypoints it was trained on.
2. Added a simple adapter network right after the layer norm and before the model’s heatmap head (rough sketch after this list).
3. Gradually unfroze some of the backbone layers.
4. Tracked the loss with both normalized and unnormalized keypoints.
5. Added gradient clipping.
6. Made the post-processing pipeline a differentiable approximation of transformers/src/transformers/models/vitpose/image_processing_vitpose.py at main · huggingface/transformers · GitHub (roughly the soft argmax shown above).
7. Tuned learning rates.

However, the gradients flowing through the head, the adapter, and the deeper encoder layers have been very, very small.
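For context, the adapter and the gradual unfreezing look roughly like this (a simplified sketch; the parameter-name patterns at the bottom are placeholders, the real names come from `model.named_parameters()`):

```python
import torch.nn as nn

class SimpleAdapter(nn.Module):
    """Bottleneck adapter: down-project -> GELU -> up-project, with a residual connection.

    The up-projection is zero-initialised so the adapter starts as an identity
    and does not disturb the pretrained features at step 0.
    """
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def set_trainable(model, patterns):
    """Freeze everything, then unfreeze parameters whose names contain any pattern."""
    for name, param in model.named_parameters():
        param.requires_grad = any(p in name for p in patterns)

# Placeholder name patterns -- check the real ones with [n for n, _ in model.named_parameters()]
# set_trainable(model, ["head", "adapter", "encoder.layer.11", "encoder.layer.10"])
```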

Here is the notebook link: ML/ViTpose_update_loss.ipynb at main · sohamb91/ML · GitHub

Looking forward to any discussions/help.

I’m at a complete “loss”.

Edit: I have also tried normalising the keypoint coordinates with respect to the bounding boxes, as well as comparing the model’s heatmap output with Gaussians generated at the label keypoint coordinates.
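The Gaussian targets were generated along these lines (a simplified sketch; the sigma and heatmap size are placeholders, and the keypoints are assumed to already be in heatmap coordinates):

```python
import torch

def gaussian_target_heatmaps(keypoints, visibility, heatmap_size=(64, 48), sigma=2.0):
    """Render one 2D Gaussian per keypoint.

    keypoints:  (B, K, 2) in (x, y) heatmap coordinates.
    visibility: (B, K), 1 for labelled keypoints, 0 otherwise.
    Returns target heatmaps of shape (B, K, H, W).
    """
    h, w = heatmap_size
    b, k, _ = keypoints.shape
    ys = torch.arange(h, dtype=torch.float32, device=keypoints.device).view(1, 1, h, 1)
    xs = torch.arange(w, dtype=torch.float32, device=keypoints.device).view(1, 1, 1, w)
    mu_x = keypoints[..., 0].view(b, k, 1, 1)
    mu_y = keypoints[..., 1].view(b, k, 1, 1)
    heatmaps = torch.exp(-((xs - mu_x) ** 2 + (ys - mu_y) ** 2) / (2 * sigma ** 2))
    # Zero out the channels of unlabelled keypoints
    return heatmaps * visibility.view(b, k, 1, 1)

def heatmap_mse(pred_heatmaps, target_heatmaps):
    """Plain MSE between predicted and target heatmaps."""
    return ((pred_heatmaps - target_heatmaps) ** 2).mean()
```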