Any help for a facial keypoint detection model?

Hi!

I’m training a model for facial keypoint detection, using the Helen dataset.
I’m using an ImageNet-pretrained MobileNetV2 as the backbone, retraining only the final layers for 10 epochs on the full dataset.
I’m getting very poor results and I wanted to know whether someone could help me out. Basically, the model puts all the predicted points near the center of the image, and after the first epoch both the training and validation losses stop decreasing.

I’m not doing particularly fancy things and I don’t see what prevents training.
First, the model:

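class FaceModel(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.mobilenet_v2(pretrained=False)
        features = backbone.features
        # Freeze the backbone so that only the prediction head is trained.
        for p in features.parameters():
            p.requires_grad = False
        predictions = nn.Sequential(
            nn.AdaptiveAvgPool2d(output_size=(1, 1)),
            nn.Flatten(),  # (N, 1280, 1, 1) -> (N, 1280) for the linear layers
            nn.Linear(1280, 1000),
            nn.Linear(1000, 1000),
            nn.ReLU(inplace=True),
            # ...
        )
        self.predictor = nn.Sequential(features, predictions)

    def forward(self, x):
        return self.predictor(x)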

Then, for training, I do:

predictor = FaceModel()
loss_fn = nn.MSELoss(reduction='sum')  # same behaviour as the deprecated size_average=False
adam = optim.Adam([p for p in predictor.parameters() if p.requires_grad], lr = 3e-4)

for epoch in range(epochs):
    predictor.train()
    train_loss = 0.
    for i, sample in enumerate(train_loader):

        x = sample['image']
        y = sample['points']

        preds = predictor(x)
        loss = loss_fn(preds, y)

        # standard optimisation step
        adam.zero_grad()
        loss.backward()
        adam.step()

        # keep track of metrics,
        # validation loop...

Concerning data augmentation, I first rescale the images to 224×224, then make sure they are float values between 0 and 1, and I also make sure the keypoint coordinates lie between 0 and 1. I apply horizontal flipping with p = 0.5 and that’s it. Should I use the ImageNet statistics to normalize the image channels?
Other than that, I really don’t see an obvious error. Could someone point me in the right direction?
Thanks a lot!

When you apply your horizontal flipping, you most likely aren’t re-assigning the keypoints, so the network sees two conflicting targets for similar-looking faces and learns to output their average, which ends up near the center. Namely, you need not only to flip the coordinates of the keypoints, but also to re-map them to their new labels, i.e. after the flip the left-eye points must become the right-eye points, and vice versa.
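
Here is a minimal sketch of what that re-mapping could look like, assuming the keypoints are stored as (x, y) pairs normalized to [0, 1] as described in the question. The six-point layout and the names `FLIP_INDEX` and `hflip_keypoints` are made up for illustration. The Helen annotations have many more points and their own ordering, so the index table would have to be built from the actual annotation scheme. The image itself would be flipped along its width axis in the same transform.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Hypothetical 6-point layout, for illustration only (NOT the Helen ordering):
#   0: left eye, 1: right eye, 2: nose tip,
#   3: left mouth corner, 4: right mouth corner, 5: chin
# FLIP_INDEX[i] is the index of the original point that fills slot i
# after mirroring: left/right pairs swap, points on the midline stay put.
FLIP_INDEX = [1, 0, 2, 4, 3, 5]


def hflip_keypoints(points: Sequence[Point],
                    flip_index: Sequence[int]) -> List[Point]:
    """Mirror normalized keypoints horizontally and swap left/right labels."""
    # Step 1: mirror the x coordinate. With x normalized to [0, 1] this is
    # x -> 1 - x (with pixel coordinates it would be x -> width - 1 - x).
    mirrored = [(1.0 - x, y) for x, y in points]
    # Step 2: re-assign labels, so that e.g. slot 0 ("left eye") now holds
    # the mirrored position of what used to be the right eye.
    return [mirrored[j] for j in flip_index]


if __name__ == "__main__":
    pts = [(0.25, 0.5), (0.625, 0.375), (0.5, 0.5),
           (0.25, 0.75), (0.75, 0.875), (0.5, 1.0)]
    flipped = hflip_keypoints(pts, FLIP_INDEX)
    print(flipped[:2])  # [(0.375, 0.375), (0.75, 0.5)]
```

If only step 1 is applied, the "left eye" slot ends up on the right side of the image for every flipped sample, which is exactly the kind of contradictory target that pushes the predictions towards the middle. A quick sanity check is that flipping twice should give back the original points.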