PyTorch and facial expressions


I’ve been trying for a couple of months now to train a network to recognise facial expressions.
When I discovered PyTorch I spent time reading all the tutorials, blogs on CNN, and so on.
I’ve spent a good 5-6 years on ML and yet I’m still dumbfounded by how complex training a CNN appears to be.
I’ve got a very large dataset of facial expressions (half a million images) and a Titan Xp at my disposal.
At first I tried copying simple CNN architectures from various papers. I tried small inputs (64x64), single-channel grayscale, etc.
I then tried using existing models (VGG11, VGG19) on grayscale, and recently I tried with RGB.
All my attempts have failed so far. Most won’t even converge, and the few that do (using face valence rather than labels/classes) perform very poorly at evaluation (less than 35% accuracy).

I want to ask:

  • can I fine tune an existing pre-trained network such as ResNet152 or Inception for my task? Do I simply freeze the feature extractors, change the last layer/classifier and re-train? I’ve searched current answers, posts and GitHub repositories but it’s not very clear.
  • does normalisation play an important role? I can calculate means and standard deviation for the dataset
  • should I even bother training a large network from scratch? E.g, VGG19, ResNet18, etc?
  • what kind of accuracy during evaluation should I realistically expect?
  • I’ve tried using data augmentation, but it appears to have made matters worse (e.g., CE never drops below a certain value)
    I can probably throw a second Titan Xp at the task, but this is becoming very stressful. Any help, advice, feedback or criticism is more than welcome!

I assume your facial expression use case is a form of an image classification task, i.e. each image has a single label.
Your first approach sounds fine. Fine tuning a model is often easier than training the model from scratch.
If you try to fine tune a model, you should try to stick to the preprocessing of the pre-trained model as much as possible. E.g. your pre-trained model was most likely trained with normalized images, so you should apply the same normalization to your images.
Generally, using the ImageNet mean and std works reasonably well on “natural” images, i.e. color images from approx. the same domain. Medical images from a CT scanner might need other values, but that’s not the case in your task.

I would recommend to start with a really small dataset, e.g. one single image or maybe 10 images. If your model can’t overfit this tiny dataset, you might have a bug somewhere in your code.
If it works, you could try to scale up a bit.
I wouldn’t start with all images at first, as this might make debugging hard.
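The overfitting check suggested here could look roughly like the sketch below. It uses ten random tensors as a stand-in for ten real images and a tiny fully connected model instead of a CNN, both of which are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Ten random "images" with random labels: a stand-in for 10 real samples.
x = torch.randn(10, 3, 32, 32)
y = torch.randint(0, 7, (10,))

# Deliberately small model; the point is the sanity check, not the architecture.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 64),
    nn.ReLU(),
    nn.Linear(64, 7),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(x), y)  # full-batch loss on the tiny dataset
    loss.backward()
    optimizer.step()

train_acc = (model(x).argmax(1) == y).float().mean().item()
print(f"final loss {loss.item():.4f}, train accuracy {train_acc:.2f}")
```

If a loop like this cannot drive the training accuracy to (nearly) 100% on a handful of samples, the problem is more likely in the data pipeline or training code than in the hyperparameters.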

If your model performs well, you could continue with adding data augmentation.
Generally I would try to focus on small and simple use cases and make sure there are no obvious bugs.


Thanks a ton ptrblck, this makes a lot of sense. Throwing half a million images at it is probably the problem here. I’ve removed norm, and been using half, so I’ll just try with AlexNet for now, and use a small subset of the full dataset.

I am back with an update:

  1. I am using AlexNet pretrained. It fails to converge at all (CE remains higher than 1.0)
  2. When I unfreeze the convolution layers it starts to learn, but only so slightly.
  3. I am using a small dataset (4000 images), and when training it achieves about 0.1 to 0.01 CE however upon evaluation it is less than 30% accurate (correct out of total evaluated images)
  4. I’ve played with the hyperparameters quite a bit: learning rate, momentum, L2, epochs, etc.
  5. I am using RGB and not grayscale, and the default 224x224 input. I normalise and resize to 224x224

I am wondering:

  1. Why am I getting better results when I try to classify valence instead of expression labels (3 classes instead of 7)
  2. Why am I stuck at similar accuracy regardless of hyperparameter changes (about 25% to 30%)
  3. Should I be increasing the training size?
  4. Is my image preprocessing enough? E.g., resize and normalise?
  5. Does using fewer output classes make such a big difference?
  6. Why does the CE yo-yo up and down in most cases, is this an indication of failing to converge?

@ptrblck I appreciate any help or advice here, I know my questions are not so much about PyTorch.

How is your training accuracy? Could you overfit your data using 4000 images?
If not, could you first try to get a nearly perfect accuracy using very few samples, e.g. 10?
If that’s not possible, you might have a bug somewhere, e.g. forgetting to zero out the gradients.
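To make the "forgetting to zero out the gradients" bug concrete, here is a small sketch of how PyTorch accumulates gradients across `backward()` calls. The single-weight setup is purely illustrative.

```python
import torch

w = torch.ones(1, requires_grad=True)

# First backward pass: d(3*w)/dw = 3
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added to the old one.
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Zeroing first restores the expected per-step gradient.
w.grad.zero_()
(3 * w).sum().backward()
fresh = w.grad.clone()

print(first.item(), accumulated.item(), fresh.item())  # 3.0 6.0 3.0
```

In a training loop, `optimizer.zero_grad()` plays the role of `w.grad.zero_()` for every parameter the optimizer manages.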

How are you normalizing your images? Are you using the ImageNet mean and std or are you calculating both using your data?

What kind of data are you using, i.e. did you generate it yourself or are you using a dataset from someone else? Can you estimate how clean the data is? Could it be that expression labels are mixed sometimes?

How large is your batch size? A small batch size will look noisier than a bigger one.
If your loss does not have a decreasing trend, the training is stuck.
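One way to judge whether a noisy per-batch loss has a decreasing trend is to smooth it before looking at it. The sketch below uses an exponential moving average on synthetic loss values; the `ema` helper and the numbers are assumptions for illustration only.

```python
import random

def ema(values, beta=0.9):
    """Exponential moving average with bias correction for the first steps."""
    smoothed, avg = [], 0.0
    for t, v in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * v
        smoothed.append(avg / (1 - beta ** t))
    return smoothed

random.seed(0)
# Synthetic per-batch losses: a slow downward trend buried in noise.
losses = [1.5 - 0.005 * i + random.uniform(-0.3, 0.3) for i in range(200)]
smooth = ema(losses)

print(f"raw:      {losses[0]:.2f} -> {losses[-1]:.2f}")
print(f"smoothed: {smooth[0]:.2f} -> {smooth[-1]:.2f}")
```

Plotting `smooth` next to `losses` usually makes it clear whether the yo-yo pattern hides a downward trend or the training really is stuck.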

Could you post your code so that we can have a look for obvious bugs?

Hi @ptrblck

Let me try and answer with a list:

  • Data-set is AffectNet:
  • I’ve removed certain labels as they seem to confuse the networks (uncertain, non-face, etc)
  • My top-1 with 7 labels is always below 30%, if I use only valence score (2 output classes) it goes up to 62%
  • I’ve calculated the mean and std of the entire data-set and I’m using them for normalisation. Using the ImageNet mean and std was only 1% to 2% worse.
  • Input is RGB 224x224
  • The data seems to be from internet sources, I have no opinion on how “clean” or accurate it is, but the paper says that it has been human-annotated/labeled for the part I am using. I am open to using other datasets if you know a better one to suggest
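For reference, per-channel mean and std over a whole dataset can be accumulated batch by batch, as in the sketch below. The `channel_stats` helper and the random batches are hypothetical; with real data the batches would come from a DataLoader that yields unnormalised tensors in [0, 1].

```python
import torch

def channel_stats(batches):
    """Per-channel mean and std over an iterable of (N, 3, H, W) float batches."""
    n_pixels = 0
    total = torch.zeros(3, dtype=torch.float64)
    total_sq = torch.zeros(3, dtype=torch.float64)
    for batch in batches:
        b = batch.double()
        n_pixels += b.shape[0] * b.shape[2] * b.shape[3]
        total += b.sum(dim=(0, 2, 3))            # running sum per channel
        total_sq += (b ** 2).sum(dim=(0, 2, 3))  # running sum of squares per channel
    mean = total / n_pixels
    std = (total_sq / n_pixels - mean ** 2).sqrt()  # population std: sqrt(E[x^2] - E[x]^2)
    return mean.float(), std.float()

torch.manual_seed(0)
# Stand-in data: uniform noise in [0, 1], so mean is about 0.5 and std about 0.289.
fake_batches = [torch.rand(8, 3, 64, 64) for _ in range(4)]
mean, std = channel_stats(fake_batches)
print(mean, std)
```

The resulting lists would then be passed to `transforms.Normalize(mean, std)` in place of the ImageNet values.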

I have 3 python scripts really, one which loads the custom dataset (I’ve written unit tests) and basically does the following:

  • load image
  • normalise and tensorify

A script of models which wraps around the output layers of AlexNet, SqueezeNet, VGG11, VGG19 (Only experimenting with AlexNet so far)

And the train and evaluate script with all the hyperparameters.
I have tuned it down to using:

  • 0.001 learning rate
  • 0.9 momentum and SGD optimiser in combination with CE loss
  • 0.0005 L2
  • 100 epochs
  • 64 or 128 batch size
  • 4000 images for training (out of about half a million) which are randomly picked. I am filtering uncertain, unknown and non-faces
  • I am using a MultiStepLR scheduler, which seems to slightly improve accuracy
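The optimiser settings listed above could be wired up roughly as follows. The milestones and gamma are assumptions for illustration, since the post does not state them, and the `nn.Linear` model is only a stand-in for the real network.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 7)  # stand-in for the real network

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9,
                            weight_decay=0.0005)  # weight_decay is the L2 term
# Assumed schedule: multiply the learning rate by 0.1 at epochs 30 and 60.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

lrs = []
for epoch in range(100):
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... one epoch of training would go here ...
    optimizer.step()   # placeholder so the scheduler is stepped after the optimizer
    scheduler.step()   # advance the schedule once per epoch

print(lrs[0], lrs[30], lrs[60])
```

Logging the learning rate per epoch like this is a cheap way to confirm the schedule is doing what was intended.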

I can post the code if needed, but my most striking observation is that changing the output from 7 labels to 2 labels makes a huge impact as already mentioned. Trying with the original 11 labels would produce at best 20% top-1 accuracy.
I’ve been logging all training attempts in a mongo DB since this morning as I am trying to take a systematic and methodological approach. One observation I’ve made is that CE was exploding to a NaN when I used half instead of float and that with larger training data-sets it would go beyond 1.0 easily.
Overfitting seems to happen with the smaller training set (I got CE down to 0.000X) but it is always a bit noisy.
I’ll commit the code and then copy-paste the training and model scripts.
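The NaN observation with half precision is consistent with the limited range of FP16. The sketch below only demonstrates that range limit; it is not a reconstruction of what actually happened in the poster's training run.

```python
import torch

# Largest finite value representable in FP16 versus FP32.
print(torch.finfo(torch.float16).max)
print(torch.finfo(torch.float32).max)

big = torch.tensor([70000.0])
as_half = big.half().float()   # round-trip through FP16
print(as_half)                 # the value no longer fits and becomes inf

diff = as_half - as_half       # inf - inf is undefined
print(diff)                    # nan
```

Once an `inf` appears in an activation or a loss term, operations such as `inf - inf` turn it into `nan`, which then propagates through every later update.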

The training, evaluation and hyperparameter script is this

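import torch
import torch.nn as nn

from affectnet_cpu import affectnet_cpu
from affectnet_gpu import affectnet_gpu
from evaluate import evaluate
import logger as logger
import models as models

num_epochs    = 200
batch_size    = 256
learning_rate = 0.001
momentum      = 0.9
l2_reg        = 0.0005
datasize      = 5000

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

net = models.alexnet()
net.to(device)  # the inputs below are moved to the GPU, so the model must be too

# NOTE: the rest of this call (CSV path etc.) was cut off in the original post
train_data = affectnet_gpu('../data/affectnet_images', ...)

# NOTE: the DataLoader arguments were lost in the original post; these are assumed
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
criterion = nn.CrossEntropyLoss().cuda()
# only parameters that were not frozen are handed to the optimizer
optimizer = torch.optim.SGD(filter(lambda p: p.requires_grad, net.parameters()),
                            lr=learning_rate,
                            momentum=momentum,
                            weight_decay=l2_reg) #try with zero!

# NOTE: the milestones (and gamma) were cut off in the original post
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, ...)
CE = []
total_step = len(train_loader)

for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        inputs = images.cuda().float()
        labels = labels.cuda().long()

        outputs = net(inputs)

        # labels are one-hot vectors; CrossEntropyLoss expects class indices
        ideal = labels.argmax(1)
        loss = criterion(outputs, ideal)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 20 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                        .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
    # assumed: the scheduler is stepped once per epoch
    scheduler.step()

# NOTE: both calls below were cut off in the original post, as above
test_data = affectnet_gpu('../data/affectnet_images', ...)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size)

correct, total = evaluate(net, test_data, test_loader)
print(correct, total)
accuracy = float(correct) / float(total)
print("accuracy", accuracy)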

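The models file is the following. Please note I adjust accordingly, to either 8 labels or 2 depending on whether I’m using valence or expression labels.

import torch
import torch.nn as nn
import torchvision.models as models

# NOTE: the attribute that holds the pretrained network and the bodies of the
# freezing loops were lost in the original post; `self.model` and the loop
# bodies below are assumed reconstructions.

# Custom VGG11
class VGG11(nn.Module):
    def __init__(self):
        super(VGG11, self).__init__()
        self.fc = nn.Linear(1000, 8)
        self.model = models.vgg11(pretrained=True)
        # freeze the pretrained convolutional layers
        for p in self.model.features.parameters():
            p.requires_grad = False

    def forward(self, x):
        f = self.model(x)
        y = self.fc(f)
        return y

# Custom VGG19 with BN
class VGG19BN(nn.Module):
    def __init__(self):
        super(VGG19BN, self).__init__()
        self.layer1 = nn.Linear(1000, 8)
        self.model = models.vgg19_bn(pretrained=True)
        # freeze the pretrained convolutional layers
        for p in self.model.features.parameters():
            p.requires_grad = False

    def forward(self, x):
        f = self.model(x)
        y = self.layer1(f)
        return y

# Custom AlexNet 2 label output
class alexnet(nn.Module):
    def __init__(self):
        super(alexnet, self).__init__()
        self.fc = nn.Linear(1000, 2)
        self.model = models.alexnet(pretrained=True)

    def forward(self, x):
        f = self.model(x)
        y = self.fc(f)
        return y

# Custom SqueezeNet 2 label output
class squeezenet(nn.Module):
    def __init__(self):
        super(squeezenet, self).__init__()
        self.fc = nn.Linear(1000, 2)
        self.model = models.squeezenet1_1(pretrained=True)

    def forward(self, x):
        f = self.model(x)
        y = self.fc(f)
        return y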

I know I can do transfer learning which I’ve tried with Imagenet-trained networks and they all seem to produce worse accuracy, but I am willing to try again if you think it can work.

The actual affectnet script has two classes, one for CPU and one for GPU.
Just adding it here for clarity:

import torch
from torchvision import transforms
import pandas as pd
import os
import stat
from PIL import Image
from labels import labels
from random import shuffle
from torch.utils.data import Dataset
import threading

# upper bound on allocated GPU memory, in bytes (despite the name)
MAX_GPU_MB = 10980000000

class affectnet_gpu(Dataset):

    def __init__(self, img_path, csv_path, limit=414798):
        """
            image_path (string) is where the annotated images are
            csv_path (string) is where the CSV files (training and testing) are
        """
        self.img_path   = img_path
        self.labels     = labels(pd.read_csv(csv_path), img_path, limit)
        # *NOTE* the means and stds are on RGB 3 channel 224x224 images
        self.means      = [0.54019716, 0.43742642, 0.38931704]
        self.stds       = [0.24726599, 0.2232768, 0.21396481]
        normalize       = transforms.Normalize(self.means, self.stds)
        # ToTensor is assumed here (a line was lost in the original post);
        # Normalize needs a tensor, not a PIL image
        self.preprocess = transforms.Compose([transforms.Resize(size=(224,224)),
                                              transforms.ToTensor(),
                                              normalize])
        self.data = []
        print("Pre-processing and allocating data")
        for idx in range(len(self.labels.rows)):
            if torch.cuda.memory_allocated() < MAX_GPU_MB:
                self.upload_pair(idx)
        print("using affectnet set: ", len(self.data))

    # upload to CUDA/GPU a float `FP32` input tensor and its equivalent output label
    def upload_pair(self, idx):
        """
            @param idx (unsigned int) is the item index in the dataset
        """
        pair = self.process_row(idx)
        in_tensor  = pair[0].cuda(non_blocking=True).float()
        out_tensor = pair[1].cuda(non_blocking=True).float()
        self.data.append([in_tensor, out_tensor])

    # pre-process a row by opening the image, creating an output/label tensor
    # and setting it correctly, and then returning the pair, to be uploaded on the GPU
    def process_row(self, index):
        """
            @param index (unsigned int) is the item index in the dataset
        """
        item  = self.labels[index]
        file  = self.img_path + "/" + item["file"]
        img   = Image.open(file)
        array = self.valence(index)
        #array = self.classes(index)
        return self.preprocess(img).pin_memory(), array.pin_memory()

    # access an item in the dataset using @param index
    # @return a tuple of **input** tensor, **output** tensor
    def __getitem__(self, index):
        """
            @param index (unsigned int) is the item index in the dataset
            @return a pair already pre-processed and allocated on the GPU
        """
        return self.data[index]

    # get dataset length (size)
    def __len__(self):
        return len(self.data)

    # calculate classes output (one-hot over the 8 expression labels)
    def classes(self, index):
        item  = self.labels[index]
        array = torch.zeros((8,), dtype=torch.long)
        array[item["expression"]] = 1
        return array

    # calculate valence output
    def valence(self, index):
        """
            @param index is the item index in the dataset
            @return a one-hot vector of [x, y] where:
                - `x` is positive
                - `y` is negative
            NOTE: a neutral score (exactly 0.0) leaves the vector all zeros,
            which `argmax` in the training loop maps to class 0 (positive).
        """
        array = torch.zeros((2,), dtype=torch.long)
        item  = self.labels[index]
        score = item["valence"]
        if score > 0.0:
            array = torch.tensor([1, 0], dtype=torch.long)
        elif score < 0.0:
            array = torch.tensor([0, 1], dtype=torch.long)
        return array

In general I’ve tried to follow all the tutorials on PyTorch, searched on Stack Overflow and the forums here, did all the examples, etc. I am surprised that the 8 label classification fails so miserably with AlexNet, and I guess the 2/3 label classification when using valence, at 62% top-1 accuracy, is to be expected?

PS: I haven’t added the labels function, and the evaluate is:

import torch
import torch.nn as nn

from affectnet_cpu import affectnet_cpu
from affectnet_gpu import affectnet_gpu

def evaluate(model, test_data, test_loader):
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in test_loader:
            images = images.cuda(non_blocking=True).float()
            labels = labels.cuda(non_blocking=True).long()
            ideal  = labels.argmax(1)

            # compute output
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)

            total += ideal.size(0)
            correct += (predicted == ideal).sum().item()

    return correct, total

If I can offer a suggestion … This probably isn’t what you’re looking for (I’m actually having a similar problem getting Inception nets and things to recognise pyramids vs cubes – generated on the fly … Which I thought would be 1,000x easier … Same models that get 98% on MNIST in seconds can barely get above 50:50 … Fully batch normalised conv nets and everything).

However, if the task is principally what you’re trying to achieve, a route I believe human expression detection uses would be to first identify facial landmarks, then train another network to identify expressions on that data.

And for that you can use this to extract features:

from PIL import Image, ImageDraw
import face_recognition

image = face_recognition.load_image_file("obama.jpg")

face_landmarks_list = face_recognition.face_landmarks(image)

print("I found {} face(s) in this photograph.".format(len(face_landmarks_list)))

for face_landmarks in face_landmarks_list:

    for facial_feature in face_landmarks.keys():
        print("The {} in this face has the following points: {}".format(facial_feature, face_landmarks[facial_feature]))

    pil_image = Image.fromarray(image)
    d = ImageDraw.Draw(pil_image)

    for facial_feature in face_landmarks.keys():
        d.line(face_landmarks[facial_feature], width=5)

And that will give you this relatively quickly:

And you could use a very basic neural net on the vector data that generates – only a few dozen numbers; neatly categorised by facial landmark … Certainly how I’d do it – simplify the problem as much as possible … Partly because I’m on a Macbook – first thought is always: how can I avoid having to throw 10 billion numbers at a problem?

@Swift2046 Thank you very much for the suggestion, I will definitely look into that, since AffectNet already comes with landmarks detected. It also makes sense if processing and accuracy increase.
I’m guessing you are using DLib for the landmark detection? I’m already using MT-CNN for face detection which seems to be more accurate than OpenCV, so all I’m looking at really is extracting the landmarks.

Just this most recently:

Which seems to be a sort of wrapper for Google Vision Face Detection API, Microsoft Projectoxford Detection API and Akamai Image Converter API? I think I had slightly less consistent results with OpenCV.

I took a detour after struggling to train conv nets on these more abstract tasks, and that’s what led me to look at Recursive Convolutional Networks – so I might have a block that’s a 7x7 conv net, and then I’ll flatten it using a really long kernel, like 1x128 with 128 layers; feed that into a 2-layer LSTM; back to a linear layer … Seems to be a noticeable improvement on standard datasets

@Swift2046 How would you go about normalising features for ConvNets?
AFAIK, it is similar to RGB normalisation, where a value represents a coordinate in the matrix?
I’m still trying with AlexNet and managed to get 66% top-1 for Valence (-1 to 1) score and I guess I can make it go a bit higher with more data and/or better networks.
Since AffectNet already has the face features included in the CSV data I’m very inclined to test it.

Well I should warn, I’m a hack – I can get things to work, but generally benefit from talking to people who understand what they’re doing and know the terminology better.

I use this structure, so every Conv layer is batch normalised:

import torch.nn as nn
import torch.nn.functional as F

class BasicConv2d(nn.Module):

    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        # bias is redundant because BatchNorm adds its own shift
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        return F.relu(x, inplace=True)

But with vector features, I’d just do a simple divide by np.max or the image size – but there might be something I’m overlooking there?

@Swift2046 yeah don’t worry I do the same quite often…
I’d assume since the features are coordinates of pixels, they should be normalised with respect to max coordinates?
Not sure if it makes sense using tensor normalisation later on.
I’ll give it a try until the end of the week and report back.

Currently VGG11 is giving me decent results on the small subset (65% Top-1) when using valence scores.

Absolutely – you could either divide them all by the image height and width you’re using, or I suppose normalise each to fill the frame by first subtracting the minimum x value, then dividing by the maximum; doing the same with y; then take the larger of the two values and divide both x and y by that (to avoid stretching the coordinates).

That’s what I’d do, thinking about it.
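A minimal NumPy sketch of the second option described above (shift to the origin, then divide both axes by the larger extent so the face is not stretched). The `normalise_landmarks` name and the three sample points are made up for illustration.

```python
import numpy as np

def normalise_landmarks(points):
    """Shift landmarks to the origin and scale by the larger extent (aspect ratio preserved)."""
    pts = np.asarray(points, dtype=np.float64)
    pts = pts - pts.min(axis=0)   # smallest x and smallest y both become 0
    scale = pts.max()             # the larger of the x extent and the y extent
    return pts / scale if scale > 0 else pts

pts = [(10, 20), (30, 20), (20, 60)]
print(normalise_landmarks(pts))
```

For these points the x extent is 20 and the y extent is 40, so everything is divided by 40 and the result stays within [0, 1] on both axes.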

And yeah, since you’d be using 1s and 0s, there’d be no further need for normalisation … I’d be interested to try it myself.

Excuse the messiness – not tidied up or optimised yet, but just to see how it’d work.

This gets 55.9% accuracy relatively quickly with the Kaggle Facial Expression dataset, with 7 possible expressions, after converting them to feature vectors first. However it fails to convert about 29% of the original dataset – presumably because it wasn’t trained on small, grayscale images. So you’d have to mark it down on the challenge. The best model on Kaggle got about 70%, with the runners up around 50%, but you might be able to get this higher with more training.

import torchvision.datasets as dsets
import torchvision.transforms as transforms
import csv
import os
from PIL import Image, ImageDraw
import face_recognition
import torch
from torch import nn
import numpy as np
from torch.utils.data import Dataset, DataLoader
import math

def str2act(s):
    # compare strings with ==; `is` tests object identity and is unreliable here
    if s == 'none':
        return None
    elif s == 'hardtanh':
        return nn.Hardtanh()
    elif s == 'sigmoid':
        return nn.Sigmoid()
    elif s == 'relu6':
        return nn.ReLU6()
    elif s == 'tanh':
        return nn.Tanh()
    elif s == 'tanhshrink':
        return nn.Tanhshrink()
    elif s == 'hardshrink':
        return nn.Hardshrink()
    elif s == 'leakyrelu':
        return nn.LeakyReLU()
    elif s == 'softshrink':
        return nn.Softshrink()
    elif s == 'softsign':
        return nn.Softsign()
    elif s == 'relu':
        return nn.ReLU()
    elif s == 'prelu':
        return nn.PReLU()
    elif s == 'softplus':
        return nn.Softplus()
    elif s == 'elu':
        return nn.ELU()
    elif s == 'selu':
        return nn.SELU()
    else:
        raise ValueError("[!] Invalid activation function.")

class MLP(nn.Module):
    def __init__(self, num_layers, in_dim, hidden_dim, out_dim, activation='relu'):
        super(MLP, self).__init__()
        self.num_layers = num_layers
        self.in_dim = in_dim
        self.hidden_dim = hidden_dim
        self.out_dim = out_dim
        self.activation = str2act(activation)

        nonlin = True
        if self.activation is None:
            nonlin = False

        # hidden layers (the first one maps in_dim -> hidden_dim)
        layers = []
        for i in range(num_layers - 1):
            layers.extend(
                self._layer(
                    hidden_dim if i > 0 else in_dim,
                    hidden_dim,
                    nonlin,
                )
            )
        # output layer, no activation
        layers.extend(self._layer(hidden_dim, out_dim, False))

        self.model = nn.Sequential(*layers)

    def _layer(self, in_dim, out_dim, activation=True):
        if activation:
            return [
                nn.Linear(in_dim, out_dim),
                self.activation,
            ]
        else:
            return [
                nn.Linear(in_dim, out_dim),
            ]

    def forward(self, x):
        out = self.model(x.float())
        return out

def _load_data(path='fer2013.csv', expect_labels=True):

    assert path.endswith('.csv')

    # If a previous call to this method has already converted
    # the data to numpy format, load the numpy directly
    X_path = path[:-4] + '.X.npy'
    Y_path = path[:-4] + '.Y.npy'
    if os.path.exists(X_path):
        X = np.load(X_path)
        if expect_labels:
            y = np.load(Y_path)
        else:
            y = None
        return X, y

    csv_file = open(path, 'r')
    reader = csv.reader(csv_file)

    # Discard header
    row = next(reader)

    y_list = []
    X_list = []
    counter = 0
    skip_counter = 0

    for i, row in enumerate(reader):
        counter +=1
        y_str, X_row_str = (row[0], row[1])
        y = int(y_str)
        X_row_strs = X_row_str.split(' ')
        X_row = [float(x) for x in X_row_strs]
        X_row = np.reshape(X_row, (48,48))
        image = np.zeros((48, 48, 3), dtype=np.uint8)
        image[:,:,0] = X_row
        image[:,:,1] = X_row
        image[:,:,2] = X_row
        face_landmarks_list = face_recognition.face_landmarks(image)
        X_row = image

        landmarks_array = []
        for face_landmarks in face_landmarks_list:
            for facial_feature in face_landmarks.keys():
                for item in face_landmarks[facial_feature]:
                    landmarks_array.append(np.round(item[0] / 48, 5))
                    landmarks_array.append(np.round(item[1] / 48, 5))

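        # reconstructed: keep the sample only if a face was found, otherwise count it as missed
        if face_landmarks_list:
            y_list.append(y)
            X_list.append(landmarks_array)
        else:
            skip_counter +=1
    X = np.asarray(X_list)
    y = np.asarray(y_list)
    # cache the converted arrays so the next call can load them directly
    np.save(X_path, X)
    np.save(Y_path, y)
    print(skip_counter,'missed, out of', counter, ' - total:', 100 * (counter - skip_counter) / counter, '%')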

    return X, y

class PrepareData(Dataset):
    def __init__(self, x, y):
        self.x = torch.from_numpy(x) if not torch.is_tensor(x) else x
        self.y = torch.from_numpy(y) if not torch.is_tensor(y) else y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

X, y = _load_data()

NUM_TRAINING_IMAGES = int(len(X) * 0.9)

trainer_loader = PrepareData(x=X[:NUM_TRAINING_IMAGES], y=y[:NUM_TRAINING_IMAGES])
test_loader = PrepareData(x=X[NUM_TRAINING_IMAGES:], y=y[NUM_TRAINING_IMAGES:])


# Hyper Parameters
EPOCH = 1000
BATCH_SIZE = 64  # value missing from the original post; 64 is an assumed placeholder
LR = 0.0001

# Data Loader for easy mini-batch return in training
train_loader = DataLoader(dataset=trainer_loader, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_loader, batch_size=1, shuffle=True)

model = MLP(num_layers=5, in_dim=144, hidden_dim=256, out_dim=7)

optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.7)
loss_func = nn.CrossEntropyLoss()
average_loss = 2
correct_ratio = 0

for epoch in range(EPOCH):
    for step, (b_x, b_y) in enumerate(train_loader):
        # b_x = b_x.view(-1, 28, 28)        # reshape x to (batch, time_step, input_size)
        output = model(b_x)
        loss = loss_func(output, b_y)
        optimizer.zero_grad()               # clear gradients from the previous step
        loss.backward()                     # backpropagate
        optimizer.step()                    # update the weights

        if step % 10 == 0:
            average_loss = average_loss * 0.95 + loss.item() * 0.05
            print('Epoch:', epoch, '- step:', step, '- loss:', np.round(loss.item(),5), 
                '- average loss:', np.round(average_loss, 5), '- correct:', correct_ratio, '%')
    # evaluate on the held-out 10% every 5 epochs
    if epoch % 5 == 0:
        correct = 0
        incorrect = 0

        for data, target in test_loader:
            predicted_answer = np.argmax(model.forward(data).detach().numpy())
            right_answer = target[0]
            if int(right_answer) == int(predicted_answer):
                correct += 1
            else:
                incorrect += 1
        print('Test Correct: {}/{}'.format(correct, correct+incorrect))
        correct_ratio = np.round(100 * (correct / (correct+incorrect)), 1)

Hi Alex, I went through your code in the file. Here’s what I thought can be changed to make it train in a better way.

Typically, when doing transfer learning, we take a network trained on a dataset D, remove the final classification layer used for D, and then add our own final layer for classification. The objective here is to learn from the final encodings learnt from dataset D. Hence, a better way of utilizing the pre-trained network is shown below for VGG11.

""" Custom VGG11 """ 
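class VGG11(nn.Module):
    def __init__(self):
        super(VGG11, self).__init__()
        # NOTE: the attribute name was lost in the original post; `self.model` is assumed
        self.model = models.vgg11(pretrained=True)
        # Replacing final classification layer with custom classification layer
        self.model.classifier[-1] = nn.Linear(4096, 2)

    def forward(self, x):
        f = self.model(x)
        return f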

Hi @Mazhar_Shaikh thanks for the source and advice.
I’m a bit confused though, what is the purpose of removing the last classification layer and replacing it with another?
I know that in the approach I take, I’m adding an extra layer on top of it, which probably delays learning by a small amount. Is that what you have changed?

@Swift2046 Wow, I am speechless, thanks for the script! I’ll try and test it on Friday.
I’m guessing from a brief look that you’re using 48 features and wrote a custom MLP to test classification?
I’ll let you know if/how it works on the AffectNet dataset.

Hi Alex, This may be a good place to read up on Transfer Learning.

The ImageNet-pretrained network’s final output is a vector of class scores (logits) for the 1000 classes present in ImageNet. Hence, it is possible that that particular final layer is in a deep local minimum that may be hard to escape from. Effectively, the classification you were performing looks like
P(face expression| face image) = P(imagenet class| face image) * P(face expression|imagenet classes)
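To contrast the two setups discussed here, the sketch below stacks a new layer on top of a 1000-way output versus replacing the final layer. `TinyBackbone` is a hypothetical stand-in for a pretrained torchvision model (same `.features` / `.classifier` layout, much smaller), used so the example runs without downloading weights.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained model with .features and .classifier."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1000))

    def forward(self, x):
        return self.classifier(self.features(x))

# Option A: keep the 1000-way output and stack a new layer on top of it.
stacked = nn.Sequential(TinyBackbone(), nn.Linear(1000, 2))

# Option B: freeze the feature extractor and replace the final layer directly.
replaced = TinyBackbone()
for p in replaced.features.parameters():
    p.requires_grad = False
replaced.classifier[-1] = nn.Linear(16, 2)

x = torch.randn(4, 3, 32, 32)
print(stacked(x).shape, replaced(x).shape)

trainable = [name for name, p in replaced.named_parameters() if p.requires_grad]
print(trainable)
```

In option B the new classifier sees the penultimate features directly, rather than a 1000-dimensional summary expressed in terms of ImageNet classes.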

This would be a problem only with a pretrained=True network, correct?
AFAIK what you’re suggesting minimises network complexity so I’ll try it with my next iteration.
Many thanks!

It was a good challenge! I was intrigued how well a MLP would handle coordinates on a task like this … Better than I thought … With a higher learning rate, it gets to 50% in one epoch.

It’s actually using 72 features (X & Y coordinates, meaning 144 inputs) … I was wondering if you could strip the chin, and just focus on eyes, nose and mouth … The images are 48 x 48 grayscale, so I’m dividing by 48 to normalise (because I was too lazy to write something that would crop them too – but that might be a quick way to improve accuracy).
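Dropping the chin could be done when flattening the landmarks, roughly as sketched below. The dictionary mimics the per-face structure that `face_recognition.face_landmarks` returns, but the point counts per feature and the coordinates are assumptions for this example, as is the `landmarks_to_vector` helper.

```python
def landmarks_to_vector(face_landmarks, size=48, skip=("chin",)):
    """Flatten {feature: [(x, y), ...]} into [x0, y0, x1, y1, ...], scaled by image size."""
    vec = []
    for name, points in face_landmarks.items():
        if name in skip:
            continue  # leave out unwanted features, e.g. the chin outline
        for x, y in points:
            vec.extend([x / size, y / size])
    return vec

# Assumed number of points per feature (72 in total, matching the 144 inputs above).
counts = {
    "chin": 17, "left_eyebrow": 5, "right_eyebrow": 5,
    "nose_bridge": 4, "nose_tip": 5, "left_eye": 6, "right_eye": 6,
    "top_lip": 12, "bottom_lip": 12,
}
# Made-up coordinates, just to exercise the function.
fake_face = {name: [(i, i) for i in range(n)] for name, n in counts.items()}

print(len(landmarks_to_vector(fake_face, skip=())))  # 144
print(len(landmarks_to_vector(fake_face)))           # 110
```

With the chin removed, the MLP's `in_dim` would need to change from 144 to 110 to match the shorter vector.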

Quite a bit of the _load_data routine is putting grayscale image data into an RGB format the face_recognition script would recognise … So that would need modifying for colour (which I think would perform a lot better) … I’ll try and get hold of the AffectNet dataset too … The MLP is actually just a nice, quick customisable class I picked up from Andrew Trask and co’s NALU paper … Become a go-to vanilla neural net for me