PyTorch and facial expressions

Alex_Ge · October 1, 2018, 12:21pm

Sorry I haven’t replied back yet. Was busy trying to optimise what I have.
Some first empirical observations:

Transfer learning does work. However, it wasn’t clear to me how to apply it. Many blog posts and examples I’ve seen on the PyTorch forums say you have to freeze the conv layers. This simply didn’t work for me, most likely because all those nets have been trained on ImageNet. What does work, and reduces training time significantly, is taking a network pretrained on ImageNet and re-training all of it for facial expressions. I imagine a network pretrained on facial recognition would train even faster.
I got 72% top-1 with ResNet34. ResNet101 did not perform as well, and AlexNet and VGG11/19 were slightly worse. They were all trained multiple times with the same hyperparameters and the same dataset sizes. There seems to be a trade-off between how much data you train on and how much you are trying to squeeze out of it. Not sure exactly how that works, but I will be doing more work on this.
I cannot get decent enough top-1 accuracy when using emotion labels. I am using valence (a score of negative, neutral or positive emotion), which at the moment is 72% accurate. I’ve tested it with a webcam and it appears to work, although the prediction does fluctuate somewhat.
I’m going to try using face landmarks with a deep (not convolutional) network once I have the time to try it, because Benski’s results are very promising. I’m also thinking of trying other datasets since I’m under the impression AffectNet is very noisy or has wrong labels.

Thanks to all those who have helped so far, this topic here will be updated as I progress!

Alex_Ge · October 5, 2018, 5:04pm

Here I am, a few days later with some more results:

@Swift2046 I’ve tried replicating what you did:

used dlib to detect face, and then extract the 68 features using shape predictor
overlaid the lines connecting the shape features (jaw, eyes, nose, mouth) on a binary image
trained using the binary image of the extracted features, and the 8 labels as the output classes
fed all that to an AlexNet, accepting a 3 channel RGB (each channel has the same data)

The preprocessing I do is simply to paint white the facial features detected by Dlib, something like this:

(image: testfile_features)

Originally I was drawing lines between the dots, skipping the dots connecting the different groups, but dlib was returning coordinates outside the actual image, so I had to skip those, which made drawing lines much harder.
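
A minimal sketch of that preprocessing step, assuming the landmarks are already available as (x, y) pixel pairs (the dlib detection itself is left out). The function name `landmarks_to_mask` and the `radius` parameter are made up for illustration. Points that fall outside the image are skipped, as described above.

```python
import numpy as np


def landmarks_to_mask(points, height, width, radius=1):
    """Paint each in-bounds landmark white on a black single-channel image."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x, y in points:
        x, y = int(round(x)), int(round(y))
        # Skip coordinates that land outside the image.
        if not (0 <= x < width and 0 <= y < height):
            continue
        # Clip the small square around the point to the image borders.
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        mask[y0:y1, x0:x1] = 255
    return mask


def to_three_channels(mask):
    """Repeat the mask so a network expecting RGB input receives (3, H, W)."""
    return np.stack([mask, mask, mask], axis=0)
```

Because each channel holds the same data, the three-channel copy adds no information; it only matches the input shape a stock AlexNet expects.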

Using a pre-trained network appears to be a bad idea: I can’t get AlexNet (my benchmark at the moment) to converge. I tried to train it from scratch, but again the cross-entropy loss seems stuck, even with a small dataset of 5000 samples.

I’ve also tried using a simple MLP with 1 hidden layer, and the flattened input of the 68 coordinates as one single tensor normalised between 0 and 1. However, I get various errors when trying to train it, as I appear to mismatch the flattened array with what the network is expecting (i.e., I never actually got the script to run).
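
One way such an MLP could be wired up so the shapes line up is sketched below. This is not the script that failed; `LandmarkMLP`, `normalise_landmarks` and the hidden size of 128 are assumptions made for illustration. The key point is that 68 (x, y) points flatten to 136 input features, so the first linear layer must expect 136, not 68.

```python
import torch
import torch.nn as nn


def normalise_landmarks(points, width, height):
    """Scale pixel coordinates of shape (N, 68, 2) into the range [0, 1]."""
    scale = torch.tensor([width, height], dtype=points.dtype)
    # clamp handles landmarks reported slightly outside the image
    return (points / scale).clamp(0.0, 1.0)


class LandmarkMLP(nn.Module):
    def __init__(self, num_points=68, hidden=128, num_classes=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                       # (N, 68, 2) -> (N, 136)
            nn.Linear(num_points * 2, hidden),  # one hidden layer
            nn.ReLU(),
            nn.Linear(hidden, num_classes),     # raw logits for CrossEntropyLoss
        )

    def forward(self, x):
        return self.net(x)


# Dummy batch: 4 faces, 68 landmarks each, in a 224x224 image.
batch = torch.rand(4, 68, 2) * 224
logits = LandmarkMLP()(normalise_landmarks(batch, width=224, height=224))
print(logits.shape)  # torch.Size([4, 8])
```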

I was really hoping the feature detection thing would work…

justusschock · October 5, 2018, 5:18pm

I’ve done something similar. I assume you used the 68 point labeling scheme from ibug?

I used the same and my solution is somewhat inspired by classical Shape Models.

If you want to give it a try, you could have a look at my repo.

Edit: I also tried something similar with other networks and in these networks it was possible to train it on the facial keypoint-detection task, add one or two FC layers to the last but one layer and fine-tune it to a classification task by training only the appended layers and keeping the others fixed.

I got this working with emotion classification and you should be able to expand this approach to other expressions as well.

Alex_Ge · October 5, 2018, 5:51pm

@justusschock what kind of top-1 and top-5 accuracy did you manage to get, and what output labels did you use? Thanks for the info, I’ll have a look at your repo.
Just managed to get 67% on top-1 with AlexNet so I am guessing I’ve been doing something wrong.

justusschock · October 5, 2018, 5:55pm

I only had 5 labels (5 different emotions, and I tested it on infrared data), but I got an accuracy of about 85-90 percent (top-1, depending on the subsets in cross-validation and whether I kept the previous layers fixed or not).

Alex_Ge · October 5, 2018, 6:06pm

Wow, that is what I’ve been trying to do, using IR images but it just seems to fail so hard.
Thanks, I’ll take a look at your repo and see if I can use it.

justusschock · October 5, 2018, 6:07pm

Which dataset did you use for IR images?

Alex_Ge · October 5, 2018, 6:08pm

Our own collected data using a D415 camera, with IR projectors turned off (just ambient IR light).

Swift2046 · October 5, 2018, 6:31pm

Well the code I posted should work – just install face_recognition, and everything else is standard … No preprocessing necessary.

https://pypi.org/project/face_recognition/

I imagine the accuracy would be way up there with larger images, especially colour … And you definitely want to get the MLP working, as I don’t imagine coordinate data would be in any way meaningful to a pre-trained network.

When I’m having problems getting inputs to match outputs, I shove a print(x.size()) or print(np.shape(x)) in my forward function, where needed … and work out what I need to do … Or the error tells you … It was only a month or so ago that this kind of thing used to utterly perplex me.

Alex_Ge · October 5, 2018, 6:34pm

@Swift2046 hmmm, face_recognition seems to use dlib so I am guessing it does something similar. I’ll give it a try, and copy-paste your MLP, because what I’ve tried probably failed hard somewhere.

Alex_Ge · October 5, 2018, 6:55pm

Well, I just found out what one of the problems is: the AffectNet dataset is not balanced, and some classes are heavily over-represented compared to others. I did some digging on a subset of 200K+ images, and this is what I found:

('affectnet entries: ', 287651)
('neutral', ' is ', '0.26029')
('happy', ' is ', '0.46729')
('sad', ' is ', '0.08851')
('suprise', ' is ', '0.04898') 
('fear', ' is ', '0.02217') 
('disgust', ' is ', '0.01322')
('anger', ' is ', '0.0865')
('contempt', ' is ', '0.01304')

So I need to oversample the smaller classes, and even use a WeightedRandomSampler I’m guessing.
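
A minimal sketch of that idea, assuming the labels are available as a list of integer class indices. The helper name `make_balanced_sampler` is made up. Each sample is weighted by the inverse of its class frequency, so rare classes such as contempt or disgust are drawn about as often as happy.

```python
import torch
from torch.utils.data import WeightedRandomSampler


def make_balanced_sampler(labels, num_classes):
    """Build a sampler that draws every class with roughly equal probability."""
    labels = torch.as_tensor(labels, dtype=torch.long)
    counts = torch.bincount(labels, minlength=num_classes).float()
    # Inverse frequency; clamp avoids division by zero for absent classes.
    class_weights = 1.0 / counts.clamp(min=1)
    sample_weights = class_weights[labels]  # one weight per sample
    return WeightedRandomSampler(
        sample_weights, num_samples=len(labels), replacement=True
    )


# Toy example: class 0 is nine times more common than class 1.
toy_labels = [0] * 90 + [1] * 10
sampler = make_balanced_sampler(toy_labels, num_classes=2)
# Pass it to the loader in place of shuffle=True:
#   DataLoader(dataset, batch_size=64, sampler=sampler)
```

Sampling with replacement means minority-class images are repeated within an epoch, so augmentation is usually worth adding to keep the network from memorising them.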

Alex_Ge · October 19, 2018, 5:39pm

It seems to me that the AffectNet dataset has contradictory expression labels, aside from being unbalanced. Using extracted features, greyscale or RGB as input all produce similar top-1 accuracy. Using valence as output is slightly more accurate than expression labels, but again top-1 ranges from 65% to 70%.
I think I will try with the Kaggle dataset @Swift2046 mentioned and see if I can achieve similar accuracy. Sadly I can’t seem to find big enough Datasets to try with Facial Expressions. I’ve got CK+ but the labels make no sense to me.
If anyone knows of any other Facial Expression Dataset please do let me know.

Alex_Ge · October 24, 2018, 9:06pm

@Swift2046 I used a custom AlexNet (1 channel, up-sampling to 224x224) and got 0.6485 top-1 accuracy. I didn’t use features because, after looking at them, I think they remove a lot of information from the image. I can try with them just for comparison and future reference, but I doubt it will make much of a difference. That is on FER2013 from Kaggle.

Mona_Jalal · November 4, 2018, 7:46pm

@Alex_Ge I am working with the same dataset. I saw you had a non-finalized version of your code here. Can you please share your final copy of code with us?

Alex_Ge · November 8, 2018, 5:58pm

@Mona_Jalal I am afraid I don’t have anything to share. All I did was write a class which loads AffectNet into either CPU or GPU memory, and then just used template code from PyTorch and the stock models (ResNet, VGG, AlexNet, etc.).
Furthermore, and I am sorry to disappoint, but AffectNet has serious problems, such as excessive noise and contradictory labels. Using FER2013 you can get much better results. AffectNet is also heavily biased towards the happy class (it makes up about half the samples).
Trying with valence instead of expression labels you may be able to get better results.
If you still need the AffectNet loader, let me know and I’ll push it on GitHub.

Mona_Jalal · November 8, 2018, 8:36pm

Speaking of problems in AffectNet, I noticed there are cases where the image is a human face, but with, say, heavy makeup, and it is labelled as non-face. The facial landmarks are quite off too. It seems they are using an older version of OpenFace or some other tool for facial landmarks. It seems to me OpenFace 2.0 has more precise landmarks.
The other problem with AffectNet is that there are images that have size 0. My current plan is to remove those.
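
A small sketch of that clean-up, assuming "size 0" means zero-byte files on disk (images that decode to zero width or height would need a separate check). The function name `split_empty_files` is made up for illustration.

```python
import os


def split_empty_files(paths):
    """Separate usable image paths from missing or zero-byte ones."""
    keep, rejected = [], []
    for path in paths:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            keep.append(path)
        else:
            rejected.append(path)  # missing file or 0 bytes
    return keep, rejected
```

Returning the rejected paths as well makes it easy to log how many samples were dropped before training.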

Thanks for sharing your observations

Alex_Ge · November 13, 2018, 10:24am

@Mona_Jalal the biggest problem you will have is the contradicting labels. I don’t know how labelling and annotation was done, but it is not as good as in other data-sets. If you have the time or resources, I would crowdsource re-labelling of the dataset.