I have a model implemented in PyTorch that applies a final fully connected layer before the softmax. The architecture is designed for a 4-class Speech Emotion Recognition task: given an audio track, it converts it into a spectrogram and uses it to predict one of the emotions happiness, sadness, neutrality, and anger.
Rather than following the architecture described in the paper below, it adapts the Compact Convolutional Transformer implementation found on GitHub (Compact-Transformers/cct.py at main · SHI-Labs/Compact-Transformers).
To improve the performance of the model, I am following some of the tricks described in the paper [2104.07288] Speaker Attentive Speech Emotion Recognition. As in the paper, however, my model suffers from a "class collapse" problem: even with a balanced dataset, it predicts the anger and sadness classes well and the other two badly.
To solve this problem, the paper applies a particular weight regularization technique to the fully connected layer, described in Section 2.4. Unfortunately, I cannot figure out how to modify my fully connected layer in PyTorch to implement this kind of regularization.
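To show what I have tried so far: my current guess is that the regularization amounts to constraining each class weight vector of the final layer to unit L2 norm (so no class can dominate simply by growing its weight magnitude). This is only my interpretation of Section 2.4, not the authors' code; the module name `NormalizedLinear` and the feature size of 256 are my own choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedLinear(nn.Module):
    """Classification head whose class weight vectors are L2-normalized
    in the forward pass (my guess at the Sec. 2.4 regularization)."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        # one weight vector per class, no bias term
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # renormalize each class weight vector to unit L2 norm, so every
        # class contributes logits on the same scale
        w = F.normalize(self.weight, p=2, dim=1)
        return F.linear(x, w)

# hypothetical usage: 256-dim embeddings, 4 emotion classes
head = NormalizedLinear(256, 4)
logits = head(torch.randn(8, 256))
print(logits.shape)  # torch.Size([8, 4])
```

Is this the right way to read the paper, or does the regularization work differently (e.g. a penalty term in the loss rather than a hard constraint)?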
Can someone help me?