I’m trying to build a sentiment analysis model for music, with lyrics as the input.
I tried several models, but I can’t seem to get a good one. Here are some details about the problem:
1. The dataset (around 11k samples):
- 80% training, 20% validation
- I am using 4 lyrics as input for each song. The lyrics are the ones with the highest frequency in the song (I have chosen them without duplicates).
- I am preprocessing them (lowercasing, removing non-letters and unnecessary whitespace).
- I am trying to predict 2 numbers: arousal and valence. I can compute the sentiment from these.
- I also tried the classification problem (transforming the labels directly into sentiments) with 12 classes, but I still didn’t get good results.
- After preprocessing, I embed the words in every lyric. For example, if I have 50 words, after embedding I get a 50 x embedding_size tensor. The embeddings are also trained. I was thinking about using some pre-trained ones like word2vec or GloVe.
2. The architecture - I tried these approaches:
- only a GRU, then an output linear layer
- convolutional layer with an output linear layer
- prenet (a 2-layer NN with a bottleneck, sizes 256 and 128), then a CNN and a linear layer
- CNN, GRU, linear layer
- module with a bank of convolutions of different kernel sizes (2, 3, 4) followed by a GRU and a linear layer
- a few more with similar results
As a note, I mainly use ReLU as the activation and tanh in the final layer. After applying tanh I multiply the result by 3 to get numbers between -3 and 3 (valence and arousal lie in this range).
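For readers following along, the output scaling and loss described here can be sketched roughly as below. This is only an illustration in plain Python, not the actual model code: the names `scale_output` and `mse` and all the numbers are made up for the example.

```python
import math

def scale_output(raw, limit=3.0):
    """Squash an unbounded network output into the range -limit..limit via tanh."""
    return limit * math.tanh(raw)

def mse(predictions, targets):
    """Mean squared error over all values in the (valence, arousal) pairs."""
    total = 0.0
    count = 0
    for pred, target in zip(predictions, targets):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count

# Hypothetical pre-activation outputs of the final linear layer.
raw_outputs = [(0.2, -1.5), (4.0, 0.0)]
# Hypothetical (valence, arousal) labels in the -3..3 range.
targets = [(1.0, -2.0), (2.5, 0.5)]

predictions = [tuple(scale_output(r) for r in pair) for pair in raw_outputs]
print(predictions, mse(predictions, targets))
```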
I use MSE as the loss function. I noticed that the loss goes to around 1 (maybe 0.6 in the best cases) and then stays there. The accuracy can’t get better than 20%, and it usually drops to 5-10% after some iterations. The train loss and validation loss are usually decreasing.
I am training with mini-batches. I used SGD or Adam as the optimizer. I tried a lot of learning rates, but no value helped or gave better results.
I am wondering if my architecture is bad or if the problem is too hard for the setup I proposed. I used architectures that I found in papers I read.
I need some advice because I don’t know where to go from here. Thank you so much for reading this, I know it’s pretty long.
Sorry that I don’t have a proper answer, only more of an opinion piece on sentiment analysis.
Traditional sentiment analysis is often just about the polarity/valence, e.g., three classes: positive, neutral, negative. And even then, the results of state-of-the-art solutions might still not be much higher than 60% for some datasets. So having 12 classes makes it much more challenging, particularly given that you treat each class as independent, which they are not here.
Without seeing the data, I would assume that lyrics often use flowery/poetic language, a lot of subtext, stylistic devices, etc., where the correct sentiment is not necessarily explicit in the text but rather inferred by the listener. For example, in “This morning I was happy. Then I woke up.” we kind of know that the singer was only dreaming about being happy. A sentiment classifier would probably say that is positive, though. This is why a song about stalking has become one of the most popular wedding songs (The Police - Every Breath You Take).
Sentiment is highly subjective. Given a text, different people often assign different polarities based on their interpretation. For example, “I wish I would be with you” is positive for some (singer is in love) and negative for others (singer hates to be separated), and both are right from their point of view. When you introduce arousal, things only get more complicated.
Where does the dataset come from? Who annotated it? Multiple people, and if so, what was the interannotator agreement? These are all important questions for sentiment analysis.
Be careful with pretrained word embeddings. For example, in Word2Vec two words are similar if they are used in the same context. But this is often true for antonyms. For example, both “ugly” and “beautiful” are commonly used to describe appearances, so they are kind of similar w.r.t. Word2Vec but obviously have different polarities.
You don’t mention it, but in case you do remove stopwords, don’t remove “not”, “n’t” and the like. They are pretty important for sentiment analysis :).
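To make the negation point concrete, here is a minimal sketch of a preprocessing step that lowercases and strips non-letters but keeps apostrophes and negation words. The stopword list, the `NEGATIONS` set and the function name `preprocess` are all made up for illustration; your real pipeline will differ.

```python
import re

# Hypothetical stopword list for illustration; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "is", "are", "i", "you", "to", "of", "not", "no"}
# Negation words that should survive stopword removal.
NEGATIONS = {"not", "no", "never", "nor"}

def preprocess(line, remove_stopwords=True):
    """Lowercase, keep letters and apostrophes, and never drop negations."""
    text = line.lower().replace("’", "'")
    # Keep apostrophes so contractions such as "don't" stay intact.
    text = re.sub(r"[^a-z']+", " ", text)
    tokens = [tok.strip("'") for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    kept = []
    for tok in tokens:
        is_negation = tok in NEGATIONS or tok.endswith("n't")
        if remove_stopwords and tok in STOPWORDS and not is_negation:
            continue
        kept.append(tok)
    return kept

print(preprocess("I am NOT happy, and you don’t care!"))
# → ['am', 'not', 'happy', "don't", 'care']
```

Note that a plain "remove all non-letters" step would turn “don’t” into “don t” or “dont”, which hides the negation from the model.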
Given that you want to do regression (or 12 classes), 11k samples might not be that much. Can you at least overfit your networks, maybe even with just 100 samples? That is, can you get your training accuracy to almost 1 for a small dataset? That is the first and easiest test to see if the network is learning anything meaningful.
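A rough sketch of that sanity check is below. I use a tiny bag-of-words logistic regression in plain Python as a stand-in, since I don't know your framework; with your GRU/CNN the idea is the same: train on a handful of samples and check that the training accuracy reaches 1. The toy texts, labels and function names are all invented for the example.

```python
import math

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            counts[vocab[word]] += 1.0
    return counts

def train_logreg(features, labels, epochs=500, lr=0.5):
    """Per-sample gradient descent on the logistic loss, no regularization."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def accuracy(weights, bias, features, labels):
    """Fraction of samples whose predicted class matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        z = bias + sum(w * v for w, v in zip(weights, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)

# Tiny hypothetical subset: 1 = positive, 0 = negative.
texts = ["sunshine smile dancing", "tears alone goodbye",
         "laughing bright summer", "cold empty broken"]
labels = [1, 0, 1, 0]

vocab = {}
for text in texts:
    for word in text.split():
        vocab.setdefault(word, len(vocab))

features = [featurize(t, vocab) for t in texts]
weights, bias = train_logreg(features, labels)
train_acc = accuracy(weights, bias, features, labels)
print("training accuracy on the tiny subset:", train_acc)
```

If a model cannot memorize even such a small subset, the issue is more likely in the data pipeline, the loss, or the training loop than in the choice of architecture.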
Have you tried more traditional ML approaches for regression or classification? For text (and smaller datasets), an SVM might perform much better than a neural network. And you can better interpret any errors the classifier is making.
What does “I am using as input 4 lyrics for each song” mean? Do you mean 4 lines of each song?
Some things that I would do to see what’s going on:
Double-check that the preprocessing does not remove anything important for sentiment analysis (particularly anything related to negation).
Simplify the annotations to a 3-class problem and check if the results make sense.
Try to overfit the model(s) with a small dataset to see if the training accuracy goes to 1.
In case there are annotations from multiple users, only use those samples where all or most annotators agree, to avoid highly subjective/controversial lyrics.
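As a rough sketch of the label simplification and the agreement filter: the code below maps valence scores to three polarity classes and keeps only songs where most annotators agree. The neutral band of 0.5, the agreement threshold of 0.66, and the per-annotator data layout are all assumptions on my part, since I don't know how your dataset is structured.

```python
from collections import Counter

def valence_to_polarity(valence, neutral_band=0.5):
    """Map a valence score in [-3, 3] to negative / neutral / positive.

    The neutral_band width is an assumed value, not taken from the dataset.
    """
    if valence > neutral_band:
        return "positive"
    if valence < -neutral_band:
        return "negative"
    return "neutral"

def agreed_label(annotations, min_agreement=0.66):
    """Return the majority label if enough annotators agree, else None."""
    label, votes = Counter(annotations).most_common(1)[0]
    if votes / len(annotations) >= min_agreement:
        return label
    return None

# Hypothetical per-annotator valence scores for three songs.
songs = {
    "song_a": [2.1, 1.4, 2.8],
    "song_b": [-1.0, 1.5, 0.1],
    "song_c": [-2.0, -1.2, 0.2],
}

filtered = {}
for name, scores in songs.items():
    label = agreed_label([valence_to_polarity(s) for s in scores])
    if label is not None:
        filtered[name] = label

print(filtered)
# → {'song_a': 'positive', 'song_c': 'negative'}
```

Here `song_b` is dropped because its three annotators disagree completely, which is exactly the kind of controversial sample that tends to add noise.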
Well, I hope there was something useful in my rambling.
Thanks a lot for this answer. It was very precise and helpful. I finally reduced the problem to a 4-class problem and managed to get better performance. Sorry for the late reply.