Label Unlabeled Text Data

Hi, what is the best way to label unlabeled text data? The data consists of Facebook comments on certain posts.

The data structure is: post, comment

I need to label the comments as positive, negative, or neutral.

One way I thought of is to use the emojis in the comments. For example, comments containing heart emojis or something similar would be labeled positive.

Is there any other way to label this kind of data?

I have looked into a few approaches, but which one has worked best in your experience?

It seems that you need automatic annotation.

  • The easiest way is to use a lexicon-based sentiment analyzer. However, it was not very effective when I tried it. Lexicon-based approaches use a sentiment dictionary that assigns a score to each term, for example +1 for the term ‘good’ and -1 for the term ‘bad’.
  • I think annotating using heart emojis or something similar is a good idea. It may work well enough. However, deciding on neutrals may be problematic if you need them. You cannot be sure about posts without emojis, as anyone can express positive or negative feelings without using emojis. You might consider experimenting with only the positive and negative classes. Many studies use only these two sentiment types.
  • If you have the chance to use a machine learning model trained on a lot of data, it may also be a good solution. Note that the Sentiment140 dataset, which contains 1.6 million tweets, was itself labeled automatically, using the emoticons in the tweets as noisy labels.
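To illustrate the lexicon-based approach from the first bullet, here is a minimal sketch. The word list is a tiny hypothetical English one made up for the example; a real lexicon would have to be built or obtained for the target language.

```python
# Hypothetical toy lexicon: +1 for positive terms, -1 for negative terms.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "hate": -1}


def lexicon_label(text, lexicon=LEXICON):
    """Sum the scores of known words and map the total to a label."""
    tokens = text.lower().split()
    # Strip basic punctuation so that "good!" still matches "good".
    score = sum(lexicon.get(tok.strip(".,!?"), 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_label("The food was good!"))
print(lexicon_label("bad, awful service"))
print(lexicon_label("I went there yesterday"))
```

A scorer like this ignores negation ("not good" still scores +1), which is one reason such analyzers can underperform.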

If there were a superior automated method to label the data that didn’t involve machine learning, why not just use that method in your application?

My point is that the best route for labeling a new dataset is either using a pretrained model for sentiment analysis or hiring people to do it. In the first case, you may as well just deploy that model for your application.

The one exception to the above would be if you had obtained 2 or more distinct pretrained sentiment analysis models. You could run your dataset through both (or more) and set aside the samples where they disagree for manual labeling.
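The disagreement idea can be sketched as follows. The two "models" here are trivial stand-in functions invented for the example; in practice each would wrap a real pretrained classifier that returns a label for a text.

```python
def split_by_agreement(texts, models):
    """Return (agreed, disputed): auto-labeled pairs and texts needing manual review."""
    agreed, disputed = [], []
    for text in texts:
        predictions = [model(text) for model in models]
        if len(set(predictions)) == 1:
            # Every model gave the same label, so accept it automatically.
            agreed.append((text, predictions[0]))
        else:
            # At least two models disagree: set the sample aside for a human.
            disputed.append(text)
    return agreed, disputed


# Hypothetical stand-ins for two distinct sentiment models.
def model_a(text):
    return "positive" if "good" in text else "negative"


def model_b(text):
    return "positive" if ("good" in text or "fine" in text) else "negative"


agreed, disputed = split_by_agreement(
    ["good movie", "fine I guess", "terrible"], [model_a, model_b]
)
print(agreed)
print(disputed)
```

Agreement between models is not proof of correctness (they can share the same blind spots), so spot-checking part of the agreed set is still worthwhile.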

Hi, thank you for your reply.

The thing is that my data is not in English, so a lexicon-based analyzer won’t work. The third option, using another model to label my data, won’t work either, because there is no trained sentiment analysis model or dataset in my language.

So I think I’m going to go with the second option. I will probably use not only emojis but some words as well. With this method I would get the positive and negative classes, and the rest of the data would be my neutral class.

I’m not sure if this is the right way.

I was wondering whether anyone has ever used clustering, LDA, and so on.

One more way might be to label only the comments whose labels you are certain of and then use semi-supervised learning.
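The semi-supervised idea can be sketched as a self-training loop: train on the confidently labeled seed set, pseudo-label the unlabeled comments the model is sure about, and repeat. The word-count classifier below is a toy stand-in chosen to keep the example self-contained (an assumption, not a recommendation); any model that outputs a confidence could take its place.

```python
from collections import Counter


def train(labeled):
    """Count how often each word appears under each label (toy classifier)."""
    counts = {"positive": Counter(), "negative": Counter()}
    for text, label in labeled:
        counts[label].update(text.lower().split())
    return counts


def predict(counts, text):
    """Return (label, confidence); confidence is the winning label's share of evidence."""
    words = text.lower().split()
    pos = sum(counts["positive"][w] for w in words)
    neg = sum(counts["negative"][w] for w in words)
    total = pos + neg
    if total == 0:
        return None, 0.0  # no known words: the model has no opinion
    label = "positive" if pos >= neg else "negative"
    return label, max(pos, neg) / total


def self_train(seed, unlabeled, threshold=0.8, max_rounds=5):
    """Iteratively move confidently predicted samples into the labeled set."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_rounds):
        counts = train(labeled)
        confident, remaining = [], []
        for text in pool:
            label, conf = predict(counts, text)
            if label is not None and conf >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        pool = remaining
    return labeled, pool


seed = [("love this", "positive"), ("hate this", "negative")]
unlabeled = ["love it so much", "hate waiting", "so much fun", "this"]
labeled, still_unlabeled = self_train(seed, unlabeled)
print(labeled)
print(still_unlabeled)
```

Pseudo-labels can reinforce early mistakes, so the threshold and the quality of the seed labels matter a lot.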

Hi,

Good point. Maybe they use such automated annotation on image data or something. I’m not aware of how those tools work, but there are also some tools that @akuysal mentioned. I don’t know, I’m a bit confused. Maybe the best way is to analyze the data and write some functions based on it to automate the “manual” labeling.

It depends on what the purpose of the labeling is. If you want to label the data to train a machine learning model, which I assume, then strictly speaking you have to have the data labeled by human annotators. And since it’s about sentiment analysis, which is arguably rather subjective, you would need to get each data sample annotated by, say, 5 annotators and evaluate the inter-annotator agreement. Of course, this is very time-consuming and potentially very expensive. But again, strictly speaking it is the only proper approach.
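One common way to measure inter-annotator agreement when every sample has the same number of annotators is Fleiss' kappa. The post does not name a specific metric, so that choice is an assumption here. A minimal sketch:

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-sample label lists (same length each)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every sample needs the same number (>= 2) of annotators")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this sample.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    # Agreement expected by chance, from the overall label proportions.
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one label was ever used, so agreement is trivially perfect
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations: 3 samples, 5 annotators each.
ratings = [
    ["positive"] * 5,
    ["negative", "negative", "negative", "negative", "neutral"],
    ["positive", "neutral", "neutral", "negative", "neutral"],
]
print(round(fleiss_kappa(ratings), 3))
```

Values near 1 indicate strong agreement, while values near 0 mean the annotators agree about as often as chance would predict.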

There are some approaches that try automatic labeling as you described, particularly using emojis and emoticons to identify the label. However, note that you then have to remove these emojis, emoticons, etc. from your sample text! For example, say you have a post “The movie was funny :).”; then you could generate a (text, label) sample like (“The movie was funny.”, “positive”).
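The label-then-strip step might look like this. The marker lists are short hypothetical examples, and skipping comments that have conflicting markers or no text left after stripping is a design choice made for this sketch, not something prescribed above.

```python
import re

# Hypothetical marker lists; extend them for the actual data.
POSITIVE_MARKERS = ["😍", "👍", ":-)", ":)"]
NEGATIVE_MARKERS = ["😡", "👎", ":-(", ":("]


def label_and_strip(text):
    """Return (cleaned_text, label), or None if the comment cannot be used."""
    has_pos = any(m in text for m in POSITIVE_MARKERS)
    has_neg = any(m in text for m in NEGATIVE_MARKERS)
    if has_pos == has_neg:
        return None  # no marker at all, or conflicting markers

    # Remove the markers so the label information does not leak into the text.
    for marker in POSITIVE_MARKERS + NEGATIVE_MARKERS:
        text = text.replace(marker, "")
    cleaned = re.sub(r"\s+", " ", text).strip()
    cleaned = re.sub(r"\s+([.,!?])", r"\1", cleaned)  # tidy " ." into "."

    if not cleaned:
        return None  # marker-only comment: nothing left to train on
    return cleaned, ("positive" if has_pos else "negative")


print(label_and_strip("The movie was funny :)."))
print(label_and_strip("😍😍"))
print(label_and_strip("I just booked my flight"))
```

The first call reproduces the example from the post: the emoticon decides the label and is then removed from the text.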

But in any case, all these automated approaches rely on certain assumptions that may or may not sufficiently hold in practice. The biggest problem is that you potentially have to omit all samples that do not contain any emojis, emoticons, etc.

Hi.

  • Yes, that definitely is the proper approach. But since I am in a competition, I don’t think they would expect that from an individual competitor. I don’t know.

  • Well, that is new to me. So you are saying I should clean the emojis etc. out of the text. Okay, so I need to label positive and negative comments with the help of emojis and some keywords (happy, good, congrats, etc.) and then remove the emojis from the text. That I did not know. There are also emoji-only comments in the data. Does that mean that comments containing only emojis are useless? Should I not use them for training?

  • By automated annotation I meant clustering the data or something similar. I was wondering whether anyone had tried that and whether it worked well.

  • Finally, I think I’m going to go with the idea of using emojis and keywords to label the data, that is, “manual” labeling automated with some hand-written functions. I might also use semi-supervised learning and label only the samples I’m certain about rather than every sample in the dataset.

  • Regarding the neutral class, I think I’m just going to label the rest of the data as neutral once I have labeled the positive and negative comments.

You can tokenize emojis for analysis.
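A minimal sketch of emoji-aware tokenization with a regular expression. The code-point ranges are a rough approximation chosen for this example and do not cover every emoji.

```python
import re

# Rough emoji ranges (an assumption, not a complete list of emoji code points).
EMOJI_PATTERN = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"

# Each emoji becomes its own token; words and punctuation marks are split as usual.
TOKEN_RE = re.compile(EMOJI_PATTERN + r"|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, emoji, and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("great👍👍 movie!"))
```

Because `\w` is Unicode-aware in Python 3, words in non-Latin scripts are kept as word tokens as well.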

One of the dangers of labeling based on emojis is that it does not account for subtlety or sarcasm. People often leave positive emojis at the end of sarcastic comments, which are actually negative but require a proper reading to assess the nature of the comment.

Yes, you shouldn’t use the information you used for the labeling within your training. That’s kind of cheating through information leakage.

That’s the whole problem: with all such automated methods you make some assumptions that might or might not hold (see also @J_Johnson’s comment about sarcasm and other stylistic devices).

In other words, any automated approach will either have to ignore some data (e.g., the samples without emojis) or will make errors w.r.t. a ground truth, that is, a manually labeled dataset. Having some errors would not even be a problem. The real problem is that you have no way to quantify how large your errors are.
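If even a small manually labeled sample is available, the error of the automatic labels can at least be estimated on that sample. A minimal sketch, with hypothetical function and variable names:

```python
from collections import Counter


def error_report(auto_labels, manual_labels):
    """Return (error_rate, confusion), where confusion counts (manual, auto) pairs."""
    if not manual_labels or len(auto_labels) != len(manual_labels):
        raise ValueError("need two non-empty label lists of equal length")
    confusion = Counter(zip(manual_labels, auto_labels))
    errors = sum(n for (gold, auto), n in confusion.items() if gold != auto)
    return errors / len(manual_labels), confusion


# Hypothetical sample: automatic labels vs. labels assigned by a human.
auto = ["positive", "neutral", "negative", "neutral"]
manual = ["positive", "positive", "negative", "neutral"]
rate, confusion = error_report(auto, manual)
print(rate)
print(confusion[("positive", "neutral")])
```

The estimate is only as reliable as the manual sample is large and representative, so it gives a rough indication rather than a guarantee.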

Again, that’s an arbitrarily vague assumption. Consider the sentence: “I just booked my flight…I can hardly wait”. That’s arguably a positive comment, but there’s no emoji, emoticon, or obviously positive word. So you would likely label this as “neutral”, while most human annotators would label it as “positive”.

If it were that easy to automatically label a sentiment classification dataset, sentiment analysis as a whole would be a solved problem.

Of course, annotating manually is the optimal option, no doubt. I was just looking for the best of the bad options. Thank you for the replies, guys. I think I’m still going to annotate some comments automatically using emojis, and the rest will be annotated manually.

Thanks.

If you want the best mix of automation and accuracy, like I said before, use multiple pretrained sentiment analysis models to label your data, and then tag any samples where they disagree for manual review. At least 3 models would be ideal. You can find pretrained models on Hugging Face:

Thanks, but my data is not in English. It’s in Georgian. That’s why I want a different approach.