I want to train a Named Entity Recognition (NER) model, specifically a neural network. I have two kinds of training samples. The first kind is self-contained sentences, e.g., "Apple is looking at buying U.K. startup for $1 billion". This example is taken from spaCy (https://spacy.io/usage/linguistic-features#named-entities). The second kind consists of fragments that are not self-contained sentences and lack proper structure, e.g., "Musicians may enjoy Italy a scenic city that". Some samples may be even shorter, like "by a glacier".
My dataset contains both kinds of training samples, shuffled. Are the unstructured sentences a problem? And if so, why?
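For concreteness, here is a sketch of what both kinds of samples look like in my dataset, using a spaCy-style annotation format (text plus character-offset entity spans); the small helper that computes the offsets is just for illustration:

```python
# Sketch of the training data: both sample kinds share the same format,
# a text plus (start, end, label) character spans, as in spaCy's docs.
# Offsets are computed by a helper so they stay consistent with the text.

def spans(text, *entities):
    """Build (start, end, label) spans by locating each entity in the text."""
    result = []
    for phrase, label in entities:
        start = text.find(phrase)
        result.append((start, start + len(phrase), label))
    return result

samples = [
    # Self-contained sentence (the spaCy docs example)
    ("Apple is looking at buying U.K. startup for $1 billion",
     spans("Apple is looking at buying U.K. startup for $1 billion",
           ("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY"))),
    # Fragment without proper sentence structure
    ("Musicians may enjoy Italy a scenic city that",
     spans("Musicians may enjoy Italy a scenic city that", ("Italy", "GPE"))),
    # Even shorter fragment, no entities at all
    ("by a glacier", []),
]

for text, ents in samples:
    print(text, "->", ents)
```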
There is no problem in principle. The network doesn't "know" what a proper sentence is. It only sees a sequence of words (well, word vectors), no matter whether it's an RNN, a CNN, or anything fancier. For example, we trained our NER model to extract named entities from keyword-based search queries – please note that this was only a proof of concept based on synthetic data in the context of this paper.
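To make that concrete, here's a minimal sketch: regardless of the architecture, both sample types reach the network as the same kind of object, a sequence of token indices (the tokenizer and vocabulary here are deliberately simplified):

```python
# Minimal sketch: to the network, a "proper" sentence and a fragment are
# indistinguishable in kind -- both are just sequences of token indices.

def build_vocab(texts):
    vocab = {"<unk>": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    return [vocab.get(tok, 0) for tok in text.lower().split()]

texts = [
    "Apple is looking at buying U.K. startup for $1 billion",  # full sentence
    "Musicians may enjoy Italy a scenic city that",            # fragment
    "by a glacier",                                            # even shorter
]
vocab = build_vocab(texts)
for text in texts:
    print(encode(text, vocab))  # same kind of representation either way
```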
It's never obvious what the network will learn. For example, it might learn that anything preceded by "a" or "an" is probably not a named entity. Or maybe the network just learns named entities such as "Italy" by heart.
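These two degenerate "strategies" – a context cue and pure memorization – can be caricatured in a few lines; the gazetteer below is a toy list purely for illustration:

```python
# Two degenerate "strategies" a network might end up learning (illustrative):
# (1) context cue: a token right after "a"/"an" is probably not an entity;
# (2) memorization: a fixed gazetteer of names seen during training.

GAZETTEER = {"Italy", "Apple", "U.K."}  # toy list, just for illustration

def predict(tokens):
    labels = []
    for i, tok in enumerate(tokens):
        after_article = i > 0 and tokens[i - 1].lower() in {"a", "an"}
        is_known = tok in GAZETTEER
        labels.append("ENT" if is_known and not after_article else "O")
    return labels

print(predict("Musicians may enjoy Italy a scenic city that".split()))
print(predict("by a glacier".split()))
```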
It also depends on the context. For example, I've noticed that spaCy's NER is often very sensitive to capitalization, so I can only use it for formal text but not for, e.g., social media posts. In another work, we looked into NER with a focus on points of interest (e.g., names of shops, restaurants, hotels, bars, parks, etc.). The problem is that many names are not as "obvious" as, say, Italy. Consider the tweet "freshly baked makes the best damn cookies". We can generally infer that "freshly baked" is some kind of shop or restaurant, but that's much harder for a machine to learn.
I would simply give it a shot: train an NER model and do a standard error analysis, i.e., try to identify the most common cases where the classification fails, and see if or how you can address those cases.
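A minimal version of that error analysis might look like this – bucket the token-level mistakes by (gold, predicted) pair and inspect the most frequent confusions. The gold and predicted sequences below are made-up placeholders:

```python
from collections import Counter

# Sketch of a standard error analysis: count token-level mistakes by
# (gold, predicted) label pair to find the most common failure cases.
# The gold/predicted sequences are made-up placeholders.

gold = ["B-ORG", "O", "O", "B-GPE", "O", "O",     "B-GPE", "O"]
pred = ["B-ORG", "O", "O", "O",     "O", "B-GPE", "O",     "O"]

errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
for (g, p), count in errors.most_common():
    print(f"gold={g!r} predicted={p!r}: {count}x")
```

On real data you would also keep the surrounding text for each error bucket, so you can read through examples and spot patterns (e.g., all-lowercase names, fragments, rare entities).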
Thank you very much; your answer clarified a lot.
So, the structure of the training dataset "defines" what the model learns. My assumption is: if the model is trained on a dataset that contains proper sentences, it will learn the structure and identify entities based on it. In the other case, where there is no structure, the model has to have already learned the entity by heart in order to identify it, because it can only rely on the word it sees and not on the structure.
In my experiments I will try to provide both kinds of samples, with the majority consisting of structured samples. I will inform you as soon as I have the error analysis.
Thank you once again.
I definitely would be interested in the results.
You might also want to try stemming and/or lemmatization. From a human perspective, it destroys proper sentences, of course. But again, the model might not (have to) care when identifying named entities, and it certainly benefits greatly from the reduced vocabulary.
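As a toy illustration of that vocabulary reduction – in a real setup you would use, e.g., NLTK's PorterStemmer or spaCy's lemmatizer; the crude suffix stripper below just keeps the sketch self-contained:

```python
# Toy illustration of how stemming shrinks the vocabulary. A real setup
# would use e.g. NLTK's PorterStemmer or spaCy's lemmatizer; this crude
# suffix stripper exists only to keep the sketch self-contained.

def crude_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = ["bake", "bakes", "baked", "baking", "cookie", "cookies"]
raw_vocab = set(tokens)
stemmed_vocab = {crude_stem(t) for t in tokens}
print(len(raw_vocab), "->", len(stemmed_vocab))
```

Even this crude version collapses several surface forms onto one stem, which is exactly the effect that helps a model with a small training set.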
The model can only find patterns between the input and the labels. Whether that also means the model "understands language", or at least to what degree, is highly debatable. I recommend this fiery blog post :).
In short, I wouldn't over-emphasize the importance of proper sentences.
Thank you very much once again. Following this post, I read numerous articles, posts, blogs, etc. I did not fully understand everything, and I think that would be impossible. But now I understand what I do not understand, as well as the gaps in my process and in the way I think about solving a problem.