Into NLP 5 ~ Numerous Language Parts – POS Tagging
Last time we had a look at the task of text normalization as a way of simplifying matching and searching for certain words. I mentioned that the more complex normalization techniques need additional information: since nouns, verbs, and adjectives are inflected differently, it is extremely useful to identify which category you are dealing with before normalizing. This task is called part-of-speech (POS) tagging. POS tagging is one of the “OG” NLP tasks. Like tokenization, POS tagging can be found basically everywhere there is text to be analyzed.
What is a part-of-speech?
Languages usually have different types of words. For example:
– Nouns that point to certain objects, people, or concepts
– Verbs that describe actions
– Adjectives that give additional information about things like size, color, edibility, mental state, or market value (at least in English)
Additionally there are things like pronouns, articles, adverbs, and many more.
The goal of a POS tagger is to determine this type (the part-of-speech) for every token in a sentence. You can think of it as a multi-class classifier that labels each token in the context of the full sentence.
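To make the "classifier in context" idea a bit more concrete, here is a minimal sketch. It assumes a tiny hand-written lexicon and a single context rule, both made up for illustration. Real taggers are trained on annotated corpora and are far more capable, so treat this as a toy and not as how any particular library works.

```python
# Toy lexicon: each word maps to the tags it may take (hand-written, hypothetical).
LEXICON = {
    "the": ["ARTICLE"],
    "a": ["ARTICLE"],
    "big": ["ADJECTIVE"],
    "lions": ["NOUN"],
    "gazelle": ["NOUN"],
    "hunt": ["VERB", "NOUN"],  # ambiguous: needs context
    "it": ["PRONOUN"],
    "will": ["VERB"],
    "be": ["VERB"],
}


def tag(tokens):
    """Assign one tag per token, using the previous tag as context."""
    tags = []
    for token in tokens:
        # Unknown words default to NOUN in this toy setup.
        candidates = LEXICON.get(token.lower(), ["NOUN"])
        previous = tags[-1] if tags else None
        # Context rule: right after an article or adjective, prefer the noun reading.
        if previous in ("ARTICLE", "ADJECTIVE") and "NOUN" in candidates:
            tags.append("NOUN")
        else:
            tags.append(candidates[0])
    return list(zip(tokens, tags))


print(tag("The lions hunt the gazelle".split()))
print(tag("It will be a big hunt".split()))
```

The same token "hunt" comes out as a verb in the first sentence and as a noun in the second, purely because of the tag that precedes it.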
Um… Okay… But why?
Having the part of speech is useful because – say it with me now – language is messy. Two words can look identical but mean totally different things.
Take “The lions hunt the gazelle. It will be a big hunt.” The word “hunt” is used in two different ways: once as a verb and once as a noun. As mentioned, this knowledge is important if we want to normalize the text, but it can also be useful for search tasks:
We recently had a case where we needed to find instances of the word “(to) check”. However, the same document also contained several references to a “check button” (a button with the text “check” on it). A plain search for the word “check” would yield way too many false positives. Luckily, the two uses are noticeably distinct after applying a POS tagger, since the “check” of the “check button” will be identified as a noun rather than a verb.
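A minimal sketch of that kind of search could look like the following. The sentence and its tags are made up and written by hand to stand in for the output of a real POS tagger, and `find_word_with_tag` is a hypothetical helper, not part of any library.

```python
# Hand-tagged example sentence (assumed tagger output, not produced by a real tool).
tagged = [
    ("Check", "VERB"), ("the", "ARTICLE"), ("value", "NOUN"), (",", "PUNCT"),
    ("then", "ADVERB"), ("press", "VERB"), ("the", "ARTICLE"),
    ("check", "NOUN"), ("button", "NOUN"), (".", "PUNCT"),
]


def find_word_with_tag(tagged_tokens, word, wanted_tag):
    """Return the positions where `word` occurs with the given POS tag."""
    return [
        index
        for index, (token, tag) in enumerate(tagged_tokens)
        if token.lower() == word and tag == wanted_tag
    ]


# A plain string search matches both uses of "check" ...
plain_hits = [i for i, (token, _) in enumerate(tagged) if token.lower() == "check"]
# ... while the POS-aware search keeps only the verb.
verb_hits = find_word_with_tag(tagged, "check", "VERB")

print(plain_hits)  # [0, 7]
print(verb_hits)   # [0]
```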
A more fun example of this kind of ambiguity would be this:
At first glance the word “drunk” appears to be an adjective (in the sense of being intoxicated), but in this case it is meant as a verb (in the sense of a liquid being drunk by someone). Not a single POS tagger got it right, since it requires too much context to be understood correctly. As a result, the lemmatization also failed and mapped “drunk” to “drunk” rather than to “(to) drink”. So just from these examples you can see how important a POS tagger can be in resolving ambiguity.
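Why a wrong tag leads straight to a wrong lemma can be shown with a small sketch. The lookup table below is hypothetical and hand-written. Real lemmatizers use much larger dictionaries and rules, but many of them take the POS tag as an input in a similar way.

```python
# Hypothetical lemma table keyed by (word, coarse POS tag).
LEMMAS = {
    ("drunk", "VERB"): "drink",
    ("drunk", "ADJECTIVE"): "drunk",
    ("hunts", "VERB"): "hunt",
    ("gazelles", "NOUN"): "gazelle",
}


def lemmatize(word, tag):
    """Look up the lemma for a word given its POS tag; fall back to the word itself."""
    return LEMMAS.get((word.lower(), tag), word.lower())


print(lemmatize("drunk", "ADJECTIVE"))  # drunk
print(lemmatize("drunk", "VERB"))       # drink
```

If the tagger says "adjective", the lemmatizer has no reason to go looking for “drink”.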
Sentence Structures
Additionally POS tagging can be used to understand the structure of a sentence:
Take the POS order
“<Article> <Adjective> <Noun> <Verb> <Article> <Noun>”
Even without knowing the actual sentence, you can probably already infer quite a lot of information: In all likelihood the adjective describes the first noun, which is doing some activity to the second noun. As it turns out this assessment is in fact correct: The original sentence was “The big lion hunts the gazelles.”
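The reasoning above can be sketched as a simple pattern match over the tag sequence. The role names (`actor`, `action`, and so on) and the function `read_structure` are made up for this illustration, and the tags are written by hand instead of coming from a real tagger.

```python
# The tag sequence from the example, written as a list.
PATTERN = ["ARTICLE", "ADJECTIVE", "NOUN", "VERB", "ARTICLE", "NOUN"]


def read_structure(tagged_tokens):
    """If the tags match PATTERN exactly, name the role each word plays."""
    tags = [tag for _, tag in tagged_tokens]
    if tags != PATTERN:
        return None  # this sketch only understands one sentence shape
    words = [word for word, _ in tagged_tokens]
    return {
        "actor": words[2],
        "actor_property": words[1],
        "action": words[3],
        "target": words[5],
    }


sentence = [
    ("The", "ARTICLE"), ("big", "ADJECTIVE"), ("lion", "NOUN"),
    ("hunts", "VERB"), ("the", "ARTICLE"), ("gazelles", "NOUN"),
]
print(read_structure(sentence))
```

A single fixed pattern is of course very brittle. It is only meant to show how much the tag order alone already tells you.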
All these uses are the reason why POS tagging is a regular visitor in many pre-processing pipelines. Even some deep NLP approaches use POS tags as an input feature, either to help with disambiguating homographs (i.e. two words that are written the same way but have different meanings) like the “hunt”/“to hunt” example from earlier, or just to aid parsing long or complex sentences (like the one that I’m writing right now).
The POS-Tagset
There are many sets of POS tags. They differ from language to language, but also within a language, so it is useful to know the set you are dealing with. For English the most common is the Penn Treebank tagset, which distinguishes 36 different categories.
If we take our example from above we would get something like this:
“The<DT> big<JJ> lion<NN> hunts<VBZ> the<DT> gazelles<NNS>”
The idea is the same, except that we get even more information, since we now also know that the second noun (gazelles) is a plural form. Even with a POS-tag-only analysis we would know that there are multiple things that get something done to them.
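As a small sketch of how the finer-grained tags can be used, the snippet below spells out the five Penn Treebank tags from the example and picks out the plural nouns. The tagged sentence is copied by hand from above, and `plural_nouns` is a hypothetical helper.

```python
# Meaning of the Penn Treebank tags used in the example sentence.
PENN_TAGS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VBZ": "verb, 3rd person singular present",
}

tagged = [
    ("The", "DT"), ("big", "JJ"), ("lion", "NN"),
    ("hunts", "VBZ"), ("the", "DT"), ("gazelles", "NNS"),
]


def plural_nouns(tagged_tokens):
    """Return all words tagged as plural nouns (common: NNS, proper: NNPS)."""
    return [word for word, tag in tagged_tokens if tag in ("NNS", "NNPS")]


for word, tag in tagged:
    print(f"{word}: {PENN_TAGS[tag]}")

print(plural_nouns(tagged))  # ['gazelles']
```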
This should give you an idea of why these taggers are so useful, why they are everywhere and maybe even how you can use them in your searching and matching. Next time we will go even deeper and get an even better understanding of the structure of a sentence by looking at dependencies.
26/07/2021