Last week we were talking about the basics of NLP (Natural Language Processing). Just as a recap:
NLP is concerned with how computers understand and interpret human language. It is a type of AI (artificial intelligence), and more specifically, a type of machine learning. NLP is challenging because human language is imprecise and sometimes ambiguous – it’s probably the most unstructured of unstructured data. Machines have to understand the context of the words to give them valuable meaning. In a few words, it works like this:
NLP relies on several tasks to do different things. We already talked about lemmatization, stemming, and word sense disambiguation, among others. In this post, we'll focus on more complicated tasks that require a deeper natural language understanding from computers.
Natural language understanding (NLU)
This is the Holy Grail of NLP, along with NLG (Natural Language Generation). NLU is when the computer understands you, and NLG is when it talks back. NLU is one of the most complex and crucial NLP tasks. It's even labeled an AI-hard problem, meaning that if we were to achieve complete natural language understanding in machines, we'd be creating entities as language-smart as humans – or at least as talkative.
Machines that use NLU vary in complexity, and this complexity is described in terms of breadth and depth. Breadth refers to the size of a machine's vocabulary and grammar. Depth refers to how close its understanding is to that of a fluent native speaker.
Think of a system that knows a few facts about everything: that would be a shallow but broad system. There are also systems that know A LOT about a single topic: those are narrow but deep. For now, we don't have any system that is both broad and deep.
To understand the meaning of a text, it’s necessary to understand the relations between the different concepts, at least at the most basic level. Check this example:
Once the machine extracts the relations, it sorts them into semantic categories. In the chart you just read, that would be the "relation" tag.
Read an in-depth explanation of relationship extraction here.
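To make the idea concrete, here is a minimal sketch of pattern-based relation extraction. The patterns and relation labels below are hypothetical examples, not from any real system; production extractors learn patterns from data or use syntactic parsers, but the output shape – (subject, relation, object) triples – is the same.

```python
import re

# A hand-written table mapping surface patterns to semantic relation
# labels. Real systems learn these patterns; this is just an illustration.
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) was born in ([\w ]+)"), "place_of_birth"),
    (re.compile(r"(\w[\w ]*?) is the capital of ([\w ]+)"), "capital_of"),
    (re.compile(r"(\w[\w ]*?) works for ([\w ]+)"), "employed_by"),
]

def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    triples = []
    for sentence in text.split("."):
        for pattern, relation in PATTERNS:
            for match in pattern.finditer(sentence.strip()):
                triples.append((match.group(1), relation, match.group(2)))
    return triples

print(extract_relations("Mark Zuckerberg was born in White Plains. Paris is the capital of France."))
```

Each extracted triple carries its "relation" tag, which is exactly the semantic category mentioned above.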
Natural language generation (NLG)
The name says it all! Using databases, algorithms – and all of the tools we already discussed in the last post – NLG generates human language.
As simple as that reads, it's also very complex. Just picture yourself trying to write an important email, or think about a time you had to give complicated news to somebody. It'd probably take you a long time to put whatever you're thinking into words, either written or spoken. Well, Natural Language Generation tries to do precisely that: turn ideas into text.
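The simplest form of turning structured ideas into text is template filling. This sketch is only an illustration (the facts and template are hand-picked for this example); modern NLG systems use statistical or neural models, but the goal – data in, text out – is the same.

```python
# Template-based generation: structured facts (a dict) are rendered
# into a natural-language sentence by filling slots in a template.
def describe_person(facts):
    return "{name} was born on {birth_date} and founded {company}.".format(**facts)

facts = {"name": "Mark Zuckerberg", "birth_date": "May 14, 1984", "company": "Facebook"}
print(describe_person(facts))
# -> Mark Zuckerberg was born on May 14, 1984 and founded Facebook.
```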
Question answering
You ask a question, and the computer has to answer. The thing is, some questions are easier to answer than others, and it also depends on how the question is phrased. For example, it's easy to answer the question of when Mark Zuckerberg was born. It probably wouldn't be as easy to answer a question about his beliefs on the value of Facebook in society.
Just like you or me, a machine needs information to answer questions, so it usually has a knowledge base, or it takes answers from other types of unstructured data like essays, news reports, and so on. Of course, the answers will depend on the information the machine has. To return to a human example: if you ask a person in Latin America about their favorite football team, they'd talk about soccer. Ask the same question of a person from the US, and they might mention an NFL team. The answers depend on the context – or, in the case of machines, on their knowledge base.
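Here is a toy sketch of that dependence on a knowledge base. The keyword keys and facts below are hand-picked assumptions for illustration (real systems parse the question rather than matching keywords), but the point from the paragraph above holds: the machine can only answer what its knowledge base covers.

```python
# A toy question-answering system: answers come straight from a
# knowledge base keyed by hand-picked keywords, not real parsing.
KNOWLEDGE_BASE = {
    ("mark zuckerberg", "born"): "May 14, 1984",
    ("facebook", "founded"): "February 2004",
}

def answer(question):
    q = question.lower()
    for keywords, fact in KNOWLEDGE_BASE.items():
        if all(keyword in q for keyword in keywords):
            return fact
    return "I don't know."  # the question falls outside the knowledge base

print(answer("When was Mark Zuckerberg born?"))      # -> May 14, 1984
print(answer("What does he think about Facebook?"))  # -> I don't know.
```

Note how the factual question gets an answer while the opinion question does not – exactly the easy-versus-hard split described above.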
Sentiment analysis
The name says it all. It can be applied to a single sentence or a whole document. And you know what it's great for? Knowing how customers feel. That's why it's widely used in marketing and social media (which is a GIGANTIC amount of unstructured data!).
To use sentiment analysis the system has to classify the polarity of the data. The basic division would be something like positive, neutral, and negative.
Smarter systems would aim to identify other sentiments like surprised or angry, for example. In general, you’ll find two kinds of sentiment analysis methods: machine learning and lexicon-based.
Lexicon-based analysis scores words: positive words add to the score, and negative words subtract from it. In the end, you add everything up to get a final number. The higher the number, the more positive the text, and vice versa.
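That adding-and-subtracting process can be sketched in a few lines. The four-word lexicon and its scores here are invented for illustration; real lexicons hold thousands of scored entries.

```python
# Lexicon-based sentiment: each word carries a score, positive words
# add, negative words subtract, and the sign of the total gives the
# polarity of the text.
LEXICON = {"great": 2, "good": 1, "bad": -1, "terrible": -2}

def polarity(text):
    score = sum(LEXICON.get(word.strip(".,!?").lower(), 0) for word in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("The support was great but the app is bad"))  # 2 - 1 = 1 -> positive
```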
In the machine learning approach, the system needs annotated data sets, where humans manually tag the information. Given enough tagged examples, the machine learns to recognize sentiment. In some cases, both approaches are combined for a more accurate analysis.
Read about sentiment analysis on Reddit to predict approval ratings for Donald Trump here.
Automatic summarization
Great if you want to skip to the important parts of that 5,000-word article… if the system works well, of course. If not, you'll be missing important information.
We just tried to summarize our last post using this tool. According to the tool, this is the most important 10% of our last post:
What is NLP
Natural Language Processing (or NLP from now on) is concerned with how computers understand and interpret human language. In these languages, word segmentation is a BIG task!
Read more about how researchers approach word segmentation in several Asian languages such as Japanese, Thai, and Chinese here.
Part of speech tagging
Words can mean different things depending on where in the speech they are, this task identifies that.
These are some of the most difficult tasks since it’s about understanding the meaning of the natural language. Since many words have more than one meaning, this is crucial to understand natural language!
By the way, chingar, although widely used, might be considered a curse word… don’t use it lightly!
Here are some great examples of NLP.
Cool examples of uses of NLP and the tasks you read before
As said before, NLP is used to analyze unstructured data.
So as you can see, it's not very accurate. That's because it's probably just extraction-based summarization, meaning that it takes phrases from the original post and puts them together in whatever way the system thinks makes sense. To do this, it considers things like headings or the first sentences of paragraphs. A more complex approach would be abstraction-based summarization, which rewrites the content in new words instead of copying phrases.
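A bare-bones version of the extraction-based approach just described might look like this. It relies on a single assumed heuristic (keep the first sentence of each paragraph) and does no rewriting at all, which is exactly why such summaries can read as disjointedly as the one above.

```python
# Extraction-based summarization at its simplest: keep the first
# sentence of each paragraph, on the heuristic that opening sentences
# carry the main point. No abstraction, no rewriting.
def summarize(document):
    summary = []
    for paragraph in document.strip().split("\n\n"):
        first_sentence = paragraph.strip().split(". ")[0].rstrip(".") + "."
        summary.append(first_sentence)
    return " ".join(summary)

doc = """NLP helps computers understand language. It has many tasks.

Sentiment analysis classifies polarity. It is used in marketing."""
print(summarize(doc))
# -> NLP helps computers understand language. Sentiment analysis classifies polarity.
```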
And now, some examples so you can see NLP being used!
Cellebrite
It sounds like something from CSI, but it's true: the company Cellebrite helps solve crimes using AI and NLP. One of their products analyzes data to gather insights faster… imagine having many, many texts on a suspect's phone and having to analyze all that unstructured data! NLP can do it faster and better.
Read about how AI and machine learning are helping solve a wide range of crimes here.
Spam filters
You didn't see this coming, right? Spam filters classify what is spam and what is not. It sounds like a very basic task but, of course, by now you know that even tasks like this require a lot happening in the background. A system might use lemmatization and part of speech tagging, for example. Most spam filtering systems use Naïve Bayes classifiers.
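Here is a minimal Naïve Bayes spam filter written from scratch, trained on a tiny invented corpus (the four messages below are made up for illustration). Real filters train on millions of labeled messages, but the math is the same: pick the class that maximizes the prior probability of the class times the product of each word's probability given that class.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences per class and class frequencies."""
    counts = {"spam": Counter(), "ham": Counter()}
    labels = Counter()
    for text, label in messages:
        labels[label] += 1
        counts[label].update(text.lower().split())
    return counts, labels

def classify(text, counts, labels):
    """Return the class with the highest log-probability for the text."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best_label, best_score = None, float("-inf")
    for label in counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(labels[label] / sum(labels.values()))
        total = sum(counts[label].values())
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting moved to tomorrow", "ham"),
    ("see you at lunch tomorrow", "ham"),
]
counts, labels = train(training)
print(classify("free prize now", counts, labels))  # -> spam
```

The add-one smoothing keeps a single unseen word from zeroing out a whole class, which is the standard fix in this family of classifiers.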
Machine translation
Imagine what it takes for a machine to first understand what a text says and then translate it into another language! For machines to do this, a lot of context is involved, which means a lot of the tasks we've already talked about have to be accomplished. It's not just about translating word by word; it's about understanding a text, translating it, and then turning that translation into text that makes sense in the other language. Here's an explanation of it. Although it doesn't explicitly mention everything we've written about, you'll be able to identify the tasks.
Royal Bank of Scotland
Who loves banks? Nobody. Banks get complaints every day, in different forms and of different types. The Royal Bank of Scotland uses all this unstructured data to analyze client feedback (sentiment analysis is involved!). They take this information from emails, surveys, and even call center conversations. By doing this, they can identify problems and implement changes for better customer service. Go here to watch a video on how they do it.
USPTO
Every day, hundreds of patent applications are filed with the USPTO (US Patent and Trademark Office). That's a lot of information. AI and NLP can also help analyze it so that patent examiners can, well, examine patents more effectively. Read more about it here.
Many other NLP tasks are present in the machines we use every day. Now that you know about them, you'll appreciate it even more when Google Translate helps you understand the world a bit better!