If you recognize this journey, it’s because Siri, OK Google, Alexa, and Cortana use it. Now you’ll better understand HOW they do it. Let’s start with why you would use speech to communicate with a machine.
First of all, you cannot expect a machine to fall in love with you, like in Her, the movie. We’re far away from that still. But you can still talk to your devices – and have them talk back.
One of the appeals of “speaking” to a machine is that it comes naturally to us, it’s faster, and you don’t need special skills to give orders to a system. Many people are better at speaking than writing, so being understood just by talking is a great advantage! It’s also worth mentioning that machines that understand speech can be super helpful for people with certain disabilities.
But let’s get back to Her, the movie. Part of the reason it’s so easy for Theodore, the protagonist, to fall in love with Samantha, the OS, is that he can talk to her like a normal person. Nothing he says to this OS sounds like a command; it all sounds like something you’d say to a friend (or lover, wow❤️). In one scene, Samantha asks if they can move forward with deleting some emails. Theodore just says, “Yeah, let’s do that.”
For a machine to understand that yeah means yes, and let’s do that is an affirmation, it has to recognize the words first.
You can speak to a computer and be understood by it thanks to speech recognition. You’ve used it several times by now, but here’s how it works (think about it next time you speak to Alexa!).
Every time you speak to a device, it records the sound of your voice as a waveform. The device filters out the background noise and normalizes the volume of your voice. Many devices use neural networks to simplify the speech signal. Also, some systems use voice activity detectors (VADs) to reduce the audio to only the parts that contain speech. Once the wave is “normal,” it’s broken down into phonemes. Then, starting from the first phoneme, the device uses statistical probability to work out which words you most likely said. Many systems have traditionally used a Hidden Markov Model (HMM) to do this. Here’s a detailed explanation of how HMMs work if you’re interested in knowing more.
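To make the cleanup steps a bit more concrete, here’s a minimal sketch of volume normalization and a very simple energy-based voice activity detector. It’s only an illustration: the function names, the 10-millisecond frame size, and the 0.01 threshold are our own assumptions, and real devices use far more robust methods.

```python
import math

def normalize_peak(samples, target_peak=1.0):
    """Scale the waveform so its loudest sample reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    return [s * target_peak / peak for s in samples]

def split_frames(samples, sample_rate, frame_ms=10):
    """Cut the waveform into consecutive, non-overlapping frames."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def detect_speech(samples, sample_rate, threshold=0.01):
    """Return one flag per frame: True where the energy exceeds the threshold."""
    frames = split_frames(normalize_peak(samples), sample_rate)
    return [frame_energy(f) > threshold for f in frames]

# Toy signal at 8 kHz: 20 ms of silence, 20 ms of a 440 Hz tone, 20 ms of silence.
rate = 8000
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(160)]
flags = detect_speech(silence + tone + silence, rate)
print(flags)  # [False, False, True, True, False, False]
```

Only the two frames in the middle are flagged as “speech,” so a later stage would only need to look at those.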
This model divides the wave into 10-millisecond parts and turns them into vectors, which are grouped and matched to one or more phonemes. Phonemes are the basic sound units of speech, by the way. Finally, an algorithm is applied to narrow down the options and determine the words you said. This algorithm needs training because we all say things differently: we have different accents, and we even speak differently depending on our mood.
Since training is done by humans, it can be biased. This isn’t entirely bad, since training can be adapted to understand a particularly difficult accent, for example. In other cases, it can make things more complicated for people. One case worth mentioning is that voice-controlled systems in cars are often trained to better understand men’s voices, which makes them harder for women to control. You can read more about it here.
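If you’re curious what “narrowing down options” can look like, here’s a toy sketch of the Viterbi algorithm, which is commonly used to find the most likely sequence of hidden states in an HMM. Everything in it is made up for illustration: the three phoneme states, the cluster labels “A”, “B” and “C” (standing in for grouped vectors), and all the probabilities. Real systems learn these numbers from training data and usually work with log probabilities.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                ((best[-1][p][0] * trans_p[p][s], p) for p in states),
                key=lambda pair: pair[0],
            )
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Start from the most likely final state and walk the back-pointers.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path

def collapse_repeats(path):
    """Merge consecutive frames that were assigned the same phoneme."""
    collapsed = []
    for state in path:
        if not collapsed or collapsed[-1] != state:
            collapsed.append(state)
    return collapsed

# Hypothetical toy model for the word "yes".
states = ["y", "eh", "s"]
start_p = {"y": 0.8, "eh": 0.1, "s": 0.1}
trans_p = {
    "y": {"y": 0.5, "eh": 0.5, "s": 0.0},
    "eh": {"y": 0.0, "eh": 0.5, "s": 0.5},
    "s": {"y": 0.0, "eh": 0.0, "s": 1.0},
}
emit_p = {
    "y": {"A": 0.8, "B": 0.1, "C": 0.1},
    "eh": {"A": 0.1, "B": 0.8, "C": 0.1},
    "s": {"A": 0.1, "B": 0.1, "C": 0.8},
}

path = viterbi(["A", "A", "B", "C", "C"], states, start_p, trans_p, emit_p)
print(path)                    # ['y', 'y', 'eh', 's', 's']
print(collapse_repeats(path))  # ['y', 'eh', 's']
```

Each observation stands for one short slice of audio, which is why the same phoneme can show up several times in a row before it’s collapsed.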
Now you know that your voice is turned into text. The recordings themselves may also be stored. For example, Amazon keeps recordings of almost everything you say to Alexa, which is very creepy (on that link you can also find out how to delete those files)! And if you visit your Google activity, you’ll notice there are a lot of voice files from you.
Speech recognition turns your voice into text… so then how does the device understand what you’re saying? Yes, NLP!
Natural Language Processing
It’s one thing to know what a human has said and another to understand what the human meant by it. In the case of Theodore talking to Samantha and saying, “Yeah, let’s do that,” the machine has to understand that the context in which these words are being said is a conversation about emails and that, by saying those words, Theodore wants to delete some emails. A lot of NLP tasks are used in this process. Some of them include lemmatization and stemming, which you can read more about here. Others include more complex NLP tasks like natural language generation or natural language understanding, which we talked about in more detail here.
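To give you a feel for the difference between stemming and lemmatization, here’s a deliberately naive sketch. The suffix list and the tiny lemma dictionary are our own assumptions; real stemmers and lemmatizers use much richer rules and vocabularies.

```python
# Hypothetical, very small rule set: longest suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Chop off a known suffix, as long as at least 3 letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table mapping word forms to their dictionary form.
LEMMAS = {
    "deleting": "delete",
    "deleted": "delete",
    "emails": "email",
    "said": "say",
}

def naive_lemma(word):
    """Look up the dictionary form; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

print(naive_stem("deleting"))   # delet
print(naive_lemma("deleting"))  # delete
```

Notice that the stem (“delet”) doesn’t have to be a real word, while the lemma (“delete”) does. That’s the main practical difference between the two.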
After recognizing the words, the device has to analyze the language and, more importantly, determine if any action should be taken. The actions to be taken are usually pre-programmed, which is why you can’t do EVERYTHING with Siri, for example. That’s also the reason why, when you start using a device with speech, it has pre-built options, so you know exactly what you can do with it. Take this example from Cortana. When you start using it, it suggests you ask “what happened today” and it immediately provides an answer. You can see how, after the answer, Cortana suggests more things you can do with it.
Another example: if you want to play a specific song by a specific artist on a specific app, you can ask the personal assistant on your Android phone (OK Google, for short) to do it. This order is easy to understand.
But what if you make a vague request like “can you play a song?” There are two problems to consider here. First, the device has to understand that we’re not asking whether it can play music; we know it can. Second, it has to understand that what we’re really asking is for it to open a music app and start playing a song.
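Here’s a small sketch of how pre-programmed actions and a vague request could fit together. It’s a keyword-based toy, not how any real assistant works: the intent names, the keyword lists, and the `detect_intent` function are all hypothetical.

```python
import re

# Hypothetical pre-programmed actions, each with a few trigger keywords.
INTENTS = {
    "play_music": ("play", "song", "music"),
    "delete_email": ("delete", "email", "emails"),
    "get_news": ("news", "happened", "today"),
}

# "Can you ..." is a polite wrapper around a command, not a yes/no question.
POLITE_PREFIX = re.compile(r"^(can|could|would) you (please )?", re.IGNORECASE)

def detect_intent(utterance):
    """Return the best-matching intent name, or None if nothing matches."""
    command = POLITE_PREFIX.sub("", utterance.strip()).lower()
    words = set(re.findall(r"[a-z']+", command))
    scores = {
        name: sum(keyword in words for keyword in keywords)
        for name, keywords in INTENTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(detect_intent("Can you play a song?"))  # play_music
print(detect_intent("what happened today"))   # get_news
print(detect_intent("tell me a joke"))        # None
```

The last call returns `None` because no pre-programmed action matches, which is the toy version of “you can’t do EVERYTHING with Siri.”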
Natural Language Generation
As we explained before, Natural Language Generation (NLG) is one of the most complex tasks in NLP. NLG tries to turn “ideas” from the machine into text. If you’ve been paying attention, which we’re sure you have, you’ve seen that every response from the machine is turned into text (and voice, but we’ll get to that later). Here’s a diagram so you understand how it works:
To read more information about how NLG works, you can visit this site and this site.
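The simplest possible form of NLG is filling in templates, and a quick sketch shows the idea of turning the machine’s “ideas” (structured data) into a sentence. The templates, intent names, and `generate_reply` function below are hypothetical, and modern systems generate language in far more flexible ways.

```python
# Hypothetical reply templates, one per action the assistant can perform.
TEMPLATES = {
    "play_music": "Playing {title} by {artist}.",
    "delete_email": "I deleted {count} {noun}.",
}

def generate_reply(intent, **data):
    """Turn an intent plus structured data into a sentence."""
    template = TEMPLATES.get(intent)
    if template is None:
        return "Sorry, I can't do that yet."
    if intent == "delete_email":
        # Pick the singular or plural noun so the sentence reads naturally.
        data["noun"] = "email" if data["count"] == 1 else "emails"
    return template.format(**data)

print(generate_reply("delete_email", count=3))
# I deleted 3 emails.
print(generate_reply("play_music", title="Hey Jude", artist="The Beatles"))
# Playing Hey Jude by The Beatles.
```

Even in this tiny example, the generator has to make a language decision (singular or plural), which hints at why full NLG gets complicated quickly.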
And finally, you listen to your device speaking to you.
Every time your device “writes” something to you, it also speaks out loud. This is text-to-speech. But text-to-speech is used in many more ways than just assistants like Siri, Cortana or Alexa. Text-to-speech tools are useful for those who can’t read, or have disabilities. You probably already have Stephen Hawking on your mind, speaking through a computer.
Let’s talk more about each part.
1. Text to words
NLP is used to understand what the words mean in each sentence. This is particularly important when there are homographs in the text (words that are spelled the same but pronounced differently), like in “I read the red book last Tuesday.” In this example, read is pronounced the same as red because it’s in the past tense. If a computer doesn’t understand the context, it might pronounce it as if it were in the present tense.
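As a rough illustration of using context to pick a pronunciation, here’s a tiny rule-based sketch for the word “read.” The list of past-tense cue words and the `pronounce_read` function are our own assumptions; a real system would rely on part-of-speech tagging or a trained model rather than a handful of keywords.

```python
import re

# Hypothetical cue words that hint the sentence is about the past.
PAST_CUES = {"yesterday", "last", "ago", "already"}

def pronounce_read(sentence):
    """Return 'red' (past tense) or 'reed' (present tense) for 'read'."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return "red" if words & PAST_CUES else "reed"

print(pronounce_read("I read the red book last Tuesday."))  # red
print(pronounce_read("I read a book every morning."))       # reed
```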
2. Words to phonemes
Phonemes, as we said before, are the most basic units of speech: we’re talking about sounds like p or f. The complexity of this is that many languages have more phonemes than letters. English uses 26 letters but has around 40 phonemes (the exact number depends on the dialect and on how you count). There are other languages, like Italian, where the sounds of the letters are much more consistent (you can read more about this interesting topic here).
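One common way to go from words to phonemes is a pronunciation dictionary. Here’s a minimal sketch with a hand-written, five-word lexicon using ARPAbet-style symbols; the `LEXICON` table and `to_phonemes` function are hypothetical, and real systems combine large dictionaries with rules or models for unknown words.

```python
# Hypothetical mini-lexicon: each word maps to a list of ARPAbet-style phonemes.
LEXICON = {
    "i": ["AY"],
    "read": ["R", "EH", "D"],  # past-tense pronunciation, as in the example sentence
    "the": ["DH", "AH"],
    "red": ["R", "EH", "D"],
    "book": ["B", "UH", "K"],
}

def to_phonemes(text):
    """Look up every word; unknown words become a single '<unk>' marker."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        phonemes.extend(LEXICON.get(word, ["<unk>"]))
    return phonemes

print(to_phonemes("I read the red book."))
# ['AY', 'R', 'EH', 'D', 'DH', 'AH', 'R', 'EH', 'D', 'B', 'UH', 'K']
```

Notice how “book” has four letters but only three phonemes, and how “read” and “red” end up with exactly the same sounds despite their different spellings.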
3. Phonemes to sound
Here’s where the final result starts taking form. Basically, there are 3 ways in which a computer could try to convert phonemes into sound:
Computers can generate the sounds themselves and say words, although the result sounds robotic, like Stephen Hawking’s voice. This works similarly to music synthesizers.
Machines can use human voice samples, or recordings of sections of human speech, to build their own sentences. You’ve heard this in airports, and this company might be the one responsible.
Machines can try to mimic how humans produce their voices. In this case, words can sound pretty natural, but when a whole sentence is said, it sounds off. This is because machines lack intonation… and general human personality when they try to say things. The good news is that you can program machines to change this: to raise the pitch when there’s a question, or to sound different when a word is written in capital letters. Of course, this is an active area of research, and there are several applications that sound more and more human. An example is this one from Google.
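To show what “programming intonation” could mean in practice, here’s a small sketch that plans a pitch and loudness value for each word before any sound is produced. The base pitch of 120 Hz, the 25% rise for questions, the 1.5x loudness for capitalized words, and the `plan_prosody` function are all assumptions chosen for illustration.

```python
BASE_PITCH_HZ = 120.0  # assumed neutral speaking pitch

def plan_prosody(sentence):
    """Return a list of (word, pitch_hz, loudness) tuples for a sentence."""
    is_question = sentence.rstrip().endswith("?")
    words = sentence.rstrip(" ?.!").split()
    plan = []
    for index, word in enumerate(words):
        pitch = BASE_PITCH_HZ
        loudness = 1.0
        if is_question and index == len(words) - 1:
            pitch *= 1.25  # raise the pitch on the last word of a question
        if word.isupper() and len(word) > 1:
            loudness = 1.5  # stress words written in capital letters
        plan.append((word, pitch, loudness))
    return plan

for entry in plan_prosody("Delete ALL emails?"):
    print(entry)
# ('Delete', 120.0, 1.0)
# ('ALL', 120.0, 1.5)
# ('emails', 150.0, 1.0)
```

A synthesizer could then use these values when it produces the audio for each word, so the question actually sounds like a question.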
And the cycle is complete!
The applications of everything we’ve talked about are almost endless: the sky is the limit… along with the current technology available. By now, you already know that Siri, Cortana, Alexa, and every other personal assistant work this way. But let’s look at other examples.
This Hungarian company wants to make communication between hearing and deaf people easier. Their system interprets sign language and turns it into text. It also “hears” people and turns their speech into text so the deaf person can communicate with them. It works like this: