It’s not what you say, it’s the way you say it. Ever since the invention of the Smiley, people have sought ways to represent their emotional state in the online world.
In the kinds of online interaction where natural language interaction (NLI) technology is used so successfully today, the intent of a message is expressed in words.
Website bots and mobile speech agents use NLI to interpret the meaning of a message through semantic analysis of those words.
But interpersonal communication extends far beyond words. A human listener can detect emotion in a speaker’s tone of voice, facial expressions, or body gestures.
That non-verbal information enables, for example, airline staff to correctly handle a passenger who replies “great!” on being told that their flight has been unexpectedly cancelled.
The unfortunate passenger may be worried, resigned or even angry – but they are unlikely to be happy. The airline agent recognizes that, in this context, “great!” is being used sarcastically and does not mean that the passenger is satisfied with the situation.
At first sight, it might appear a formidable task to train computer programs to accurately recognize human emotions simply by listening to speech.
But researchers at the University of Rochester in the US recently announced they have developed a program that gauges human feeling through speech with “substantially greater” accuracy than existing approaches.
Interestingly, the program does not look at the meaning of the words.
“We actually used recordings of actors reading out the date of the month – it really doesn’t matter what they say, it’s how they’re saying it that we’re interested in,” says Wendi Heinzelman, professor of electrical and computer engineering at the University of Rochester.
The program analyzes 12 features of speech, such as pitch and volume, to identify one of six emotions from a sound recording.
Emotion affects the way people speak by altering the volume, pitch and even the harmonics of their speech. Of course, a listener is not consciously picking out these individual features. Humans just know what “angry” sounds like, particularly when the emotion is expressed by someone they know.
To train the computer to recognize emotions, the researchers isolated and measured the 12 features in each recording at short intervals. They then categorized each recording and used it to teach the program what “angry,” “sad,” “happy,” “fearful,” “disgusted,” or “neutral” sounds like.
The software then analyzed new recordings and tried to determine whether the voice in the recording portrayed any of the known emotions.
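The pipeline described above (measure acoustic features over short frames of a recording, then match new recordings against labelled examples) can be sketched in Python. The two features and the nearest-centroid classifier below are illustrative stand-ins for the study’s actual 12 features and learning algorithm, which the article does not detail, and the sine-wave “recordings” are synthetic placeholders for real speech.

```python
import math

def tone(freq, amp, rate=8000, secs=1.0):
    """Synthesize a sine-wave 'recording' for demonstration purposes."""
    return [amp * math.sin(2 * math.pi * freq * t / rate)
            for t in range(int(rate * secs))]

def frame_features(samples, frame=400):
    """Measure simple features over short frames: RMS energy (a proxy
    for volume) and zero-crossing rate (a rough proxy for pitch)."""
    feats = []
    for start in range(0, len(samples) - frame + 1, frame):
        chunk = samples[start:start + frame]
        energy = math.sqrt(sum(s * s for s in chunk) / frame)
        zcr = sum(1 for a, b in zip(chunk, chunk[1:])
                  if (a < 0) != (b < 0)) / frame
        feats.append((energy, zcr))
    return feats

def mean_features(samples):
    """Average the per-frame features into one vector per recording."""
    frames = frame_features(samples)
    return tuple(sum(f[i] for f in frames) / len(frames) for i in range(2))

def classify(samples, centroids):
    """Label a new recording with the nearest emotion centroid."""
    x = mean_features(samples)
    return min(centroids, key=lambda label: sum(
        (a - b) ** 2 for a, b in zip(x, centroids[label])))

# "Training": compute one feature centroid per emotion from labelled
# clips. Loud, high-pitched tones stand in for "angry" speech here;
# quiet, low-pitched tones stand in for "sad".
centroids = {
    "angry": mean_features(tone(400, 1.0)),
    "sad": mean_features(tone(100, 0.2)),
}
```

A new recording is then labelled by whichever emotion centroid its features fall closest to: `classify(tone(380, 0.9), centroids)` returns `"angry"`.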
The system achieved 81 percent accuracy, the researchers claim – a significant improvement on earlier studies that reached only about 55 percent.
Previous studies have shown that emotion classification systems are highly speaker dependent; they work much better if the system is trained by the same voice it will analyze.
The new research confirms this finding: when the speech-based emotion classifier was used on a voice different from the one that trained the system, accuracy dropped from 81 percent to about 30 percent.
The researchers are now looking at ways of minimizing this effect, for example, by training the system with a voice in the same age group and of the same gender.
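This accuracy drop is easy to reproduce in a toy setting. The sketch below uses invented numbers, not the study’s data: each synthetic “speaker” maps emotions onto a single acoustic feature with a personal baseline offset, so a decision threshold fitted to one voice transfers poorly to another.

```python
import random

random.seed(0)

def recordings(speaker_offset, n=200):
    """Synthesize (feature, label) pairs for one hypothetical speaker:
    'happy' clips sit one unit above 'sad' clips, and the whole range
    is shifted by the speaker's personal baseline."""
    data = []
    for _ in range(n):
        label = random.choice(["happy", "sad"])
        base = 1.0 if label == "happy" else 0.0
        data.append((base + speaker_offset + random.gauss(0, 0.3), label))
    return data

def fit_threshold(data):
    """Place the decision boundary midway between the class means."""
    happy = [x for x, lab in data if lab == "happy"]
    sad = [x for x, lab in data if lab == "sad"]
    return (sum(happy) / len(happy) + sum(sad) / len(sad)) / 2

def accuracy(data, thr):
    correct = sum(1 for x, lab in data if (x > thr) == (lab == "happy"))
    return correct / len(data)

# Train on speaker A, then test on speaker A versus a new speaker B
# whose baseline sits well above the fitted threshold.
thr = fit_threshold(recordings(speaker_offset=0.0))
same_voice = accuracy(recordings(0.0), thr)
new_voice = accuracy(recordings(1.2), thr)
```

Running this, the same-voice accuracy is far higher than the new-voice accuracy, which hovers near chance: the threshold learned from speaker A mislabels most of speaker B’s “sad” clips as “happy”.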
Clearly, there are still big challenges to be resolved before such a system could be used to reliably recognize emotions in real-life situations involving a wide variety of anonymous speakers – an airline call center, for example.
But in much the same way that the NLI technology in a smartphone improves as it learns the characteristics of its owner’s voice, a smartphone that incorporated emotion detection could learn to accurately recognize the unique characteristics that act as tell-tale signs of its owner’s emotional state.
That opens up a wealth of possible applications that exploit the increasingly intimate relationship people have with their smartphones.
For example, instead of having to manually select the emoji that represents your current emotional state on WhatsApp, a phone with emotion detection could do it for you automatically.
More sophisticated applications would evolve from marrying the power of natural language interaction with emotion detection to understand both the verbal and emotional content of messages.
For example, if the smartphone detects that its owner is in a happy state of mind and has just had a long, rambling conversation with a friend, it could suggest the owner call another friend they haven’t spoken to for a while.
Conversely, if the owner is sending a lot of work messages and their voice sounds tired, the system could pop up a “why not take a break?” message.
More on the University of Rochester project here.