There has been a lot of hype recently about emotion detection. Many people are unclear about how well it works, or whether it even exists!
We asked our panel of experts for their opinion.
The business view
According to Wikipedia the word ‘emotion’ means a subjective experience, associated with mood, temperament, personality and disposition. So surely the first question must be, is it even possible to determine a customer’s mood and emotion based solely on the sound of their voice, particularly when this is something which is open to so much interpretation?
Granted you are likely to detect ‘stress’ in a voice by setting parameters in areas like tone, pitch, pace, volume or lack of it, but getting to the root of the emotion behind that may prove more difficult, particularly if it is an automated software process rather than a human judgement call.
Some customers are very cool, calm and collected when conversing, although deep down they may be angry. They may be able to clearly outline within the call what will happen if their problem is not rectified and these calls may go undetected when applying emotion detection.
However, they would quite probably be picked up when employing phonetic indexing to search for key terminology traditionally used in an ‘unhappy customer’ scenario.
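As a rough illustration of that point, the sketch below searches a call for phrases an unhappy customer might use. It is a simplification: true phonetic indexing searches the audio itself, whereas this example assumes a plain-text transcript is already available. The phrase list and the function name `find_unhappy_phrases` are hypothetical.

```python
import re

# Hypothetical phrase list; a real deployment would tune this to the business.
UNHAPPY_PHRASES = [
    "cancel my account",
    "speak to a manager",
    "not acceptable",
    "complaint",
]

def find_unhappy_phrases(transcript: str, phrases=UNHAPPY_PHRASES) -> list[str]:
    """Return the listed phrases that occur in the transcript (case-insensitive)."""
    text = transcript.lower()
    return [
        phrase
        for phrase in phrases
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    ]

# A calm, collected caller whose words still signal a problem.
calm_but_angry = (
    "I have explained this twice now. If it is not fixed by Friday "
    "I will cancel my account and make a formal complaint."
)
print(find_unhappy_phrases(calm_but_angry))
```

The caller's tone never enters into it, which is why this kind of search can catch calls that a stress-based measure would miss.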
The real question is not “does emotion detection really work?” but “do the benefits associated with enabling emotion detection justify the costs?” And to that question the answer at the moment is “no”.
Emotion detection requires stereo (dual-channel) recording, and at present the cost of providing it may not be justified by the benefits.
Emotion detection doesn’t necessarily give you all the answers
The other key point to make is that knowing whether someone is agitated or not is not enough. So what if you search across 1,000 calls and 73 of them reveal a level of emotion which is flagged? Further analysis is still required to find out why and what has led to that situation.
Similarly, there may be a number of false alarms, where a lot of “stress” is detected in a call but the call was handled relatively well and its outcome was reasonably positive. Such calls offer little extra insight into the business.
The lie detector?
In the early years, emotion detection was touted as a possible tool to help combat fraud, particularly in the area of fraudulent insurance claims. But the reality is that it is not a replacement lie detector: it is generally understood that telephony can only carry around 20 percent of the frequency range of the human voice.
That said, a significant number of our customers still ask us “can you supply emotion detection?”, predominantly because it is perceived as “flashy and exciting” and often used descriptively when explaining the concept of speech analytics for the first time. The answer is yes, but we ask them why or how they want to use it, and invariably another tool can provide the same result without the expense of emotion detection.
To conclude, it’s here, it works but there are probably more useful methods which can be deployed to deliver better returns.
There is little commercial value in detecting emotions
Research has shown that there is little commercial value in detecting strong and obvious emotions such as anger, great joy, etc. Firstly, compared to the overall volume of calls in a contact centre they seldom occur, and secondly, when they do occur their cause is known or easily identified, so the benefit in detecting them on their own is low.
Better to look for keywords and phrases
Of far greater benefit is detecting the dozens of subtle speech components that make up an emotion and analysing these and their changing relationships to each other as the call progresses. In parallel the triggers for these changes such as the use of specific words/phrases, the time of day or the profile of staff used can be identified. These can then be assessed against the desired call outcome to drive better results in the future.
Practical applications in the contact centre
Sentiment analysis can be an invaluable tool for improving the quality of service. 
The scientific view
Emotion lies behind much of the richness of human life and is one of the main drivers behind our choices and decisions. Accordingly it is a frequent customer request that we should provide tools for “emotion recognition” by analogy with the “speech recognition” systems increasingly widely available.
Emotion is a fluid and rather slippery concept
Some people, through disability or upbringing, find it difficult or impossible to understand and/or express how they feel. Likewise, interpreting someone else’s feelings is an art rather than a science.
In order to avoid attributing hard categories, many researchers prefer to use a continuous space such as that in Figure 1 rather than discrete labels to describe emotion.
Figure 1: A two-dimensional representation of emotion, derived from Cowie et al. (2000)
The advantage of this representation is that it is possible to express as numbers the continuous scale from “mildly irritated” to “incandescent with rage” and also to capture the shades of grey between related pairs of emotions.
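A minimal sketch of the idea, assuming the two dimensions are valence (negative to positive) and arousal (passive to active). The axis names, the -1 to +1 ranges and the example coordinates are all assumptions made for illustration, not values taken from the figure.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class EmotionPoint:
    valence: float  # assumed axis: -1.0 (negative) to +1.0 (positive)
    arousal: float  # assumed axis: -1.0 (passive) to +1.0 (active)

    def intensity(self) -> float:
        """Distance from the neutral centre of the space."""
        return math.hypot(self.valence, self.arousal)

def distance(a: EmotionPoint, b: EmotionPoint) -> float:
    """How far apart two emotional states are in the space."""
    return math.hypot(a.valence - b.valence, a.arousal - b.arousal)

# Illustrative coordinates only.
mildly_irritated = EmotionPoint(valence=-0.2, arousal=0.1)
incandescent = EmotionPoint(valence=-0.9, arousal=0.9)
content = EmotionPoint(valence=0.5, arousal=-0.3)

print(round(mildly_irritated.intensity(), 2), round(incandescent.intensity(), 2))
print(distance(mildly_irritated, incandescent) < distance(content, incandescent))
```

Both kinds of irritation sit in the same region of the space and differ mainly in intensity, while contentment lies further away from either.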
For many years, speech recognition developments have benefited from the availability of common databases, allowing relatively easy performance comparisons among different approaches developed by different laboratories. It is only within the past year that the first such competitive evaluation has been undertaken in the field of “emotion recognition” (Schuller et al., 2009).
The task of recognising emotions from audio alone is even more difficult than that of speech recognition. Much of our perception of emotions is visual: facial expressions and body language contribute as much as half the emotional information (Mehrabian, 1968).
Therefore, the task of “emotion recognition” today is roughly where “speech recognition” was 20 years ago. Some applications are possible, but only in strictly limited areas or under highly controlled conditions. By acknowledging those limitations and adapting to them, some commercial applications may be feasible, but it is likely to be some considerable time before an effective system for general-purpose interpretation of real emotions from audio is available.
People express emotion in different ways
Equally, another problem with detecting emotions is that no two people display them in quite the same way; when I get angry, I may shout, when you get angry, you might go quiet. But then again, when I’m happy I might shout as well so what does that tell us? Not a lot really.
One section of the manual analytics that we do for customers asks our raters, who are real human beings, to make judgements on the mood of the customer, amongst other things. It takes great skill to do this with any certainty. Judging someone’s mood and scoring it on a scale of -4 to +4 is a skill that takes some time to learn, and there will still be differences of opinion as to the “right” score, if indeed there is one.
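One way to put a number on those differences of opinion is to compare two raters' scores for the same calls. The sketch below is only an illustration: the scores are invented, and the two measures (mean absolute difference and the share of calls scored within one point of each other) are assumptions about what might be useful, not a description of our process.

```python
SCALE_MIN, SCALE_MAX = -4, 4

def check_scores(scores):
    """Reject any score that falls outside the -4 to +4 scale."""
    for score in scores:
        if not SCALE_MIN <= score <= SCALE_MAX:
            raise ValueError(f"score {score} is outside {SCALE_MIN}..{SCALE_MAX}")

def mean_abs_difference(rater_a, rater_b):
    """Average gap between two raters' scores for the same calls."""
    check_scores(rater_a)
    check_scores(rater_b)
    gaps = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    return sum(gaps) / len(gaps)

def share_within_one_point(rater_a, rater_b):
    """Fraction of calls where the two raters differ by at most one point."""
    close = [abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)]
    return sum(close) / len(close)

# Invented mood scores for six calls.
rater_a = [-3, -1, 0, 2, 4, -2]
rater_b = [-2, -1, 1, 2, 2, -4]

print(mean_abs_difference(rater_a, rater_b))
print(share_within_one_point(rater_a, rater_b))
```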
Another fundamental challenge of emotional analysis is that the same indicators in two different people can mean two different things. For example on similar campaigns an elderly person might speak slowly and softly while a male teenager may talk quickly and loudly. This does not mean that the teenager is angry or the elderly person content, they simply have different ways of expressing their emotions.
It is vital to assess the harmony between the caller’s and agent’s style of speech. This helps answer the question of whether you have the right agents with the right soft skills on the right campaigns, so that they can easily tune in to the caller’s style of speech and mirror their behaviour, making them feel at ease to achieve the desired outcome.
On a single call, if the indicators of emotion fall outside norms established by the algorithm and/or there is a lack of harmony between caller and agent it can be escalated as it happens for immediate intervention by a more experienced member of staff. Equally, multiple instances would suggest that the campaign itself needs attention.
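The escalation rule can be sketched as follows. The sketch assumes that the “norms” are simply the mean and standard deviation of each indicator over past calls on the same campaign, and that a call is flagged when any indicator lies more than two standard deviations from its mean. The indicator names, the figures and the threshold are hypothetical, and the harmony check between caller and agent is not modelled.

```python
from statistics import mean, stdev

def outside_norm(value, history, z_threshold=2.0):
    """True if value lies more than z_threshold standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

def should_escalate(call, norms, z_threshold=2.0):
    """Flag the call if any indicator falls outside its established norm."""
    return any(
        outside_norm(call[name], history, z_threshold)
        for name, history in norms.items()
    )

# Hypothetical indicator values from past calls on one campaign.
norms = {
    "speech_rate_wpm": [140, 150, 155, 160, 145, 150],
    "volume_db": [60, 62, 61, 59, 60, 61],
}

typical_call = {"speech_rate_wpm": 152, "volume_db": 60}
unusual_call = {"speech_rate_wpm": 210, "volume_db": 61}

print(should_escalate(typical_call, norms))
print(should_escalate(unusual_call, norms))
```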
So to answer the question, emotion analysis most definitely works and its ability to achieve progressively more sophisticated analysis will only improve over time.
The acoustic approach
The acoustic approach relies on measuring specific features of the audio, such as tone of voice, pitch, volume, intensity and rate of speech. The speech of a surprised speaker tends to be faster, louder and higher in pitch, while that of a sad or depressed speaker tends to be slower, softer and lower in pitch. An angry caller may speak much faster and louder, and will raise the pitch of stressed vowels.
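To make those tendencies concrete, here is a deliberately crude sketch that compares three features against a speaker's own baseline. It assumes the features have already been measured. The field names, the 5 percent tolerance and the labels are hypothetical, and this is not how any particular product works.

```python
from dataclasses import dataclass

@dataclass
class AcousticFeatures:
    pitch_hz: float    # average pitch
    volume_rms: float  # loudness on a linear scale (assumed unit)
    rate_wpm: float    # speaking rate in words per minute

def change_direction(current: float, baseline: float, tolerance: float = 0.05) -> int:
    """+1 if clearly above baseline, -1 if clearly below, 0 if within tolerance."""
    relative = (current - baseline) / baseline
    if relative > tolerance:
        return 1
    if relative < -tolerance:
        return -1
    return 0

def coarse_label(current: AcousticFeatures, baseline: AcousticFeatures) -> str:
    """Map the three feature directions onto a very coarse description."""
    signs = [
        change_direction(current.pitch_hz, baseline.pitch_hz),
        change_direction(current.volume_rms, baseline.volume_rms),
        change_direction(current.rate_wpm, baseline.rate_wpm),
    ]
    if all(sign > 0 for sign in signs):
        return "raised: consistent with surprise or anger"
    if all(sign < 0 for sign in signs):
        return "lowered: consistent with sadness"
    return "inconclusive"

baseline = AcousticFeatures(pitch_hz=180.0, volume_rms=0.10, rate_wpm=150.0)
print(coarse_label(AcousticFeatures(220.0, 0.16, 190.0), baseline))
print(coarse_label(AcousticFeatures(160.0, 0.07, 120.0), baseline))
```

Note that the “raised” pattern cannot separate surprise from anger, since the two share the same direction of change on all three features.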
To create a database of defined sentiments against which ‘live’ audio can finally be evaluated and thereby deliver sentiment analysis, each single-emotion example is pre-selected from a ‘pristine’ set of recordings, manually reviewed and annotated to identify the sentiment it represents. Even in this pristine environment less than 60 percent of single-emotion, noise-free utterances can be correctly classified.
In the real world the call centre suffers from background noise, network interference and background talking – all of which substantially erode this percentage. Also the quality of the audio can significantly impact on the ability to identify these features. Compression methods make it very difficult to detect some of the most commonly sought features – such as jitter, shimmer and glottal pulse – even further degrading the results from this form of sentiment measurement.
Blended emotions are difficult to classify
This is compounded by the fact that speakers often express blended emotions, such as empathy mixed with annoyance, which are tremendously difficult to classify. Additionally, sentiment analysis is often incapable of adjusting for the varied ways different callers express the same emotions. For example, people from the North East or Scotland might be brusquer, while callers from the South West tend to be more polite even when displeased. These limitations highlight its non-viability as a business analysis tool.
Figure 2: Example of the linguistic, structured-query approach to sentiment analysis
Can the technology deliver?
There is much debate in the contact centre industry, supported by many promissory statements, about the capabilities of emotion detection. The debate covers issues such as how it will allow organisations to detect frustrated or angry callers and provide insight into a customer’s feelings about an organisation, product, or services and thus allow companies to develop early corrective measures to improve customer relationship management.
All of these questions are about insight that is critical to businesses, but can the technology deliver?
An interesting starting point is to ask your friends and colleagues first to define emotion, then to list the emotions they know, and then to explain the different ways in which we express them. Typically the answers are complex and varied, and classifying them automatically and accurately is a challenge.
Looking for trends in levels of agitation
Measuring ‘emotion’ is challenging because people have their own distinctive ways of communicating and there are many variables that affect their emotional state beyond the current conversation. In our experience, an effective measure of agitation can be created by detecting changes in the stress levels and speech tempo of the conversation. Higher levels of change in stress and tempo are normally associated with a higher level of agitation.
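A minimal sketch of such a measure, assuming each call has been split into segments and that a stress level and a speech tempo are available for each segment. The equal weighting of the two components, the use of relative change and the sample values are all assumptions made for illustration.

```python
def mean_relative_change(values):
    """Average size of the change between consecutive segments, relative to the earlier one."""
    changes = [
        abs(later - earlier) / abs(earlier)
        for earlier, later in zip(values, values[1:])
        if earlier != 0
    ]
    return sum(changes) / len(changes) if changes else 0.0

def agitation_score(stress_levels, tempos_wpm):
    """Combine change in stress and change in tempo with equal (assumed) weights."""
    return 0.5 * mean_relative_change(stress_levels) + 0.5 * mean_relative_change(tempos_wpm)

# Hypothetical per-segment measurements for two calls.
steady_call = agitation_score([0.3, 0.3, 0.3, 0.3], [150, 152, 149, 151])
agitated_call = agitation_score([0.3, 0.6, 0.4, 0.9], [150, 190, 140, 200])

print(round(steady_call, 3), round(agitated_call, 3))
```

The score rewards change rather than absolute level, so a naturally loud or fast speaker does not score highly just for being loud or fast.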
This measure of agitation is increasingly meaningful when it is observed across a significant body of calls. By looking at a larger number of calls the random variables smooth out and the agitation measure becomes more useful. For instance, in a full week’s worth of calls, the average agitation on the calls handled by certain agents, or about certain topics, will be consistently higher than other agents or topics. This analysis can provide valuable evidence about what factors are affecting the customer experience.
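The aggregation step might look like the sketch below, which averages a per-call agitation score by agent or by topic. The record layout, the field names and the scores are invented for the example.

```python
from collections import defaultdict

def mean_agitation_by(calls, key):
    """Average the agitation score of the calls, grouped by the given field."""
    groups = defaultdict(list)
    for call in calls:
        groups[call[key]].append(call["agitation"])
    return {name: sum(scores) / len(scores) for name, scores in groups.items()}

# Invented records standing in for a week's worth of calls.
calls = [
    {"agent": "A", "topic": "billing", "agitation": 0.10},
    {"agent": "A", "topic": "delivery", "agitation": 0.20},
    {"agent": "B", "topic": "billing", "agitation": 0.40},
    {"agent": "B", "topic": "delivery", "agitation": 0.50},
    {"agent": "B", "topic": "billing", "agitation": 0.60},
]

by_agent = mean_agitation_by(calls, "agent")
by_topic = mean_agitation_by(calls, "topic")
print(by_agent)
print(max(by_agent, key=by_agent.get))
```

With only a handful of calls the averages mean little. The point of the passage above is that they become informative once the sample is large enough for random variation to smooth out.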
Using agitation to drive marketing feedback
One of our customers used agitation as a part of the analysis of a major price change announcement. Since the price was being reduced, they were surprised to observe a significant group of calls showing high agitation in conversations where the topic of the price change was discussed. Further analysis showed that a significant number of their customers who had recently signed up to fixed-price contracts were calling to cancel these contracts and were upset that they couldn’t now enjoy the benefits of the price change. The use of speech analytics enabled them to identify and quantify this issue, and be prepared for this with future announcements.
Measuring the nonverbal content of a call is not a ‘silver bullet’ but it does provide a valuable extra dimension for understanding customer experience when combined with the full range of information available from speech analytics.
Why does everyone claim to do emotion detection?
Emotion detection seems to be seen as a key element or differentiator of speech analytic solutions and for that reason was described to me once as “well everyone else says they can do it, so we have to say we can do it even though none of us really can”.
I’m sure that makes sense to a VP of Marketing somewhere. To my knowledge, and I am always willing to be corrected, anything that purports to be able to detect emotion in any currently available solution has to be defining emotion in a very simplistic way.
One reason this has to be so lies in the definition of emotion itself. Are we looking for “emotion” (which is what you feel) or “the expression of emotion” (which is what is displayed to other people)? We know that there are quite complex relationships between the core emotion and the expression of emotion, with variations due to factors such as culture (different cultures have different “display rules”), situation (it’s OK to show some emotions in some situations, not in others), status differences (who the other person is), and the individual.
All in all, we know that there can be tremendous variability in the relationship between emotion and displayed emotion, whereas obviously automatic detection depends on being able to write a set of invariant rules.
A solution to a non-existent problem?
In my view, emotion detection was a solution to a problem that no one had and was dreamed up somewhere by someone as a ‘killer feature’. Surprisingly enough, problems can be identified and cross or happy customers can be detected through understanding the words and phrases that customers use when they are cross or happy.
Once again, you don’t have to over-promise something that doesn’t exist, so why do vendors continue to persist?
- R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, M. Schröder, “Feeltrace: An instrument for recording perceived emotion in real time”, in: Proceedings of the ISCA Workshop on Speech and Emotion, Newcastle, Northern Ireland, UK, 2000.
- A. Mehrabian, Communication without words, Psychology Today 2 (9) (1968) 52–55.
- B. Schuller, S. Steidl, A. Batliner, The Interspeech 2009 emotion challenge, in: Proceedings of Interspeech, Brighton, UK, 2009.
- For a more detailed description of the accuracy achieved by this approach see, ‘Phonetic Search Technology’ white paper by Nexidia Inc. http://www.nexidia.com/technology/phonetic_search_technology