Word Spotting vs. Phonetic Search vs. Speech Recognition

We explore the differences between word spotting, phonetic search and speech recognition.

Functionality

A customer defines one or more keyword lists and, at the same time, specifies which calls the speech engine should analyse (existing recordings or future calls).

The lists are then transferred to the engines together with the call data. The engines analyse the audio data and deliver the result (“In which calls were which keywords detected, and with what confidence level?”).
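
The exchange described above can be pictured as a small data structure. The following Python sketch is illustrative only: the names KeywordHit and calls_with_keywords, the call identifiers and the 0.0 to 1.0 confidence scale are assumptions, not the interface of any particular engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordHit:
    call_id: str       # hypothetical identifier of the recorded call
    keyword: str       # keyword taken from the customer's list
    confidence: float  # assumed scale: 0.0 (no match) to 1.0 (certain)


def calls_with_keywords(hits, keywords, min_confidence=0.5):
    """Map each call to the listed keywords detected in it with at
    least the requested confidence."""
    wanted = {k.lower() for k in keywords}
    result = {}
    for hit in hits:
        if hit.keyword.lower() in wanted and hit.confidence >= min_confidence:
            result.setdefault(hit.call_id, set()).add(hit.keyword.lower())
    return result


# Made-up engine output for three calls.
hits = [
    KeywordHit("call-001", "complaint", 0.91),
    KeywordHit("call-001", "unlock", 0.42),
    KeywordHit("call-002", "unlock", 0.77),
    KeywordHit("call-003", "router", 0.88),
]
relevant = calls_with_keywords(hits, ["Complaint", "Unlock"], min_confidence=0.6)
```

With the made-up hits above, call-001 is kept for “complaint” only, because its “unlock” hit falls below the chosen confidence level.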

Basic Approach

A keyword list is created with the aim of picking out, from all recorded calls, those that are relevant to the user.

Examples:

Calls in which the name of a competitor was mentioned
Complaint calls
Calls in which product names were mentioned, for which a marketing campaign is in progress
Calls which concern a special process, e.g. the unlocking of SIM cards

Added Value – QM

Keyword spotting can be used to filter the relevant calls out of all the recordings. This might be done for one of two purposes: to discover critical calls in order to locate specific needs for improvement, or to select good or poor calls for use as “best/worst practice” examples in training.

Example:
As part of a training measure, coaching is provided on handling difficult customers, especially those who make a complaint or even want to withdraw from an existing agreement. The coach would define a keyword list and analyse the agents’ calls. If necessary, a coaching module can then be sent to the agent.

Added Value – Process

For training managers or supervisors, finding a representative selection of authentic calls for evaluating a specific process is very difficult. With a suitably defined keyword list, the system can perform this pre-selection.

Example:
“Why has the Call Handling Time (CHT) increased during the processing of SIM-Lock Calls?”
Implementation in keyword spotting:

Search for calls with the keyword list: “Mobile, Code, Simlock, Problem, Unlock”
Listen to the calls and record into a data collection plan all calls where the CHT has increased, e.g. during the unlock procedure of Nokia mobiles
Perform a new search for calls with an enhanced keyword list: “Nokia, Code, Simlock, Problem, Unlock”
Listening to the located calls shows why Nokia calls were taking longer: in this case the code entry routine on Nokia mobiles was too complex

Measure:
Develop process instructions (in this example, giving agents guidelines for explaining to the customer how to enter the unlock code, and collaborating with Nokia on which steps have to be observed, e.g. how fast the customer can repeat a specific button push).
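
The search steps of this example can be sketched in a few lines of Python. The call records, handling times and the search helper are invented for illustration. Treating “Nokia” as a required keyword in the second search is an assumption about how the enhanced list is meant to narrow the result.

```python
from statistics import mean

# Invented call records: handling time plus the keywords the engine detected.
calls = [
    {"id": "c1", "cht_seconds": 310, "keywords": {"simlock", "unlock", "nokia"}},
    {"id": "c2", "cht_seconds": 180, "keywords": {"simlock", "code"}},
    {"id": "c3", "cht_seconds": 420, "keywords": {"unlock", "nokia", "problem"}},
    {"id": "c4", "cht_seconds": 150, "keywords": {"router"}},
]


def search(calls, any_of, all_of=()):
    """Calls containing at least one keyword from any_of and every keyword in all_of."""
    any_set = {k.lower() for k in any_of}
    all_set = {k.lower() for k in all_of}
    return [c for c in calls if c["keywords"] & any_set and all_set <= c["keywords"]]


# First search: the general SIM-lock keyword list.
first_pass = search(calls, ["Mobile", "Code", "Simlock", "Problem", "Unlock"])

# Second search: the enhanced list, with "Nokia" required.
nokia_calls = search(calls, ["Code", "Simlock", "Problem", "Unlock"], all_of=["Nokia"])
other_calls = [c for c in first_pass if c not in nokia_calls]

mean_cht_nokia = mean(c["cht_seconds"] for c in nokia_calls)
mean_cht_other = mean(c["cht_seconds"] for c in other_calls)
```

Comparing the two averages only points to where the time is lost. Finding out why still requires listening to the located calls.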

Added Value – Marketing

After the start of a campaign, the marketing manager wants to check whether its central statements are being recognised by customers and whether its predefined goal is being reached.

Example:

A special low-cost bundle of DSL connection and WLAN router is advertised.
The user defines a keyword list: Call, Surf, 2000, 4000, 6000, Order, Application, Installation, Router
The engine analyses the calls and lets the marketing manager check whether customers ask specifically for the bundle or the WLAN router.

Additional advantages:

Possibility of a near-term optimisation of marketing actions
Success control: Do I reach my customer?
Search for calls in which competitors were mentioned, with the aim of competition analysis

Speech to Text (Speech Recognition)

In contrast to keyword spotting, speech-to-text transcription makes the WHOLE call available as text. In essence, the recogniser performs word classification: for each stretch of speech it searches for the best-matching word.

The transcription is performed on recorded calls, not in real time. During call replay the transcription is displayed, and the text passage currently being played is highlighted. Clicking on a text passage moves the audio replay to that point.

Because the whole text is saved in the database, any combinations of search items can be executed to find relevant calls.
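
A minimal sketch of such a combined search, assuming the transcripts are held as plain strings keyed by a call identifier. The function names and sample transcripts are invented, and a production system would more likely rely on the full-text search of its database.

```python
import re
from collections import defaultdict


def build_index(transcripts):
    """Build an inverted index: word -> set of call ids containing it."""
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(call_id)
    return index


def find_calls(index, all_of=(), none_of=()):
    """Call ids containing every word in all_of and no word in none_of."""
    if not all_of:
        return set()
    result = set.intersection(*(index.get(w.lower(), set()) for w in all_of))
    for w in none_of:
        result -= index.get(w.lower(), set())
    return result


# Invented transcripts.
transcripts = {
    "call-001": "I would like to cancel my contract because the router never arrived",
    "call-002": "My router works fine but I want to order a faster connection",
    "call-003": "Please cancel the order I placed yesterday",
}
index = build_index(transcripts)
cancel_without_router = find_calls(index, all_of=["cancel"], none_of=["router"])
router_and_order = find_calls(index, all_of=["router", "order"])
```

Unlike keyword spotting, the search terms here do not have to be known before the calls are processed. Any word that appears in the stored text can be combined with any other.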

Implementation

The textual representation of the call can be displayed, with the text synchronised to the replay.
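
One way to picture the synchronisation is a list of words with their start times. The sketch below assumes the recogniser supplies such timestamps. The sample words and function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical transcript: (start time in seconds, word), sorted by time.
words = [(0.0, "good"), (0.4, "morning"), (1.1, "how"),
         (1.3, "can"), (1.5, "I"), (1.6, "help")]
starts = [start for start, _ in words]


def word_index_at(position_seconds):
    """Index of the word to highlight at the given replay position."""
    return max(bisect_right(starts, position_seconds) - 1, 0)


def seek_position(word_index):
    """Replay position to jump to when the word is clicked."""
    return starts[word_index]


highlighted = words[word_index_at(1.2)][1]  # word being spoken at 1.2 s
jump_to = seek_position(1)                  # user clicks on "morning"
```

The binary search keeps the lookup fast even for long calls, since the highlighted word has to be updated continuously during replay.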

The display distinguishes between the different speakers. Furthermore, sensitive data (names, customer information, etc.) can be masked in the text, in line with the muted passages in the call.
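
As a rough illustration of masking text in step with muted audio, the sketch below replaces every word whose time span overlaps a muted interval. It assumes that words carry start and end times and that the muted passages are known as time intervals. Both structures and all values are invented.

```python
def mask_muted(words, muted_intervals, placeholder="***"):
    """Replace every word whose time span overlaps a muted interval."""
    masked = []
    for start, end, text in words:
        overlaps = any(start < m_end and end > m_start
                       for m_start, m_end in muted_intervals)
        masked.append(placeholder if overlaps else text)
    return " ".join(masked)


# Invented example: (start, end, word) in seconds; the name is muted.
words = [(0.0, 0.3, "my"), (0.3, 0.6, "name"), (0.6, 0.8, "is"),
         (0.8, 1.4, "Jane"), (1.4, 2.0, "Doe")]
muted = [(0.8, 2.0)]
masked_text = mask_muted(words, muted)
```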

Scenarios

Basic Approach:
Basically the transcription covers all search and filter possibilities of keyword spotting. Beyond that, transcription provides the following advantages:

Fast detection of the call content (faster than listening to a call)
No restriction to predefined keywords during the search
Basis for advanced analysis operations: content analytics, categorisation, call model detection
Basis for automated evaluation process
Transfer of the call contents into business information and data warehouse systems (unstructured data will be structured and analysed)

Added Value – Process Management

Transcription makes it possible to structure the unstructured call data so that it can easily be imported into, and processed by, other systems. The content can then be processed in ERP or CRM systems to analyse the different steps of the customer contact process and to discover possible problems in the operation.
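
A minimal sketch of what “structuring” could mean in practice: a speaker-labelled transcript is flattened into one record that another system could import. The field names, category names and trigger words are all hypothetical, and real ERP or CRM imports will have their own formats.

```python
import json


def to_record(call_id, turns, categories):
    """Flatten a speaker-labelled transcript into one structured record.
    `categories` maps a category name to its trigger words."""
    words = set(" ".join(t["text"] for t in turns).lower().split())
    return {
        "call_id": call_id,
        "agent_turns": sum(1 for t in turns if t["speaker"] == "agent"),
        "customer_turns": sum(1 for t in turns if t["speaker"] == "customer"),
        "categories": sorted(name for name, triggers in categories.items()
                             if words & set(triggers)),
    }


# Invented transcript and categories.
turns = [
    {"speaker": "agent", "text": "how can I help"},
    {"speaker": "customer", "text": "I need the unlock code for my mobile"},
    {"speaker": "agent", "text": "I will send the code by text message"},
]
categories = {"sim_lock": ["unlock", "simlock"], "complaint": ["complaint"]}
record = to_record("call-002", turns, categories)
as_json = json.dumps(record)  # ready to hand over to another system
```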

The transcription can be used as the basis for advanced content analytics through data mining. In addition, call pattern detection and content analytics are an important means of discovering the “stumbling blocks” in work procedures.

Added Value – Marketing

Transcription means that unstructured call data can be structured to execute analyses associated with a campaign.

Data mining aligned to the marketing issues can give important information about whether marketing measures are working.

Example:
Searching for specific marketing slogans in the call text, then examining the customer’s subsequent reaction.

Phonetic Search

In the basic speech recognition approach described above, the recogniser tries to transcribe all input speech as a chain of words in its vocabulary. Keyword spotting is a different technique for searching audio for specific words and phrases. In this approach, the recogniser is only concerned with occurrences of one keyword or phrase. Since only the score of that single keyword must be computed (instead of scores for the entire vocabulary), much less computation is required. However, this approach does not work in real time. Therefore a new kind of keyword spotter, known as phonetic-based search, has been developed that performs separate indexing and searching stages. In so doing, search speeds several thousand times faster than real time have been achieved.

Phonetic-based search is designed for extremely fast searching through vast amounts of media, allowing search for words, phrases, jargon, slang and other words not readily found in a speech-to-text dictionary.

Indexing

In the first stage the input speech is indexed to produce a phonetic search track (or lattice). This need be done only once. Using an acoustic model, the indexing engine scans the input speech and produces the corresponding phonetic search track. An acoustic model jointly represents characteristics of both an acoustic channel (an environment in which the speech was uttered and a transducer through which it was recorded) and a natural language (in which human beings expressed the input speech). Audio channel characteristics include frequency response, background noise and reverberation. Characteristics of a natural language include gender, dialect and accent of the speaker.

Note that once indexing has been completed, the original media are not involved at all during searching. The search track can therefore be generated from the highest-quality media available for improved accuracy (for example, μ-law audio for telephony), after which the audio can be replaced by a compressed representation (for example, GSM) for storage and subsequent playback.
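
The separation of the two stages can be sketched as follows. This is a strongly simplified picture: the track is reduced to a single phoneme string per recording rather than a lattice, the class PhoneticArchive is hypothetical, and the indexing engine is replaced by a stand-in function, since a real one requires an acoustic model.

```python
class PhoneticArchive:
    """Keeps one phonetic search track per recording (index once, search often)."""

    def __init__(self, indexing_engine):
        self._engine = indexing_engine  # hypothetical callable: audio -> list of phonemes
        self._tracks = {}

    def index(self, media_id, audio):
        # The audio is needed only here; searching never touches it again.
        if media_id not in self._tracks:
            self._tracks[media_id] = self._engine(audio)

    def search(self, query):
        """Ids of recordings whose track contains the phoneme sequence."""
        n = len(query)
        return [
            media_id
            for media_id, track in self._tracks.items()
            if any(track[i:i + n] == query for i in range(len(track) - n + 1))
        ]


engine_calls = []


def fake_engine(audio):
    """Stand-in for an indexing engine: the 'audio' is already a phoneme string."""
    engine_calls.append(audio)
    return audio.split()


archive = PhoneticArchive(fake_engine)
archive.index("call-001", "DH AH N OW K IY AH")
archive.index("call-002", "K OW D")
first = archive.search(["K", "OW", "D"])
second = archive.search(["N", "OW", "K"])
```

However many searches follow, the engine is called only once per recording, which is what makes it possible to swap the stored audio for a compressed copy afterwards.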

Searching

The second phase, performed whenever a search is needed for a word or phrase, is searching the phonetic search track. Once the indexing is completed, this search stage can be repeated for any number of queries. Since the search is phonetic, search queries do not need to be in any pre-defined dictionary, thus allowing searches for proper names, new words, misspelled words, jargon etc.

A phonetic dictionary is referenced for each word within the query term to accommodate unusual words (whose pronunciations must be handled specially for the given natural language) as well as very common words (for which performance optimisation is worthwhile). Any word not found in the dictionary is then processed by consulting a spelling-to-sound converter that generates likely phonetic representations given the word’s orthography.
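
The lookup order described above (dictionary first, spelling-to-sound converter as a fallback) might look roughly like this. The dictionary entries use ARPAbet-style symbols, and the letter rules are deliberately crude, invented stand-ins for a real converter, which would be far more elaborate and could return several candidate pronunciations.

```python
# Hypothetical pronunciation dictionary (ARPAbet-style symbols).
PHONETIC_DICTIONARY = {
    "nokia": ["N", "OW", "K", "IY", "AH"],
    "code": ["K", "OW", "D"],
}

# Crude illustrative letter-to-sound rules.
DIGRAPHS = {"ck": ["K"], "ph": ["F"], "sh": ["SH"], "th": ["TH"], "ch": ["CH"]}
LETTERS = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}


def spelling_to_sound(word):
    """Guess a phoneme sequence from the spelling alone."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phonemes.extend(DIGRAPHS[pair])
            i += 2
        else:
            # Characters without a rule contribute no phonemes.
            phonemes.extend(LETTERS.get(word[i], "").split())
            i += 1
    return phonemes


def pronounce(word):
    """Dictionary first; fall back to spelling-to-sound for unknown words."""
    return PHONETIC_DICTIONARY.get(word.lower()) or spelling_to_sound(word)


from_dictionary = pronounce("Nokia")
from_rules = pronounce("simlock")
misspelled = pronounce("simlok")
```

In this toy rule set “simlock” and the misspelling “simlok” produce the same phoneme sequence, which hints at why exact spelling matters less in a phonetic search.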

After words, phrases, phonetic strings and temporal operators within the query term are parsed, actual searching commences. Multiple phonetic search track files can be scanned at high speed during a single search for likely phonetic sequences (possibly separated by offsets specified by temporal operators) that closely match corresponding strings of phonemes in the query term. The matching algorithm is probabilistic, allowing scores representing the confidence that the match is correct to be produced. This means that system users can specify confidence limits as part of their search to optimise the hit rate (i.e. so that it finds as many of the correct terms as possible, while avoiding false hits).
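
To make the idea of a probabilistic match with a confidence limit concrete, here is a toy sketch. It assumes the track stores, for each time slot, a few candidate phonemes with probabilities, and it scores an alignment by the geometric mean of the probabilities of the query phonemes. Both the track layout and the scoring rule are illustrative assumptions, not the algorithm of any actual product. Temporal operators and phoneme insertions or deletions are left out.

```python
import math


def search_track(track, query, min_confidence=0.5, floor=0.01):
    """Slide the query over the track and score each alignment by the
    geometric mean of the per-slot probabilities of the query phonemes."""
    hits = []
    for offset in range(len(track) - len(query) + 1):
        log_sum = sum(
            math.log(max(track[offset + i].get(phoneme, 0.0), floor))
            for i, phoneme in enumerate(query)
        )
        confidence = math.exp(log_sum / len(query))
        if confidence >= min_confidence:
            hits.append((offset, round(confidence, 3)))
    return hits


# Invented track: one dict of candidate phonemes per time slot.
track = [
    {"DH": 0.7, "D": 0.3},
    {"AH": 0.8, "IH": 0.2},
    {"N": 0.9, "M": 0.1},
    {"OW": 0.6, "AO": 0.4},
    {"K": 0.8, "G": 0.2},
    {"IY": 0.7, "IH": 0.3},
    {"AH": 0.9, "ER": 0.1},
]
query = ["N", "OW", "K", "IY", "AH"]

hits = search_track(track, query, min_confidence=0.5)
strict_hits = search_track(track, query, min_confidence=0.8)
```

With the invented track above, the query matches once, starting at slot 2. Raising the confidence limit to 0.8 removes that hit, which illustrates the trade-off between finding more correct terms and avoiding false hits.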

Advantages of phonetic searching over speech-to-text and keyword spotting

Speed, accuracy, scalability. The one-off indexing phase allows high accuracy so that the searching phase can make better decisions when presented with specific query terms.
Open vocabulary. Speech-to-text systems can only recognise words found in their lexicons. Many common query terms (such as specialised terminology and names of people, places and organisations) are typically omitted from these lexicons (partly to keep them small enough that recognition can be executed cost effectively in real time, and also because these kinds of query terms are notably unstable as new terminology and names are constantly evolving). Phonetic indexing is unconcerned about such linguistic issues, maintaining completely open vocabulary (or, perhaps more accurately, no vocabulary at all).
Low penalty for new words. Speech recognition lexicons can be updated with new terminology, names, and other words. However, this exacts a serious penalty in terms of cost of ownership because the entire media archive must then be reprocessed. The dictionary within the phonetic searching architecture, on the other hand, is consulted only during the searching phase, which is relatively fast compared to indexing. Adding new words incurs only another search, and it is often unnecessary to add words, since the spelling-to-sound engine can handle most cases automatically, or users can simply enter sound-it-out versions of words.
Phonetic and inexact spelling. Proper names are particularly useful query terms—but also particularly difficult for speech recognition systems, not only because they may not occur in the lexicon as described above, but also because they often have multiple spellings. With phonetic searching, exact spelling is not required.