The Role of AI Science in a World of Democratized AI


Roberto Pieraccini at Uniphore explains the role of AI science in a world of democratized AI.

During the past few years, we have all seen an unprecedented increase in the democratization of technology, particularly within artificial intelligence.

Only a few years ago, building intelligent machines required a deep knowledge of the physical phenomena to model (speech, natural language, images, handwriting) and the ability to wrangle complex and sophisticated techniques, such as signal processing, statistical Machine Learning (ML) and Natural Language Processing (NLP), to cite a few.

With the breakthrough results achieved by Geoffrey Hinton and his collaborators with Deep Neural Networks (DNNs), 2012 marked the beginning of the modern AI era.

This new era has been characterized by never-before-seen improvements in tasks, such as speech and image recognition, and has been fueled by an unprecedented growth in the availability of massive amounts of data and computer power.

During this time, we have also seen a dramatic widening of the applications of DNNs to all aspects of technology.

At the same time, DNNs evolved, thanks to the invention of the attention mechanism, into the modern transformers that gave rise to today’s widely used Large Language Models (LLMs).

The confluence of these advances, the availability of tools and data that have become increasingly open source, and the fact that the baseline technology has become so good that, in many cases, it works right out of the box have fueled the AI revolution of the past few years.

Finally, the introduction and general availability of ChatGPT in late 2022, and the corresponding explosion of analogous, open-source technology, created a phenomenon that we can define as a global democratization of AI.

Today, pretty much anyone with basic computer skills can build compelling demos using tools like ChatGPT and Whisper, without having to understand either the intricacies of speech, images and natural language or sophisticated ML technologies.

Thus, a legitimate question arises in the minds of most of us who have actively taken part, for many years, in the development of the modern computer era: what value can an AI science team bring to today’s industrial technology world?

I believe that there are three areas where AI science will add value to the exploitation of AI in the industrial world: phenomenological knowledge, understanding of machine learning and the rigor of scientific evaluation.

Phenomenological Knowledge

Phenomenological knowledge is the understanding of the characteristics of the phenomena that we are trying to model, such as speech, natural language, conversational interactions and image processing.

AI scientists are trained in one or more of these fields, and they may have developed practical experience in some of them.

For part of my career, I have been actively involved in Automatic Speech Recognition (ASR) research and application.

While I’ve always loved the problem of understanding speech, I learned early on that, no matter what, ASR is never good enough. Even though we have made significant strides over the past 40 years, ASR still does not always fulfill our and our customers’ expectations.

Why is that? The short answer is that the speech phenomenon is complex. My first book, The Voice in the Machine, talks extensively about that complexity.

Without a proper understanding of that complexity, we may fail to understand the behavior of ASR when it is deployed in an application and be unable to use advanced technology to cope with the idiosyncrasies of a complex signal like speech.

There are problems that are intrinsic to the speech phenomenon, and we cannot avoid them. Think about the influence of noise, accented speech, multiple simultaneous speakers (the so-called cocktail party effect), speech collected in a far-field situation, reverberation and so on.

But besides these fundamental problems of speech, which our brain seems to cope with effortlessly, there are others that derive from the intrinsic limitations of all ASR engines, such as a finite vocabulary.

“Conversational applications require not just speech, but higher levels of cognitive abstraction, like discourse, dialog and semantics (i.e., how we put words together to create meanings, converse contextually, ask questions and categorize types of questions).” Roberto Pieraccini, Chief Scientist

Even though modern ASRs operate on sub-word tokens, so every possible word can be accounted for acoustically, the underlying language model has a finite-size vocabulary, and it needs to be updated with new words.

New words appear every day (COVID was not a word just a few years ago), along with rare names, surnames, company names, new product names and more (e.g., words like Banasiewicz, Hadleigh, Hum). When an ASR encounters an unknown word, it tries to match it to a known one.

Besides new words, there are homophones, like blew and blue or owe and oh, and proper names that can be spelled differently, like John and Jon or Allan, Alan, Allen, etc.

There are short phrases that can be segmented into words differently, such as uncommon personal names like Joseph Hardy (which could easily become Joe Sephardi).

Additionally, numbers, dates and acronyms need to be normalized consistently. For instance, “nineteen ninety-eight,” “one thousand nine hundred ninety-eight,” “one nine nine eight” and other possible spoken forms should all be normalized as 1998.
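
To make that normalization step concrete, here is a minimal, illustrative sketch (not from the article) that maps the spoken-number forms above to the same integer; production text normalization also has to handle dates, currencies, acronyms and much more.

```python
# Build a word-to-value table for spoken English numbers.
WORDS = {}
for i, w in enumerate("zero one two three four five six seven eight nine".split()):
    WORDS[w] = i
for i, w in enumerate("ten eleven twelve thirteen fourteen fifteen sixteen "
                      "seventeen eighteen nineteen".split()):
    WORDS[w] = 10 + i
for i, w in enumerate("twenty thirty forty fifty sixty seventy eighty ninety".split()):
    WORDS[w] = 20 + 10 * i

def spoken_to_int(text: str) -> int:
    """Normalize a spoken number ('nineteen ninety-eight') to an integer."""
    tokens = text.lower().replace("-", " ").split()
    # Pure digit sequences ("one nine nine eight") are read positionally.
    if len(tokens) > 1 and all(WORDS.get(t, 10) < 10 for t in tokens):
        return int("".join(str(WORDS[t]) for t in tokens))
    total = current = 0
    for t in tokens:
        if t == "hundred":
            current *= 100
        elif t == "thousand":
            total, current = total + current * 1000, 0
        # Year-style pairs: "nineteen" + "ninety..." reads as 19xx.
        elif 10 <= current <= 99 and WORDS[t] >= 10:
            current = current * 100 + WORDS[t]
        else:
            current += WORDS[t]
    return total + current

for form in ("nineteen ninety-eight",
             "one thousand nine hundred ninety-eight",
             "one nine nine eight"):
    assert spoken_to_int(form) == 1998
```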

All of that, and much more, can affect the quality of the ASR transcription; a non-normalized number, for instance, can be erroneously counted as an error by the automatic evaluation procedure.

These intrinsic speech issues cannot always be resolved by a more accurate engine, but they require an understanding of the context where the ASR is deployed and a deeper integration with the product.

For instance, in a self-serve application, the ASR needs to be aware of the questions the system asks the user and, in general, of the previous turns of the conversation to properly bias its choice of words in case of acoustic or lexical ambiguity.
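
As a hedged illustration of such biasing, the sketch below rescores a hypothetical n-best list from an ASR using terms made likely by the question the system just asked; the (transcript, log-score) format and the example terms are assumptions for illustration, since real engines expose richer hypothesis structures.

```python
# Illustrative contextual rescoring of ASR hypotheses.
def rescore(nbest, expected_terms, bonus=2.0):
    """Prefer hypotheses containing words the dialog context makes likely."""
    def score(hyp):
        text, acoustic = hyp
        overlap = sum(term in text.lower().split() for term in expected_terms)
        return acoustic + bonus * overlap
    return max(nbest, key=score)

# The system asked "Which city are you departing from?",
# so city names are likely in the answer.
nbest = [("new work please", -12.1), ("newark please", -12.4)]
print(rescore(nbest, {"newark", "boston", "chicago"}))  # ('newark please', -12.4)
```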

Finally, one of the most glaring limitations of today’s commercial ASRs is the lack of integration with prosodic signals, including stress, pauses and so on, which in many cases can help disambiguate competing interpretations.

There are a myriad of other problems (such as training ASR on families of languages, like English, Spanish, Hindi and Arabic with all their variants and dialects, or on low-resource languages, like Swiss German and several Indic languages) and phenomena, like code-switching (when words and entire phrases of one language are inserted into utterances of another, such as English words in Hindi speech or the “Spanglish” spoken by Hispanic Americans).

However, conversational applications require not just speech, but higher levels of cognitive abstraction, like discourse, dialog and semantics (i.e., how we put words together to create meanings, converse contextually, ask questions and categorize types of questions).

Natural Language Processing (NLP) includes most of the technologies that work on any arbitrary textual representation of an utterance, or a chat, and try to extract meaningful information.

Natural Language Understanding (NLU) occupies a special place in NLP, since it empowers complex interactional systems like self-service or virtual assistant applications (as described in my second book for the general audience, AI Assistants) by extracting symbolic meaning representation from text.

Those meaning representations are generally structured as intents and their corresponding arguments (or slots). While a few legacy systems still use technology such as keyword matching, ML-based NLU provides increased flexibility, as it can cope with the vast variability of human language; yet it requires sample training data for each specific domain.

Today’s generative AI and Large Language Models (LLMs) can enable zero-shot solutions that can work out of the box.

However, even using pre-trained LLMs, one needs to understand the influence of conversational context for the correct categorization of intents and slots, and the integration of other contextual signals derived from the knowledge about the user, previous interactions, etc.
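
A minimal sketch of what such zero-shot, context-aware classification might look like follows; the llm_complete function is a placeholder for whatever completion API is used, and the intent labels and slots are invented for illustration.

```python
import json

def llm_complete(prompt: str) -> str:
    """Placeholder: wire this to your LLM client of choice."""
    raise NotImplementedError

PROMPT = """Conversation so far:
{history}

Classify the user's last utterance into one of these intents:
book_flight, cancel_booking, check_status, other.
Also extract any slots (city, date) you find.
Answer with JSON only: {{"intent": "...", "slots": {{}}}}

User: {utterance}
JSON:"""

def classify(history: str, utterance: str) -> dict:
    # The conversation history is part of the prompt, so the model can
    # resolve context-dependent utterances like "the same day, please".
    return json.loads(llm_complete(PROMPT.format(history=history,
                                                 utterance=utterance)))
```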

More sophisticated applications of NLP, like Question Answering (QA), chatbots and agents, require a deeper understanding of the language phenomena.

QA is a particularly complex application of NLP, especially considering that there are no limitations on the types of questions that a user may ask (such as questions about facts, comparisons that may include calculations, complex questions that require multiple searches and reasoning).

Building a QA system and other natural language-based systems is also complicated by the common usage of explicit or implicit referential expressions that need to be resolved by the knowledge of the conversational context, such as “How much does it cost?”, “What about tomorrow?” or “I’ll take the second one!”.

Finally, the ultimate application of NLP is building virtual agents that could be described in a simplistic manner as chatbots that do things.

Agents need to understand, be able to respond, decide whether to use external knowledge or invoke actions through APIs and track the status of the realization of a user’s goal.
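
A simplistic sketch of such an agent loop is shown below; the planner, the tool registry and the state layout are all illustrative assumptions, not a description of any particular product.

```python
# Illustrative agent loop: plan() (not defined here) maps the dialog state
# to either a tool call or a direct reply; TOOLS is a toy registry.
TOOLS = {
    "get_order_status": lambda order_id: {"order_id": order_id,
                                          "status": "shipped"},
}

def agent_turn(state: dict, user_msg: str) -> str:
    state["history"].append(("user", user_msg))
    action = plan(state)                     # decide: reply or call a tool
    while action["type"] == "tool":
        result = TOOLS[action["name"]](**action["args"])
        state["history"].append(("tool", result))
        action = plan(state)                 # re-plan with the tool result
    state["goal_done"] = action.get("goal_done", False)  # track user's goal
    state["history"].append(("agent", action["reply"]))
    return action["reply"]
```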

Machine Learning Knowledge

AI scientists are computer scientists who specialize in ML techniques and have practical experience with them, either at the theoretical level or as applied to one or more fields (such as speech, NLP or image processing).

ML is about creating and using techniques that learn from data to accomplish complex tasks. ML has evolved from simple pattern-matching algorithms to more sophisticated statistical models, and the recent explosion of deep neural networks is the latest iteration of that development.

Generative AI (GenAI) is epitomized by pretrained LLMs, which gained popularity when OpenAI exposed its GPT models and chatbots to the wider public, followed by a myriad of LLM releases made public by other institutions.

The concept of a language model (i.e., a mechanism to predict the next word in a sentence) is not new.

It can be traced back to Claude Shannon and his seminal paper, “A Mathematical Theory of Communication,” where he planted the seeds of the probabilistic approach that was exploited, foremost in ASR until about a decade ago, with predictors based on word n-grams.
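
To make the idea concrete, here is a toy bigram predictor in that n-gram spirit, trained on a few words instead of the massive corpora real ASR language models used:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the huge text collections of real systems.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most likely next word given the previous one."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' -- the highest-count continuation
```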

However, LLMs were startling for many of us who had been working for many years in ML, and particularly in NLP. Up until a few years ago, we had not seen a “next word prediction model” as powerful as today’s LLMs.

What was amazing is that the simple act of predicting the next word in a sentence can lead to conversations that are almost indistinguishable from human conversations and that may seem to reflect elements of common sense and reasoning.

“Up until a few years ago, we had not seen a “next word prediction model” as powerful as today’s LLMs. What was amazing is that the simple act of predicting the next word in a sentence can lead to conversations that are almost indistinguishable from human conversations” Roberto Pieraccini, Chief Scientist

Proper prompting of an LLM, in the same way as we would prompt a human with an immensely vast knowledge, can make it carry out tasks that, only a few years ago, would have required the design and training of complex, special-purpose machine learning models.

Prompt engineering is mostly still an art rather than a science, one that requires trial-and-error iterations using a sound evaluation paradigm to test the effect of each prompt.
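
By way of illustration, such an evaluation paradigm for prompts can be as simple as scoring each candidate against a small labeled set; run_llm, the examples and the candidate prompts below are all hypothetical.

```python
def evaluate_prompt(prompt_template, examples, run_llm):
    """Fraction of labeled examples the prompt gets right."""
    correct = 0
    for text, expected in examples:
        answer = run_llm(prompt_template.format(text=text)).strip().lower()
        correct += (answer == expected)
    return correct / len(examples)

# Hypothetical labeled set and candidate prompts for a sentiment task.
examples = [("the soup was cold and late", "negative"),
            ("quick delivery, great taste", "positive")]
candidates = [
    "Label the sentiment of: {text}",
    "Is this review positive or negative? Reply with one word.\n{text}",
]
# best = max(candidates, key=lambda p: evaluate_prompt(p, examples, run_llm))
```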

However, the LLM behaviors achievable by prompting have limits, which can be overcome only by fine-tuning the model using task-specific data.

While there are fine-tuning techniques, like Low-Rank Adaptation, or LoRA, that limit the number of parameters requiring updates, there are also several types of fine-tuning techniques based on how the tuning data is organized.

For instance, one could simply adapt the LLM to a new domain using unsupervised data or perform task tuning by presenting examples of prompts and desired outputs.
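
For a rough idea of how LoRA limits the parameters to update, the PyTorch sketch below freezes a pretrained linear layer and adds a trainable low-rank update B·A, so only r × (d_in + d_out) parameters are trained; the dimensions and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pretrained Linear with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # W x + (alpha/r) * B (A x): the base model is left untouched.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)  # e.g., one attention projection
```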

Finally, safety and factuality of an LLM are extremely important considerations, and AI scientists have devised techniques to avoid ethically unacceptable responses and reduce the phenomenon of hallucinations, or outputs not based on any input or truthful background knowledge.

Achieving that requires the adoption of techniques such as Reinforcement Learning from Human Feedback (RLHF).

So, is GenAI really magic? Can GenAI solve all AI problems? Of course not, but it is a tool that can, in principle, solve many of the problems that required expert ML design in the past.

However, there is a huge gap between “I tried that on ChatGPT, and it worked” and building a robust solution that “works” nearly 100 percent of the time.

And what does “it works” mean? How can we even define “it works” with measurable objective quantities that can be used to compare different versions, choose the optimal parameters and configuration and guarantee a certain accuracy of results?

That’s one among the many jobs of AI science: a rigorous evaluation process.

Rigorous Evaluation Process

If you can’t measure, you can’t improve. This truism applies to everything we build: we cannot improve the quality, performance, cost, etc. of any artifact if we don’t know how to measure those attributes.

Today, many developers can build impressive demos, but often those demos are not instrumented for the rigorous measurement of quantities (metrics) that help gauge the quality and performance of the product before and after deployment.

There are customary primary metrics that the ML disciplines have used for decades, such as accuracy, precision, recall, computational cost and latency, but also secondary correlates of customer satisfaction and ROI, like task completion, escalation, human feedback and user abandonment rate.
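
As a reminder of what the primary metrics measure, here is a minimal sketch, with toy labels included purely for illustration:

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))   # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
print(precision_recall(y_true, y_pred))  # (0.666..., 0.666...)
```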

So, it is the responsibility of AI science to create and adopt sound metrics that correlate with all the above quantities, and the mechanisms that allow for their continuous and automated measurement.

Some of the metrics may require a human in the loop. For instance, to evaluate the accuracy of an AI-generated summary, we may need to start with trained annotators who can deem whether a summary is accurate or not (keeping in mind that what defines “accurate” may not be straightforward).

However, while using a human annotator guarantees a certain precision of a metric, assuming the human raters are properly trained, that’s hard to scale.

Thus, one of the many goals of AI science is to automate, with a high level of precision, the measurement of metrics that would otherwise be performed by humans.

“It is the responsibility of AI science to create and adopt sound metrics that correlate with all the quantities and the mechanisms that allow for their continuous and automated measurement.” Roberto Pieraccini, Chief Scientist

Getting feedback regularly is the key to maintaining the health of the AI systems deployed to customers and continuing to improve them.

However, several issues make that hard. The diversity of architectural implementations can make it difficult to develop a general feedback strategy: some deployments are on customer premises, and some are on a cloud platform, which is a source of additional complexity.

Then there is the problem of data privacy and, finally, the calculation of accuracy, which may require the precise annotation of the users’ and agents’ utterances.

However, without the possibility of continuous feedback, we are left with open-loop systems that may degrade over time and miss the opportunity to learn from the data.

There are several AI science initiatives that try to overcome some of the above issues. One of them is the possibility of automatic annotation, without necessarily introducing humans in the loop, and thus preserving the privacy and confidentiality of the data.

That is accomplished by using AI systems that are more computationally expensive than the ones used in production; since they are not required to produce results in real time, they can achieve a higher accuracy and thus allow the creation of annotated data used as a reference for monitoring and tuning.

An example of those methods is based on model ensembles: several models running in parallel, each producing a response that is then moderated by a classifier that chooses the one that is presumably the most accurate.

That method can be applied to create reference transcriptions and semantically annotated data.
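
A bare-bones sketch of the idea follows; here, simple agreement voting stands in for the moderating classifier described above, and the model functions are hypothetical stand-ins for heavier offline systems.

```python
from collections import Counter

def auto_annotate(audio, models, min_agreement=0.6):
    """Keep a transcript as a pseudo-reference only if enough of the
    offline ensemble agrees on it."""
    outputs = [m(audio) for m in models]       # run the ensemble offline
    best, votes = Counter(outputs).most_common(1)[0]
    if votes / len(models) >= min_agreement:
        return best                            # usable reference transcript
    return None                                # too uncertain: discard
```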

In conclusion, we think that AI science adds dramatic value to the deployment of AI artifacts, such as ASR and NLP, even in today’s democratized world where everyone is exposed to the power of out-of-the-box generative AI.

To appreciate this, we need to understand the importance of the knowledge of phenomena like speech and natural language, a deep understanding of the workings of machine learning, including large language models, and the use of rigorous scientific evaluation methodologies and proper metrics.

That’s why we employ teams of scientists with deep ML and language expertise to develop the next generation of enterprise AI solutions.

Author: Guest Author

Published On: 14th May 2024
