Since ChatGPT first came onto the scene and captivated entrepreneurs (and the world), we’ve witnessed a massive explosion in products and services using LLMs to address the “lowest-hanging fruit” use cases for generative AI: text-based tasks, spanning everything from creating legal contracts and job descriptions to drafting emails and website copy.
Demand for text-based AI solutions remains high. AI can take over time-consuming tasks like creating first drafts, refocusing employee effort on more complex work. But much more of our day-to-day work requires data types and capabilities beyond text, such as speaking with customers and reasoning over complex images and graphical data. Today, use cases like these are no longer off the table.
The emergence of multimodal models has created opportunities for vertical AI to impact a much larger share of the economy than previously imagined by expanding beyond text-based tasks and workflows. In Part II, we report on new models that support a variety of data types (audio, video, voice, and vision), promising early applications of improved voice and vision capabilities, and the potential of AI agents to change how businesses operate.
Exciting developments in multimodal architecture
In the past 12 months, new models have emerged that demonstrate significant advances in contextual understanding, reduced hallucinations, and overall reasoning. In certain models, performance across speech recognition, image processing, and voice generation is approaching (or, in some cases, surpassing) human capabilities, unlocking many new use cases for AI.
Voice capabilities
We’ve seen rapid progress on two core components of the conversational voice stack: speech-to-text models (automatic speech recognition) and text-to-speech models (generative voice). Dozens of vendors now provide models with these capabilities, which has led to a flurry of new AI applications, particularly for conversational voice.
Most of these applications rely on what’s called a “cascading architecture”: voice is first transcribed to text, that text is fed into an LLM to generate a response, and the LLM’s text output is then passed to a generative voice model to produce an audio response. Until very recently, this has been the best way to build conversational voice applications. However, the approach has a few drawbacks, primarily that it adds latency and that non-textual context (e.g., the end user’s emotion and sentiment) gets lost in the transcription step.
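To make the cascading approach concrete, here’s a minimal sketch of the three-hop pipeline, assuming the OpenAI Python SDK; the model names (whisper-1, gpt-4o-mini, tts-1), the voice, and the file paths are illustrative choices, and any vendor’s speech-to-text, LLM, and text-to-speech endpoints could be swapped in.

```python
# Minimal sketch of a cascading voice pipeline: ASR -> LLM -> TTS.
# Assumes the OpenAI Python SDK; model names and file paths are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Speech-to-text: transcribe the caller's audio into text.
with open("caller_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2) LLM: generate a text response to the transcribed request.
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful phone support agent."},
        {"role": "user", "content": transcript.text},
    ],
)
reply_text = completion.choices[0].message.content

# 3) Text-to-speech: synthesize the reply as audio to play back to the caller.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
speech.stream_to_file("caller_reply.mp3")
```

Each hop in the chain adds a network round trip, which is where the extra latency comes from, and only the transcribed words survive into the second stage, which is why tone and emotion get dropped.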
As of this writing, a new generation of speech-native models is being released, including OpenAI’s Realtime API, which supports speech-to-speech interactions via GPT-4o, as well as open-source projects such as Kyutai’s Moshi. Developing models capable of processing and reasoning over raw audio has been an active area of research for many years, and it has been widely acknowledged that speech-native models would eventually replace the cascading architecture.
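For contrast, here’s a rough sketch of a speech-native, speech-to-speech exchange over the Realtime API’s WebSocket interface, collapsing the three hops above into a single bidirectional stream. It assumes the third-party websockets package, and the endpoint URL, model name, and event types reflect the beta documentation at the time of writing, so treat them as assumptions to verify.

```python
# Rough sketch of a speech-to-speech exchange over OpenAI's Realtime API (beta).
# The URL, model name, and event types below follow the beta docs at the time
# of writing and should be verified against current documentation.
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

async def ask_by_voice() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Older releases of the websockets package use extra_headers;
    # newer ones call the same argument additional_headers.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session for audio in and audio out.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"modalities": ["audio", "text"], "voice": "alloy"},
        }))
        # Send one clip of raw PCM16 audio (base64-encoded) as the user's turn.
        with open("caller_question.pcm", "rb") as f:
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(f.read()).decode(),
            }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        # Collect the model's spoken reply as it streams back.
        reply_audio = bytearray()
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                reply_audio.extend(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":
                break
        with open("caller_reply.pcm", "wb") as f:
            f.write(bytes(reply_audio))

asyncio.run(ask_by_voice())
```

Because audio flows directly in and out of one model, there is no transcription step to strip out tone or add a round trip, which is what enables the latency and context gains described below.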
Speech-native models have substantially lower latency (under 500 milliseconds) than previous approaches. They can also capture much more context from users (e.g., tone, sentiment, and emotion) and generate responses that reflect it, making exchanges feel more natural and more likely to address the user’s needs. Over the next few years, we anticipate a step-function change in the speed and quality of conversational voice applications as more of them are built on these new and improved models.