OMG, Machines Can See, Hear, and Talk Like Humans Thanks To Multimodal AI

Sorab Ghaswalla
8 min read · May 29, 2024

--

Artificial intelligence (AI) machines can now respond to almost any input in real time, with hardly any hesitancy (latency) and… get this… with pretty much any emotion you want them to.

In November 2023, I wrote:

The next BIG thing coming to AI is multimodal. As multimodal gen-AI continues to develop, it has the potential to dramatically change the way we interact with computers, information and each other.

A week or so ago, OpenAI and Google both unveiled their multimodal user-interface (UI)/AI agents, “ChatGPT-4o” and “Project Astra”, respectively.

Thus, the second step in man-machine interaction has been taken. And believe me when I say this: it is BIG. It will fundamentally change the way we humans interact with machines, maybe within the next 6–9 months.

Talking with a multimodal AI agent is like conversing with a friend who understands your words, and can also pick up on your facial expressions, tone of voice, and even the gestures you use.

Multimodal AI — The Interface

Most of you know how computer interfaces (how man interacts with a machine) have evolved over the last 50 years. We’ve progressed from command-line interfaces (yeah, remember ‘em?) to the graphical user interfaces (GUIs) we see on our desktops, to advanced technologies like touch screens, voice assistants, and augmented reality.

Your web browser is an interface for getting content. The touch screen on smart computing devices like your phone, with gestures like pinch and swipe, is an interface. Voice assistants like Siri are interfaces, too.

Now think of all of the above + AI = Multimodal AI.

Multimodal gen-AI is a type of AI that can take in and create content in many “modalities”, or forms, such as text, images, and audio. The user can use any of these forms (text, image, or audio) to communicate with, or give instructions to, the machine. This contrasts with traditional gen-AI, which was typically limited to input and output in a single modality, i.e. text.
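(For the more technically inclined, here is a minimal sketch of what a single multimodal request can look like in code. It uses OpenAI’s Python SDK and the gpt-4o model purely as an illustration, and the prompt and image URL are made up; the point is simply that text and an image travel to the model together, in one prompt.)

```python
# A minimal sketch of a multimodal request: one prompt that mixes text and an
# image. The model name, prompt, and image URL are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this photo, and does anything look broken?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/clothes-iron.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)  # the model's text reply
```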

So think of a Siri or an Alexa that’s vastly more powerful, capable, and responsive. Alexa, Siri and the rest were “unimodal” (voice-activated), with a human-sounding voice responding to verbal questions/inputs. Honestly, they sounded like someone emotionlessly reading from an online search results page. Siri and Alexa can also struggle with accents, slang, or background noise; multimodal AI copes with these far better.

Multimodal AI agents are far superior to those relatively simple, machine-learning-led voice assistants. These agents can understand text, voice, and image inputs/commands, and even a user’s facial expressions, and can respond to all of those with emotion.

They come with memory of the conversation, reply with barely any latency/lag, and do so in near real time. You can even interrupt a multimodal AI agent mid-conversation and ask for additional information on your original input, something that was not possible before.

Conversation between man and machine now feels more natural, almost as if you and a human friend were chatting with each other. And this will only keep getting better over time.

What’s more, multimodal AI agents can factor in non-verbal cues like tone of voice and facial expressions. This allows for a more natural back-and-forth conversation, similar to how humans communicate.

Recent Advancements: ChatGPT-4o and Project Astra

Two recent advancements have set humans on the path to multimodal AI interaction.

Let’s look at OpenAI’s ChatGPT-4o first.

This successor to the popular ChatGPT-4 now has a voice (remember the Scarlett Johansson controversy?). It builds upon its text-based prowess by incorporating image recognition. You can converse with it. You can, for example, show it a photo of an object, and the chatbot will answer questions about it! While all previous editions of ChatGPT primarily focused on TEXT input and output, ChatGPT-4o incorporates additional modalities to create a more comprehensive understanding and generation of language.

Here’s a deeper dive into what makes ChatGPT-4o unique:

The way it works is: you download the ChatGPT app on your PC or your smartphone. You then point the camera at any object or destination, or anything at all for that matter, and ask ChatGPT as many questions as you like, by voice. A human-like voice, with all the emotions and pauses and ahs and ums, will then answer all your queries, and maybe even mildly flirt with you if you want, to make you blush!

(Tip: Please DO watch the two videos on ChatGPT-4o and Project Astra further down this article. You will be blown away.)

You can use it to solve half-finished math problems, spot errors in your recipe, sing you a lullaby… whatever.

ChatGPT-4o can process information beyond just text. This could include:

Speech: Recognizing and understanding spoken language, including tone, intonation, and sentiment.

Images: Analyzing images and using them to inform the context of a conversation or guide text generation.

Emojis: Interpreting the sentiment and meaning conveyed through emojis in text-based communication.

With a richer understanding of context, ChatGPT-4o can:

Generate more nuanced and human-like text: It can tailor its responses based on the perceived emotions and intent behind the user’s input.

Adapt to different writing styles: It can adjust its language depending on the situation, whether it’s a casual chat, a formal email, or a creative writing prompt.

Respond with different creative text formats: Beyond standard text, it might be able to generate poems, scripts, musical pieces, or even computer code based on the user’s request and the additional context provided.

Voice capabilities: A significant addition is the integration of voice functionality. Users can have spoken conversations with ChatGPT-4o directly, with the model understanding and responding in real time using synthesized speech (a rough sketch of what such a spoken round-trip involves follows below).
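To be clear, this is not OpenAI’s actual pipeline; GPT-4o is built to handle audio end-to-end inside the model. But as an illustration of the moving parts of a spoken exchange, here is a hypothetical round-trip stitched together from the same SDK’s transcription, chat, and speech-synthesis calls. The model names and file names are just examples.

```python
# Illustrative only: a transcribe -> chat -> speak round-trip. GPT-4o's native
# voice mode works end-to-end on audio; this sketch just shows the steps of a
# spoken exchange using separate API calls.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Speech in: transcribe the user's recorded question (file name is made up).
with open("question.mp3", "rb") as audio_in:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_in)

# 2. Think: send the transcribed text to the chat model.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = chat.choices[0].message.content

# 3. Speech out: synthesize the reply as audio and save it.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
with open("reply.mp3", "wb") as audio_out:
    audio_out.write(speech.read())
```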

Now onto Google’s Project Astra:

Project Astra is an AI agent developed by Google that understands and responds to information from various sources, including text, speech, images, and even the real world through a smartphone camera.

This focus on multimodality allows Project Astra to be a more versatile and powerful AI assistant. For instance, you can show it a picture of a malfunctioning clothes iron through your phone. It could then analyze the image, understand the context from your voice explanation, and offer solutions or connect you with a repair service.

By combining its multimodal capabilities with the power of existing AI models like Gemini, Project Astra sets the stage for a future where AI agents can seamlessly integrate into our lives, understanding not just our words but also the world around us.

CHECK OUT THESE TWO VIDEOS

https://youtu.be/nXVvvRhiGjI

How will Multimodal AI Change Our Interactions With Machines?

AI assistants or AI agents are already on the way (Microsoft Copilot, for example). All of them will have a multimodal AI UI, a crucial component for interaction between man and AI machine.

Think of these agents as on-demand experts in numerous fields — law, medicine, arts, politics, and more — ready to provide consultation whenever needed. They will feature distinct digital human-like personas, and users will interact with them through multimedia interfaces that blend voice, visuals, and text, offering a conversational experience similar to communicating with another person.

Going Beyond the Gift of the Gab: How Multimodal AI Makes AI Agents Eloquent…and Expressive

One of the most exciting applications of multimodal AI lies in its ability to revolutionize how AI agents communicate with humans. Here’s how:

1. Reading Between the Lines: Imagine a conversation where you say “Sure” with a sigh, implying reluctance. A multimodal AI agent, by processing your tone of voice, might respond with, “Is everything alright? Sure doesn’t sound like it.” This nuanced understanding allows for natural back-and-forth conversations that go beyond just words.

2. Cracking the Non-Verbal Code: Multimodal AI can analyze facial expressions and gestures alongside speech. A smile during a complaint can indicate sarcasm, while furrowed brows with a question might suggest confusion. By factoring in these visual cues, AI agents can adjust their responses to better reflect the user’s intent.

3. The Power of Context: Multimodal AI can analyze past interactions and the current situation to understand the context of a conversation. For instance, if you’re talking about a movie you just saw with friends, the AI agent can reference the movie title you mentioned earlier, rather than needing you to repeat it. This creates a smoother and more natural flow of conversation.

4. Overcoming Latency: One challenge of real-time conversations is latency, the time it takes for a system to process information and respond. Multimodal AI models are optimized for speed, allowing for near-instantaneous responses that feel fluid and natural.
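None of the labs has published exactly how these cues get combined, but a purely hypothetical sketch makes the idea concrete: take what was said, how it sounded, and what the face showed, and let all three shape the reply. Every class, label, and rule below is invented for illustration.

```python
# Purely hypothetical sketch: combining verbal and non-verbal cues to pick a
# response style. All classes, labels, and rules are invented for illustration.
from dataclasses import dataclass


@dataclass
class Cues:
    text: str               # what the user said (transcript)
    voice_tone: str         # e.g. "flat", "cheerful", "tense" (from audio analysis)
    facial_expression: str  # e.g. "smile", "furrowed_brow", "neutral" (from vision)


def choose_response_style(cues: Cues) -> str:
    """Pick a response style that reflects more than just the words."""
    said_yes = cues.text.strip().lower() in {"sure", "fine", "ok", "okay"}
    if said_yes and cues.voice_tone in {"flat", "tense"}:
        return "check_in"           # "Is everything alright? Sure doesn't sound like it."
    if cues.facial_expression == "furrowed_brow":
        return "clarify_patiently"  # the user looks confused: slow down, re-explain
    if cues.facial_expression == "smile" and cues.voice_tone == "tense":
        return "acknowledge_sarcasm"
    return "neutral"


# Example: the words say yes, the sigh says otherwise.
print(choose_response_style(Cues(text="Sure", voice_tone="flat", facial_expression="neutral")))
# -> "check_in"
```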

Cut to the Chase: How Will AI Agents Really Help Us?

AI agents empowered with multimodal capabilities will help humans in several ways. These machines will be able to anticipate our needs. So imagine a car that adjusts the temperature based on the driver’s facial expressions, or a home thermostat that learns your preferences.

Multimodal AI could personalize games or movies based on your emotional responses. If it detects a bored look on your face, it may switch to a new movie, perhaps without you even asking it to.

Or think of a virtual assistant who understands your frustration from your voice and suggests calming techniques.

Or a customer service agent who can decipher your emotions from your facial expressions and tailor its response accordingly. A customer service AI, for example, might analyze your tone of voice alongside your message to better understand your sentiment.

Or imagine video conferencing software that translates sign language or captions emotions for a more inclusive experience.

Or a non-Italian, English-speaking tourist in Rome asking a local for directions in English, and hearing the local’s Italian reply translated into English in real time.

With a richer data pool, AI agents can make more informed decisions. A self-driving car, for example, could use visual data alongside LiDAR (light detection and ranging) to navigate complex road situations.

………

It’s clear that the way humans communicate with machines is set to change drastically. By driving AI agents, multimodal AI will significantly boost our efficiency and perhaps our creativity. Our relationships and interactions with technology will be fundamentally transformed, becoming more personal, intelligent, and collaborative than ever before.

--

Sorab Ghaswalla

An AI Communicator, tech buff, futurist & marketing bro. Certified in artificial intelligence from the Univs of Oxford & Edinburgh. Ex old-world journalist.