Text, Images, Audio, Video - Oh My!

In partnership with

The Lightwave

Practical Insights for Skeptics & Users Alike…in (Roughly) Two Minutes or Less

Learn AI in 5 Minutes a Day

AI Tool Report is one of the fastest-growing and most respected newsletters in the world, with over 550,000 readers from companies like OpenAI, Nvidia, Meta, Microsoft, and more.

Our research team spends hundreds of hours a week summarizing the latest news and finding you the best opportunities to save time and earn more using AI.

"There is no doubt that any chief digital transformation officers or chief AI officers worth their salt will be aware of multimodal AI and are going to be thinking very carefully about what it can do for them."

-Henry Ajder, founder of AI consultancy Latent Space

Think Like Me

Yesterday, we shared a bit about OpenAI’s recent announcement of GPT-4o mini, with a particular emphasis on its multimodal capabilities: a way to bring us closer to mirroring the multisensory way humans process information from various data streams (audio, text, visual, etc.).

Today, we’ll take a deeper look at how and why multimodal AI is so transformational.

Recap of Definition

Multimodal AI refers to artificial intelligence systems capable of processing and integrating multiple types of input data, such as text, images, audio, and video.

Unlike traditional AI models that specialize in a single data type (unimodal), multimodal AI aims to create a more comprehensive and nuanced understanding of information, much like the human brain does when interpreting the world around us.

How We Got Here

The evolution of multimodal AI can be traced through several key developments and current applications:

Early foundations: The roots of multimodal AI can be found in separate advancements in natural language processing, computer vision, and speech recognition. These individual fields laid the groundwork for future integration.

Fusion techniques: Researchers developed methods to combine different data modalities, such as early image captioning systems that merged visual and textual information.

Transformer architecture: The introduction of transformer models, originally designed for language tasks, proved adaptable to other modalities, accelerating multimodal integration.

(In Jar-b-Gone: It’s like having an assistant who was originally great at writing emails, but then you realized you could teach them to caption photos, transcribe voice messages, and even describe video clips using the same core skills. This versatility of the Transformer architecture has been a key factor in speeding up the development of multimodal AI systems that can understand and work with many different types of information simultaneously; a toy code sketch of this idea appears at the end of this section.)

GPT-3 and DALL-E: While not fully multimodal, these models demonstrated the potential for AI to bridge modalities, with GPT-3 generating remarkably fluent text and DALL-E generating images from text prompts, pushing the boundaries of cross-modal understanding.
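As promised in the Jar-b-Gone aside, here is a minimal PyTorch sketch of that “same core skills” idea. Every detail (the model sizes, the 16x16 patches, the fake inputs) is invented for illustration rather than taken from any particular model; the point is simply that once words and image patches become sequences of embeddings, one transformer encoder can process them together.

```python
# A toy sketch (PyTorch), not any production system's code: once text tokens
# and image patches are both turned into sequences of embeddings, the same
# transformer encoder can attend over them jointly.
import torch
import torch.nn as nn

D_MODEL = 256  # shared embedding size (illustrative choice)

# Per-modality "tokenizers": word IDs -> embeddings, flattened image patches -> embeddings.
text_embed = nn.Embedding(num_embeddings=10_000, embedding_dim=D_MODEL)
patch_embed = nn.Linear(16 * 16 * 3, D_MODEL)  # 16x16 RGB patches, flattened

# One shared transformer encoder for the combined sequence.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=4,
)

# Fake inputs: one 12-word caption and one image cut into 49 patches.
token_ids = torch.randint(0, 10_000, (1, 12))
patches = torch.randn(1, 49, 16 * 16 * 3)

# Embed each modality, concatenate along the sequence axis, and let
# self-attention relate words to patches.
sequence = torch.cat([text_embed(token_ids), patch_embed(patches)], dim=1)
fused = encoder(sequence)
print(fused.shape)  # torch.Size([1, 61, 256])
```

That shared sequence-of-embeddings interface is what lets the architecture stretch across modalities without being redesigned for each one.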


The Challenge of These Systems

There are many reasons why training multimodal systems is more difficult than training unimodal ones.

Complexity of data integration:
Multimodal systems need to process and integrate data from different modalities (e.g. text, images, audio), each with its own unique characteristics and representations. This requires developing complex architectures that can effectively combine and align information across modalities.
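For a rough picture of why this is hard, here is a toy PyTorch sketch (all shapes and encoder choices are made up): word IDs, pixel grids, and waveforms arrive in completely different forms, and each needs its own encoder before anything can be combined.

```python
# A toy sketch (PyTorch) of the integration problem, with made-up sizes:
# each modality arrives in a very different raw form, so each needs its own
# encoder before anything can be combined in a shared representation space.
import torch
import torch.nn as nn

SHARED_DIM = 128  # common representation size (illustrative)

text_encoder = nn.Sequential(nn.Embedding(10_000, 64), nn.Flatten(1), nn.LazyLinear(SHARED_DIM))
image_encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(SHARED_DIM))  # pixels -> vector
audio_encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(SHARED_DIM))  # waveform -> vector

# Three very different raw representations of the same event.
text = torch.randint(0, 10_000, (1, 20))  # 20 word IDs
image = torch.randn(1, 3, 64, 64)         # RGB pixels
audio = torch.randn(1, 16_000)            # one second of 16 kHz audio

# Only after each encoder maps its modality into the shared space
# can the pieces be compared and combined.
z = torch.cat([text_encoder(text), image_encoder(image), audio_encoder(audio)], dim=-1)
print(z.shape)  # torch.Size([1, 384])
```

Designing those per-modality encoders, and the layer that merges their outputs, is where much of the architectural complexity lives.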

Increased computational demands:
Processing multiple data types simultaneously requires more computational resources and often longer training times compared to unimodal models.

Alignment challenges:
Ensuring proper temporal or spatial alignment between different modalities can be difficult, especially with real-world data that may have inconsistencies or missing information.
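As a tiny, hypothetical illustration of the temporal side (the timestamps and the `nearest_frame` helper are invented): speech-recognition word timings and video frames run on different clocks, so each word has to be matched to a frame.

```python
# A toy sketch of the temporal-alignment problem, using invented timestamps:
# video frames and transcript words live on different clocks, so each word
# must be matched to the nearest frame. Real data is far messier than this.
import bisect

frame_times = [0.00, 0.04, 0.08, 0.12, 0.16, 0.20]  # 25 fps video, in seconds
words = [("patient", 0.03), ("reports", 0.11), ("wheezing", 0.19)]  # ASR word timings

def nearest_frame(t, times):
    """Return the index of the frame whose timestamp is closest to t."""
    i = bisect.bisect_left(times, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
    return min(candidates, key=lambda j: abs(times[j] - t))

for word, t in words:
    print(word, "-> frame", nearest_frame(t, frame_times))
# patient -> frame 1, reports -> frame 3, wheezing -> frame 5
```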

Variations in data quality:
Different modalities may have varying levels of noise, reliability, or completeness, making it challenging to balance their contributions.
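One simple way to picture the balancing act is a weighted combination, sketched below with invented numbers; real systems typically learn how much to trust each modality rather than hard-coding it.

```python
# A toy sketch of balancing modality contributions by reliability.
# The reliability weights here are invented; real systems typically learn them.
import numpy as np

# Pretend per-modality "evidence" for the same two-class prediction.
z_text = np.array([0.9, 0.1])
z_image = np.array([0.7, 0.3])
z_audio = np.array([0.4, 0.6])  # noisy recording, so less trustworthy

reliability = np.array([0.9, 0.8, 0.3])    # assumed, not learned
weights = reliability / reliability.sum()  # normalize to sum to 1

fused = weights[0] * z_text + weights[1] * z_image + weights[2] * z_audio
print(fused)  # the noisy audio now pulls the combined result around far less
```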

Fusion complexity:
Determining the best way to fuse information from multiple modalities is difficult, and different fusion strategies can significantly impact model performance.
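Two of the most common strategies, sketched below in toy PyTorch form (sizes invented), are early fusion (combine the features, then run one joint model) and late fusion (run one model per modality, then merge the predictions); which works better is highly problem-dependent.

```python
# A toy sketch contrasting two common fusion strategies (all sizes invented).
import torch
import torch.nn as nn

text_feat, image_feat = torch.randn(1, 64), torch.randn(1, 64)

# Early fusion: concatenate the features, then one joint classifier sees everything.
early_model = nn.Linear(128, 2)
early_logits = early_model(torch.cat([text_feat, image_feat], dim=-1))

# Late fusion: a separate classifier per modality, then average the predictions.
text_model, image_model = nn.Linear(64, 2), nn.Linear(64, 2)
late_logits = (text_model(text_feat) + image_model(image_feat)) / 2

print(early_logits.shape, late_logits.shape)  # both (1, 2), yet they behave very differently
```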

Handling missing modalities:
Multimodal systems need to be robust to scenarios where one or more modalities may be missing during inference.

Example:

  • Complete data:

    • Text: Patient reports shortness of breath and coughing

    • Image: Chest X-ray showing cloudy areas in the lungs

    • Audio: Wheezing sounds in lung audio recording

    • Diagnosis: The system confidently diagnoses asthma

  • Missing modality scenario:

    • Text: Patient reports shortness of breath and coughing

    • Image: Chest X-ray showing cloudy areas in the lungs

    • Audio: Not available (perhaps the audio recording device was malfunctioning)

    • Challenge: The system must still provide a reliable diagnosis without the audio input
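Building on the diagnosis example above, here is one common coping strategy in toy form: stand in a learned placeholder vector for the absent modality so the model still gets a well-formed input. The `build_input` helper and every size below are invented for illustration.

```python
# A toy sketch, continuing the diagnosis example above: substitute a learned
# placeholder vector when a modality is absent so the model still receives a
# well-formed input. The build_input helper and all sizes are invented.
import torch
import torch.nn as nn

DIM = 32
missing_audio = nn.Parameter(torch.zeros(1, DIM))  # would be learned during training

def build_input(text_vec, image_vec, audio_vec=None):
    """Concatenate modality vectors, falling back to the placeholder if audio is absent."""
    audio_part = audio_vec if audio_vec is not None else missing_audio
    return torch.cat([text_vec, image_vec, audio_part], dim=-1)

text_vec, image_vec = torch.randn(1, DIM), torch.randn(1, DIM)

with_audio = build_input(text_vec, image_vec, torch.randn(1, DIM))
without_audio = build_input(text_vec, image_vec)  # the recording device malfunctioned
print(with_audio.shape, without_audio.shape)  # both torch.Size([1, 96])
```

Another common tactic is randomly dropping modalities during training so the model learns not to lean too hard on any single input.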

Larger datasets required:
To effectively learn cross-modal relationships, multimodal systems often require larger and more diverse datasets than unimodal approaches.

Thanks for reading. See you tomorrow.