Multimodal Models Explained: When AI Sees, Hears, and Reads

For most of AI’s history, models were specialists. Language models read text. Image classifiers looked at pictures. Speech recognizers transcribed audio. Each modality had its own architecture, its own training data, its own deployment.

That era is ending. Multimodal models process multiple types of input — text, images, audio, video — within a single unified system. GPT-4o can see a photo and describe it. Gemini can watch a video and answer questions about it. The boundaries between modalities are dissolving.


Why Multimodality Matters

The world isn’t text-only. Real-world tasks constantly combine modalities:

  • “What’s wrong with this error message in my screenshot?”
  • “Describe what’s happening in this chart”
  • “Transcribe this audio and summarize it”
  • “Does this code match the architecture diagram?”

Multimodal models handle these naturally. Their single-modality predecessors required awkward pipelines: OCR the image, pipe to a language model, hope nothing got lost in translation.


The Architecture Challenge

The core problem: text, images, and audio are fundamentally different data formats.

  • Text: Discrete tokens from a vocabulary
  • Images: Continuous pixel grids (typically 224×224 to 1024×1024)
  • Audio: Continuous waveforms sampled at 16-44 kHz

To feed all of these into a transformer, you need to convert them into the same representational space — a sequence of vectors the transformer can process uniformly.


Vision: From Pixels to Tokens

ViT (Vision Transformer) Approach

The most common method splits an image into fixed-size patches (e.g., 16×16 pixels), linearly projects each patch into an embedding vector, and feeds the sequence to a transformer.

Image (224×224) → 196 patches (16×16) → 196 embeddings → Transformer

Each patch becomes analogous to a text token. The transformer attends across all patches plus any text tokens.
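The patch-and-project step can be sketched in a few lines of numpy. This is an illustrative toy, not any specific model's code: the projection weights are random stand-ins for learned parameters.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an HxWxC image into non-overlapping patch x patch squares,
    flattened into vectors (one row per patch)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
patches = patchify(img)           # (196, 768): 14x14 patches, each 16*16*3

# Linear projection into the model's embedding dimension.
# Random weights stand in for the learned projection here.
d_model = 512
W = rng.normal(size=(patches.shape[1], d_model)) * 0.02
embeddings = patches @ W          # (196, 512): one vector per patch
```

After this step the image is just a sequence of 196 vectors, interchangeable with text-token embeddings from the transformer's point of view.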

CLIP-Style Encoders

Many multimodal LLMs use a pre-trained vision encoder (like CLIP or SigLIP) to convert images into embeddings, then project those into the language model’s embedding space.

Image → CLIP Vision Encoder → Visual Embeddings
                                      ↓
                              Linear Projection
                                      ↓
Text tokens ────────────→ Combined Sequence → LLM

The projection layer is the bridge — it learns to map visual representations into a space the language model understands.
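A minimal numpy sketch of that bridge, with random matrices standing in for the frozen vision encoder's output and the learned projection (all dimensions here are illustrative, not any specific model's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 196 visual embeddings from a frozen vision encoder (dim 1024)
# and 12 text-token embeddings from the LLM's embedding table (dim 2048).
visual = rng.normal(size=(196, 1024))
text = rng.normal(size=(12, 2048))

# The learned projection (random here) maps vision space -> LLM space.
W_proj = rng.normal(size=(1024, 2048)) * 0.02
b_proj = np.zeros(2048)
visual_in_llm_space = visual @ W_proj + b_proj          # (196, 2048)

# The LLM then sees one combined sequence: image tokens, then text tokens.
combined = np.concatenate([visual_in_llm_space, text])  # (208, 2048)
```

In practice the projection may be a single linear layer or a small MLP; either way it is often the only part trained from scratch when gluing a pre-trained encoder onto a pre-trained LLM.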

High-Resolution Tricks

A 224×224 ViT loses detail for small text or fine-grained objects. Modern approaches slice images into multiple crops at different resolutions:

  • One low-res overview of the whole image
  • Several high-res crops of regions
  • All crops processed and concatenated

LLaVA 1.6, InternVL, and similar models use this approach to handle document images and screenshots effectively.
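The crop strategy itself is simple. Here is a dependency-free sketch (nearest-neighbour downsampling for the overview; real pipelines use proper resizing and model-specific grid layouts):

```python
import numpy as np

def multi_crop(image: np.ndarray, base: int = 224, grid: int = 2):
    """One downsampled overview of the whole image, plus a grid x grid
    set of full-resolution crops, all at base x base pixels."""
    h, w, _ = image.shape
    # Nearest-neighbour resize for the low-res overview.
    ys = np.arange(base) * h // base
    xs = np.arange(base) * w // base
    crops = [image[ys][:, xs]]
    # Full-resolution tiles covering the image.
    ch, cw = h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            crops.append(image[i*ch:(i+1)*ch, j*cw:(j+1)*cw])
    return crops

img = np.zeros((448, 448, 3))
crops = multi_crop(img)   # 1 overview + 4 high-res tiles, each 224x224
```

Each crop is then patchified and encoded separately, and the resulting token sequences are concatenated, trading more visual tokens for finer detail.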


Audio: Spectrograms and Speech Tokens

Audio adds another dimension. Two dominant approaches:

Spectrogram-Based (Whisper-style)

Convert raw audio to a mel spectrogram (a 2D time-frequency representation), then treat it like an image with a CNN or ViT encoder.

OpenAI’s Whisper works this way: it’s essentially an encoder-decoder transformer operating on log-mel spectrograms. GPT-4o’s audio understanding builds on this foundation.
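The waveform-to-spectrogram step can be sketched with a plain windowed FFT. This is a simplified stand-in for a real pipeline (a production front end like Whisper's also applies a mel filterbank and specific normalization, omitted here):

```python
import numpy as np

def log_spectrogram(wave: np.ndarray, n_fft: int = 400, hop: int = 160):
    """Log-magnitude spectrogram: slide a Hann window over the waveform,
    FFT each frame, and stack frames into a 2D time-frequency image."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        frame = wave[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    spec = np.stack(frames, axis=1)   # (n_fft//2 + 1, num_frames)
    return np.log(spec + 1e-10)

# 1 second of a 440 Hz tone at 16 kHz (a common speech sample rate).
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
```

The output is a 2D array, which is exactly why image-style encoders transfer so well to audio.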

Discrete Audio Tokens

More recent work converts audio to discrete tokens using vector quantization (similar to how images can be tokenized with VQ-VAE). This lets audio tokens sit directly in the same sequence as text tokens.

Audio waveform → EnCodec/SoundStream → Discrete tokens → LLM

This enables truly native audio generation — the model outputs audio tokens directly, which are then decoded back to waveforms.
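The core of vector quantization is a nearest-neighbour lookup against a learned codebook. A toy sketch, with a random codebook standing in for a trained one (real codecs like EnCodec also use residual quantization with several codebooks per frame):

```python
import numpy as np

rng = np.random.default_rng(0)

# A codebook of 1024 entries in a 64-dim latent space (random here;
# learned during codec training in practice).
codebook = rng.normal(size=(1024, 64))

def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each continuous latent frame to the index of its nearest
    codebook entry: that index is the discrete 'audio token'."""
    # Squared distances between every latent frame and every code.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# 50 encoder latent frames, e.g. one per 20 ms of audio.
latents = rng.normal(size=(50, 64))
tokens = quantize(latents, codebook)   # (50,) integer token IDs
```

Because the output is just a sequence of integers, these tokens can be interleaved with text tokens in one vocabulary, and the LLM can generate them the same way it generates words.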


The “Natively Multimodal” Question

There’s an important distinction between:

1. Bolted-on multimodality: A language model with adapters to process other modalities. Images are encoded separately and projected in, and the LLM processes the combined sequence. Clean and practical.

2. Natively multimodal: The model is trained from scratch on all modalities simultaneously. The same weights process text, images, and audio with no separation.

GPT-4o claims to be natively multimodal — trained end-to-end on all modalities. This theoretically enables better cross-modal reasoning (the model doesn’t need to “translate” between modalities internally) and allows audio output to come directly from the model rather than from a bolted-on TTS system.

Gemini 1.5 and 2.0 take a similar approach, pre-training on interleaved text-image-audio-video data from the start.


Video Understanding

Video is images over time — but you can’t just feed every frame (a 1-minute video at 30fps = 1800 frames).

Key strategies:

Frame sampling: Select representative frames (uniform, or keyframe detection). Process each as an image.

Temporal encoding: Add positional encodings that capture time, so the model understands frame ordering.

Video-specific models: Architectures like Video-LLaMA or InternVideo2 include temporal attention layers that let the model reason about motion and sequences.

The challenge: video is extremely token-hungry. Even at 1 frame/second with compressed visual tokens, a 10-minute video might require 10,000+ tokens — pushing context limits.
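A common way to manage that budget is uniform frame sampling with a hard cap. A sketch (the parameter names and defaults are illustrative, not from any particular model):

```python
import numpy as np

def sample_frame_indices(num_frames: int, fps: float,
                         target_fps: float = 1.0,
                         max_frames: int = 256) -> np.ndarray:
    """Uniformly subsample frame indices: first step down to roughly
    target_fps, then cap the total so long videos fit a token budget."""
    step = max(int(round(fps / target_fps)), 1)
    idx = np.arange(0, num_frames, step)
    if len(idx) > max_frames:
        # Spread the cap evenly across the whole clip.
        idx = idx[np.linspace(0, len(idx) - 1, max_frames).astype(int)]
    return idx

# A 10-minute clip at 30 fps: 18,000 frames, sampled down to 256.
idx = sample_frame_indices(num_frames=18000, fps=30)
```

Each selected frame is then encoded like a still image, with temporal position encodings added so the model knows the ordering.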


Cross-Modal Attention

Once all modalities are in the same embedding space, the standard transformer attention mechanism handles the rest. Text tokens can attend to image patches, audio frames can attend to text context — it all flows through the same attention matrix.

This is why the transformer was such a powerful unifying architecture. Given embeddings, it doesn’t care where they came from.
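That indifference is easy to see in code. A minimal single-head attention sketch over a mixed-modality sequence, using random embeddings as stand-ins for real encoder outputs:

```python
import numpy as np

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention. Note that nothing here asks which
    modality a position came from."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
d = 64
image_tokens = rng.normal(size=(196, d))   # patch embeddings
audio_tokens = rng.normal(size=(98, d))    # spectrogram-frame embeddings
text_tokens = rng.normal(size=(12, d))     # word embeddings

# One flat sequence: every token attends to every other, across modalities.
x = np.concatenate([image_tokens, audio_tokens, text_tokens])
out = attention(x, x, x)   # (306, 64)
```

The attention matrix mixes all 306 positions uniformly; modality boundaries exist only in how the inputs were produced, not in the mechanism itself.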


Real-World Multimodal Models (2024-2025)

Model               Modalities                  Architecture
GPT-4o              Text, Image, Audio          Native multimodal
Gemini 2.0 Flash    Text, Image, Audio, Video   Native multimodal
Claude 3.5 Sonnet   Text, Image                 Vision encoder + LLM
LLaVA 1.6           Text, Image                 CLIP + Vicuna
Qwen-VL             Text, Image                 ViT + Qwen LLM
Phi-3.5 Vision      Text, Image                 CLIP + Phi

What Multimodal Models Still Struggle With

Spatial reasoning: Counting objects, understanding left/right relationships, precise localization.

Fine-grained detail: Small text, low-resolution portions of images, subtle visual differences.

Long video: Maintaining coherent understanding across many minutes of footage.

Audio-visual sync: Understanding that the person speaking in the video is the same as the face on screen.

Hallucination: Multimodal models can hallucinate image content just as readily as text content — sometimes more so, because evaluation is harder.


The Direction of Travel

The trajectory is clear: modalities will continue to collapse into unified models. The question isn’t whether future AI will be multimodal — it’s how seamlessly the boundaries will dissolve.

Native multimodal training, more efficient visual tokenization, and better temporal understanding for video are the active research frontiers. By the time you read this, the state of the art will probably have moved again.

That’s multimodal AI: a field that’s perpetually mid-sentence.