Multimodal Models Explained: When AI Sees, Hears, and Reads

For most of AI’s history, models were specialists. Language models read text. Image classifiers looked at pictures. Speech recognizers transcribed audio. Each modality had its own architecture, its own training data, its own deployment.

That era is ending. Multimodal models process multiple types of input — text, images, audio, video — within a single unified system. GPT-4o can see a photo and describe it. Gemini can watch a video and answer questions about it. The boundaries between modalities are dissolving.


Why Multimodality Matters

The world isn’t text-only. Real-world tasks constantly combine modalities:

  • “What’s wrong with this error message in my screenshot?”
  • “Describe what’s happening in this chart”
  • “Transcribe this audio and summarize it”
  • “Does this code match the architecture diagram?”

Multimodal models handle these naturally. Their single-modality predecessors required awkward pipelines: OCR the image, pipe to a language model, hope nothing got lost in translation.


The Architecture Challenge

The core problem: text, images, and audio are fundamentally different data formats.

  • Text: Discrete tokens from a vocabulary
  • Images: Continuous pixel grids (typically 224×224 to 1024×1024)
  • Audio: Continuous waveforms sampled at 16-44 kHz

To feed all of these into a transformer, you need to convert them into the same representational space — a sequence of vectors the transformer can process uniformly.


Vision: From Pixels to Tokens

ViT (Vision Transformer) Approach

The most common method splits an image into fixed-size patches (e.g., 16×16 pixels), linearly projects each patch into an embedding vector, and feeds the sequence to a transformer.

Image (224×224) → 196 patches (16×16) → 196 embeddings → Transformer

Each patch becomes analogous to a text token. The transformer attends across all patches plus any text tokens.
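The patch-and-project step can be sketched in a few lines of numpy. This is an illustrative toy, not any specific model's code: the projection weights are random stand-ins for learned parameters.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an HxWxC image into non-overlapping patch x patch squares,
    flattened into vectors (one row per patch)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
patches = patchify(img)           # (196, 768): 14x14 patches, each 16*16*3

# Linear projection into the model's embedding dimension.
# Random weights stand in for the learned projection here.
d_model = 512
W = rng.normal(size=(patches.shape[1], d_model)) * 0.02
embeddings = patches @ W          # (196, 512): one vector per patch
```

After this step the image is just a sequence of 196 vectors, interchangeable with text-token embeddings from the transformer's point of view.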

CLIP-Style Encoders

Many multimodal LLMs use a pre-trained vision encoder (like CLIP or SigLIP) to convert images into embeddings, then project those into the language model’s embedding space.

Image → CLIP Vision Encoder → Visual Embeddings
                                      ↓
                              Linear Projection
                                      ↓
Text tokens ────────────→ Combined Sequence → LLM

The projection layer is the bridge — it learns to map visual representations into a space the language model understands.
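A minimal numpy sketch of that bridge, with random matrices standing in for the frozen vision encoder's output and the learned projection (all dimensions here are illustrative, not any specific model's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 196 visual embeddings from a frozen vision encoder (dim 1024)
# and 12 text-token embeddings from the LLM's embedding table (dim 2048).
visual = rng.normal(size=(196, 1024))
text = rng.normal(size=(12, 2048))

# The learned projection (random here) maps vision space -> LLM space.
W_proj = rng.normal(size=(1024, 2048)) * 0.02
b_proj = np.zeros(2048)
visual_in_llm_space = visual @ W_proj + b_proj          # (196, 2048)

# The LLM then sees one combined sequence: image tokens, then text tokens.
combined = np.concatenate([visual_in_llm_space, text])  # (208, 2048)
```

In practice the projection may be a single linear layer or a small MLP; either way it is often the only part trained from scratch when gluing a pre-trained encoder onto a pre-trained LLM.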

High-Resolution Tricks

A 224×224 ViT loses detail for small text or fine-grained objects. Modern approaches slice images into multiple crops at different resolutions:

  • One low-res overview of the whole image
  • Several high-res crops of regions
  • All crops processed and concatenated

LLaVA 1.6, InternVL, and similar models use this approach to handle document images and screenshots effectively.
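The crop strategy itself is simple. Here is a dependency-free sketch (nearest-neighbour downsampling for the overview; real pipelines use proper resizing and model-specific grid layouts):

```python
import numpy as np

def multi_crop(image: np.ndarray, base: int = 224, grid: int = 2):
    """One downsampled overview of the whole image, plus a grid x grid
    set of full-resolution crops, all at base x base pixels."""
    h, w, _ = image.shape
    # Nearest-neighbour resize for the low-res overview.
    ys = np.arange(base) * h // base
    xs = np.arange(base) * w // base
    crops = [image[ys][:, xs]]
    # Full-resolution tiles covering the image.
    ch, cw = h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            crops.append(image[i*ch:(i+1)*ch, j*cw:(j+1)*cw])
    return crops

img = np.zeros((448, 448, 3))
crops = multi_crop(img)   # 1 overview + 4 high-res tiles, each 224x224
```

Each crop is then patchified and encoded separately, and the resulting token sequences are concatenated, trading more visual tokens for finer detail.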


Audio: Spectrograms and Speech Tokens

Audio adds another dimension. Two dominant approaches:

Spectrogram-Based (Whisper-style)

Convert raw audio to a mel spectrogram (a 2D time-frequency representation), then treat it like an image with a CNN or ViT encoder.

OpenAI’s Whisper works this way: it’s essentially an encoder-decoder transformer operating on log-mel spectrograms. GPT-4o’s audio understanding builds on this foundation.
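The waveform-to-spectrogram step can be sketched with a plain windowed FFT. This is a simplified stand-in for a real pipeline (a production front end like Whisper's also applies a mel filterbank and specific normalization, omitted here):

```python
import numpy as np

def log_spectrogram(wave: np.ndarray, n_fft: int = 400, hop: int = 160):
    """Log-magnitude spectrogram: slide a Hann window over the waveform,
    FFT each frame, and stack frames into a 2D time-frequency image."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        frame = wave[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    spec = np.stack(frames, axis=1)   # (n_fft//2 + 1, num_frames)
    return np.log(spec + 1e-10)

# 1 second of a 440 Hz tone at 16 kHz (a common speech sample rate).
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
```

The output is a 2D array, which is exactly why image-style encoders transfer so well to audio.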

Discrete Audio Tokens

More recent work converts audio to discrete tokens using vector quantization (similar to how images can be tokenized with VQ-VAE). This lets audio tokens sit directly in the same sequence as text tokens.

Audio waveform → EnCodec/SoundStream → Discrete tokens → LLM

This enables truly native audio generation — the model outputs audio tokens directly, which are then decoded back to waveforms.
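The core of vector quantization is a nearest-neighbour lookup against a learned codebook. A toy sketch, with a random codebook standing in for a trained one (real codecs like EnCodec also use residual quantization with several codebooks per frame):

```python
import numpy as np

rng = np.random.default_rng(0)

# A codebook of 1024 entries in a 64-dim latent space (random here;
# learned during codec training in practice).
codebook = rng.normal(size=(1024, 64))

def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each continuous latent frame to the index of its nearest
    codebook entry: that index is the discrete 'audio token'."""
    # Squared distances between every latent frame and every code.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# 50 encoder latent frames, e.g. one per 20 ms of audio.
latents = rng.normal(size=(50, 64))
tokens = quantize(latents, codebook)   # (50,) integer token IDs
```

Because the output is just a sequence of integers, these tokens can be interleaved with text tokens in one vocabulary, and the LLM can generate them the same way it generates words.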


The “Natively Multimodal” Question

There’s an important distinction between:

1. Bolted-on multimodality: A language model with adapters to process other modalities. Images are encoded separately and projected in, and the LLM processes the combined sequence. Clean and practical.

2. Natively multimodal: The model is trained from scratch on all modalities simultaneously. The same weights process text, images, and audio with no separation.

GPT-4o claims to be natively multimodal — trained end-to-end on all modalities. This theoretically enables better cross-modal reasoning (the model doesn’t need to “translate” between modalities internally) and allows audio output to come directly from the model rather than from a bolted-on TTS system.

Gemini 1.5 and 2.0 take a similar approach, pre-training on interleaved text-image-audio-video data from the start.


Video Understanding

Video is images over time — but you can’t just feed every frame (a 1-minute video at 30fps = 1800 frames).

Key strategies:

Frame sampling: Select representative frames (uniform, or keyframe detection). Process each as an image.

Temporal encoding: Add positional encodings that capture time, so the model understands frame ordering.

Video-specific models: Architectures like Video-LLaMA or InternVideo2 include temporal attention layers that let the model reason about motion and sequences.

The challenge: video is extremely token-hungry. Even at 1 frame/second with compressed visual tokens, a 10-minute video might require 10,000+ tokens — pushing context limits.
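A common way to manage that budget is uniform frame sampling with a hard cap. A sketch (the parameter names and defaults are illustrative, not from any particular model):

```python
import numpy as np

def sample_frame_indices(num_frames: int, fps: float,
                         target_fps: float = 1.0,
                         max_frames: int = 256) -> np.ndarray:
    """Uniformly subsample frame indices: first step down to roughly
    target_fps, then cap the total so long videos fit a token budget."""
    step = max(int(round(fps / target_fps)), 1)
    idx = np.arange(0, num_frames, step)
    if len(idx) > max_frames:
        # Spread the cap evenly across the whole clip.
        idx = idx[np.linspace(0, len(idx) - 1, max_frames).astype(int)]
    return idx

# A 10-minute clip at 30 fps: 18,000 frames, sampled down to 256.
idx = sample_frame_indices(num_frames=18000, fps=30)
```

Each selected frame is then encoded like a still image, with temporal position encodings added so the model knows the ordering.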


Cross-Modal Attention

Once all modalities are in the same embedding space, the standard transformer attention mechanism handles the rest. Text tokens can attend to image patches, audio frames can attend to text context — it all flows through the same attention matrix.

This is why the transformer was such a powerful unifying architecture. Given embeddings, it doesn’t care where they came from.
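That indifference is easy to see in code. A minimal single-head attention sketch over a mixed-modality sequence, using random embeddings as stand-ins for real encoder outputs:

```python
import numpy as np

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention. Note that nothing here asks which
    modality a position came from."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
d = 64
image_tokens = rng.normal(size=(196, d))   # patch embeddings
audio_tokens = rng.normal(size=(98, d))    # spectrogram-frame embeddings
text_tokens = rng.normal(size=(12, d))     # word embeddings

# One flat sequence: every token attends to every other, across modalities.
x = np.concatenate([image_tokens, audio_tokens, text_tokens])
out = attention(x, x, x)   # (306, 64)
```

The attention matrix mixes all 306 positions uniformly; modality boundaries exist only in how the inputs were produced, not in the mechanism itself.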


Real-World Multimodal Models (2024-2025)

Model               Modalities                  Architecture
GPT-4o              Text, Image, Audio          Native multimodal
Gemini 2.0 Flash    Text, Image, Audio, Video   Native multimodal
Claude 3.5 Sonnet   Text, Image                 Vision encoder + LLM
LLaVA 1.6           Text, Image                 CLIP + Vicuna
Qwen-VL             Text, Image                 ViT + Qwen LLM
Phi-3.5 Vision      Text, Image                 CLIP + Phi

What Multimodal Models Still Struggle With

Spatial reasoning: Counting objects, understanding left/right relationships, precise localization.

Fine-grained detail: Small text, low-resolution portions of images, subtle visual differences.

Long video: Maintaining coherent understanding across many minutes of footage.

Audio-visual sync: Understanding that the person speaking in the video is the same as the face on screen.

Hallucination: Multimodal models can hallucinate image content just as readily as text content — sometimes more so, because evaluation is harder.


The Direction of Travel

The trajectory is clear: modalities will continue to collapse into unified models. The question isn’t whether future AI will be multimodal — it’s how seamlessly the boundaries will dissolve.

Native multimodal training, more efficient visual tokenization, and better temporal understanding for video are the active research frontiers. By the time you read this, the state of the art will probably have moved again.

That’s multimodal AI: a field that’s perpetually mid-sentence.