AI Inference Optimization: Making Models Fast and Cheap

Training a model is expensive but happens once. Inference runs every time someone asks a question — at scale, it’s the dominant cost. A 70B parameter model in fp16 needs ~140GB of GPU VRAM just to load. Getting it to respond in under a second, at 100 concurrent requests, is an engineering challenge.

Here’s how the industry does it.

The Inference Bottleneck

LLM inference is memory-bandwidth bound, not compute bound. The GPU spends more time loading weights from HBM (High Bandwidth Memory) than actually multiplying numbers. This is the opposite of training, which is compute bound.

Implication: techniques that reduce memory footprint and bandwidth pressure have outsized impact.

Quantization

Quantization reduces the numerical precision of weights (and sometimes activations) from 32-bit or 16-bit floats to lower-bit integers.

Format           Bits   Memory (70B model)   Quality loss
fp32             32     ~280 GB              Baseline
fp16 / bf16      16     ~140 GB              Minimal
int8             8      ~70 GB               Small
int4 (GPTQ/AWQ)  4      ~35 GB               Moderate
int2/3           2-3    ~17-26 GB            Significant

Post-Training Quantization (PTQ): Quantize a trained model without retraining. Fast, but some quality loss.

Quantization-Aware Training (QAT): Train with simulated quantization. Better quality, more compute.

GPTQ and AWQ are the dominant int4 methods for LLMs. AWQ specifically preserves the “salient” weights (those with large activations) in higher precision, giving better quality than naive int4.
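A minimal PTQ sketch in NumPy, showing the core round-to-a-grid idea with symmetric per-tensor int8 (real methods like GPTQ and AWQ quantize per-group and use calibration data; the matrix shape and seed here are arbitrary):

```python
import numpy as np

# Symmetric per-tensor int8 post-training quantization:
# map the largest weight magnitude to 127, round everything to that grid.

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # fake layer
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).mean()
print(f"mean abs error: {err:.6f} at 4x less memory than fp32")
```

The int8 tensor is a quarter the size of the fp32 original, and the mean rounding error is a small fraction of the weights' standard deviation, which is why int8 is "nearly free" in practice.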

# Using llama.cpp with Q4_K_M quantization
./llama-cli -m Llama-3.1-70B.Q4_K_M.gguf -p "Hello!" -n 100

KV Cache

During autoregressive generation, the model recomputes the same Key and Value vectors for previous tokens at every new token. This is wasteful.

The KV cache stores K and V tensors from previous generation steps, so only the new token’s KV vectors need to be computed. This dramatically reduces compute per token — at the cost of memory.

KV cache memory scales with:

  • Sequence length
  • Number of layers
  • Number of heads
  • Batch size

For long contexts (128K+ tokens) with large batches, KV cache can easily exceed the weight memory. This is why KV cache management is a major focus of inference frameworks.
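These factors multiply directly. A sketch of the arithmetic, assuming a Llama-3.1-70B-like shape (80 layers, 8 KV heads under GQA, head dim 128) and an fp16 cache:

```python
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim
#                  x seq_len x batch x bytes per element.

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_el=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

# 128K context, batch of 8, fp16 cache:
print(kv_cache_gb(80, 8, 128, seq_len=128_000, batch=8))
# → 335.54432 GB — more than twice the ~140 GB of fp16 weights
```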

KV Cache Optimizations

Multi-Query Attention (MQA): Share a single K and V head across all Q heads. Reduces the KV cache by a factor of n_heads.

Grouped-Query Attention (GQA): Share K/V among groups of Q heads. A compromise between full MHA and MQA. Used in Llama 3, Mistral.
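The cache reduction relative to full multi-head attention is simply n_heads / n_kv_heads. Using Llama-3.1-70B's shape (64 query heads, 8 KV heads) as the example:

```python
# KV cache reduction vs. full MHA: MHA caches one K/V pair per query head,
# GQA caches n_kv_heads pairs, MQA caches exactly one.
n_heads = 64        # Llama-3.1-70B query heads
n_kv_heads = 8      # Llama-3.1-70B KV heads (GQA)

print(n_heads // n_kv_heads)  # GQA: 8x smaller cache than MHA
print(n_heads // 1)           # MQA would be 64x smaller, at more quality cost
```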

PagedAttention (vLLM): Manage KV cache like virtual memory — allocate pages on demand, avoid fragmentation. Enables much higher throughput for variable-length batches.

Speculative Decoding

Standard decoding generates one token per forward pass. For a 70B model, this is slow.

Speculative decoding uses a small “draft” model (e.g., 7B) to generate K candidate tokens cheaply, then verifies all K tokens with the large model in one pass. If the large model agrees, you get K tokens for the price of ~1 large pass.

Speed gains: 2-4x for tasks where the draft model is frequently right (structured output, repetitive text, code).

Implementations: assisted generation (the `assistant_model` argument to `generate()`) in Hugging Face Transformers; supported natively in TGI and vLLM.
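A toy simulation of the accept arithmetic (greedy variant): the draft and target models are replaced by a single assumed per-token agreement probability p, which is a stand-in, not a real model pair.

```python
import random

# Greedy speculative decoding: the draft proposes K tokens; the target
# verifies them in one pass, keeps the longest agreeing prefix, and emits
# one token of its own (either the correction or the next token).

def tokens_per_target_pass(k: int, p: float, trials: int = 100_000,
                           seed: int = 0) -> float:
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        accepted = 0
        while accepted < k and rng.random() < p:   # prefix of agreements
            accepted += 1
        total += accepted + 1                      # +1: target's own token
    return total / trials

# Draft agrees 80% of the time, drafting 4 tokens per pass:
print(round(tokens_per_target_pass(k=4, p=0.8), 2))
```

With p = 0.8 and K = 4 this averages ~3.4 tokens per expensive pass, in line with the 2-4x speedups quoted above when the draft is frequently right.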

Continuous Batching

Naive batching waits for a full batch before starting inference, and waits for the slowest sequence to finish. Both waste time.

Continuous batching (also called iteration-level scheduling) inserts new requests mid-generation and removes completed sequences immediately. This maximizes GPU utilization at the cost of implementation complexity.

vLLM, TGI, and TensorRT-LLM all implement continuous batching. It’s now baseline for production inference.
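A toy scheduler makes the difference concrete. Requests here are just output-token counts; `continuous_steps` admits queued requests whenever a slot frees up mid-generation, while `static_steps` makes every batch wait for its longest sequence:

```python
from collections import deque

def continuous_steps(lengths, max_batch):
    """Iteration-level scheduling: refill free slots every decode step."""
    queue = deque(lengths)
    active = []                                  # remaining tokens per sequence
    steps = 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())       # admit mid-generation
        active = [n - 1 for n in active if n > 1]  # decode; drop finished
        steps += 1
    return steps

def static_steps(lengths, max_batch):
    """Naive batching: each batch runs until its slowest sequence finishes."""
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))

lengths = [100, 10, 10, 10, 100, 10, 10, 10]     # mixed long/short requests
print(continuous_steps(lengths, max_batch=4))    # → 110 decode steps
print(static_steps(lengths, max_batch=4))        # → 200 decode steps
```

Short requests no longer wait behind long ones, which is where the throughput gain comes from.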

Flash Attention

Standard attention computes the full n×n attention matrix, which:

  • Is O(n²) in memory
  • Requires multiple read/write passes to HBM

Flash Attention fuses the attention computation into a single kernel using tiling, keeping intermediate values in fast SRAM (on-chip cache) instead of HBM. Result:

  • Same mathematical output
  • 2-4x faster in practice
  • O(n) memory (doesn’t materialize the full matrix)

Flash Attention 2 and 3 improved parallelism and GPU utilization further. It's now the default in major frameworks.
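The tiling trick can be sketched in NumPy: an online softmax keeps a running max and denominator per query, so each key/value tile is visited once and the result matches naive attention exactly. This shows only the math; the real speedup comes from fusing it into one SRAM-resident kernel.

```python
import numpy as np

def naive_attention(q, k, v):
    """Materializes the full n x n score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=16):
    """One K/V tile at a time, with an online (streaming) softmax."""
    n, d = q.shape
    out = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)     # running max of scores per query
    l = np.zeros((n, 1))             # running softmax denominator
    for j in range(0, k.shape[0], block):
        s = q @ k[j:j + block].T / np.sqrt(d)       # scores for this tile
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)                   # rescale old running stats
        p = np.exp(s - m_new)
        l = l * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v[j:j + block]
        m = m_new
    return out / l

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 32)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))  # True
```

Note that `tiled_attention` never holds more than one `n × block` score tile, which is the O(n) memory claim above.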

Tensor Parallelism

For multi-GPU inference, tensor parallelism splits weight matrices across GPUs. Each GPU computes a partition of the matrix multiply, then communicates partial results via NVLink or InfiniBand.
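A NumPy sketch of the column-parallel case: each "GPU" is just a weight shard here, and the final concatenate stands in for the all-gather a real system would perform over NVLink.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))           # activations (batch, d_in), replicated
w = rng.normal(size=(512, 2048))        # full weight matrix (d_in, d_out)

n_gpus = 4
shards = np.split(w, n_gpus, axis=1)    # each GPU holds d_out / n_gpus columns
partials = [x @ shard for shard in shards]  # computed independently per GPU
y = np.concatenate(partials, axis=1)    # "all-gather" of the partial outputs

print(np.allclose(y, x @ w))            # True: identical to the full matmul
```

Each device stores and streams only 1/n_gpus of the weights, which is what makes models larger than one card's VRAM servable at all.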

Common configs (fp16 weights alone — KV cache needs additional headroom):

  • Single A100 80GB: up to ~40B model
  • 2× H100 (160 GB): up to ~70B
  • 8× H100 (640 GB): 405B possible with fp8/int8 quantization

Tools: Megatron-LM, vLLM, TGI, TensorRT-LLM all support tensor parallelism.

Inference Frameworks at a Glance

Framework      Best for                    Key feature
vLLM           OpenAI-compatible serving   PagedAttention, continuous batching
TGI            HuggingFace ecosystem       Wide model support
llama.cpp      CPU/local inference         GGUF quantization, no GPU required
TensorRT-LLM   NVIDIA production           Maximum throughput on H100s
Ollama         Developer local use         Simple CLI/API

Practical Recommendations

  • Start with quantization — int8 is nearly free in quality, halves memory. GPTQ/AWQ int4 is usable for most tasks.
  • Use Flash Attention — it’s already enabled in most frameworks, but verify.
  • Measure, don’t assume — tokens/second per dollar varies wildly by model, hardware, and batch size.
  • KV cache is often the bottleneck at scale — budget memory accordingly.
  • Speculative decoding pays off for greedy/structured generation — test it for your workload.

Inference is where AI meets the real world. The teams that optimize it well can serve 10x more users at the same cost — or build products their competitors simply can’t afford to.