Groq LPU: The Inference Speed Revolution
Why Groq's LPU can run LLM inference roughly 10x faster than NVIDIA GPUs. A look at deterministic computing and the end of memory bottlenecks.
If you’ve used an LLM recently, you know the feeling: you type a prompt, and then watch the words stream in… slowly. Maybe 20 tokens per second. It feels like reading a telegraph.
Then Groq (not to be confused with Elon Musk’s Grok) launched their public demo, and the AI world gasped. It was generating roughly 500 tokens per second. It wasn’t streaming; the text just appeared.
How? They didn’t use GPUs. They used an LPU (Language Processing Unit).
The Problem with GPUs
GPUs were designed to render pixels on a screen. They excel at parallel processing (doing 10,000 things at once). This is great for Training, where you process massive batches of data simultaneously.
But Inference (generating text) is sequential.
- Predict Word 1.
- Use Word 1 to Predict Word 2.
- Use Word 1+2 to Predict Word 3.
You can’t parallelize across these steps; each token has to wait for the one before it. Crucially, GPUs keep model weights in off-chip HBM (High Bandwidth Memory). Generating each token means streaming those weights from HBM into the compute cores, computing, and writing results back. That round trip, repeated for every single token, is the Memory Wall: inference speed is bounded by memory bandwidth, not raw compute.
The Groq Solution: Determinism
Groq’s LPU architecture is radically different.
1. No External Memory (HBM)
The LPU doesn’t have slow off-chip memory. Instead, each chip carries about 230 MB of ultra-fast SRAM (Static RAM) directly on the die, which is an enormous amount for SRAM but tiny next to HBM. Because the model weights live inside the processor, fetching them is orders of magnitude faster than an off-chip trip.
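A back-of-the-envelope bound shows why this matters. In single-stream decoding, each token must stream all model weights through memory, so memory bandwidth divided by model size caps tokens per second. The numbers below are illustrative assumptions (roughly H100-class HBM bandwidth, fp16 weights):

```python
# Memory-bandwidth ceiling on single-stream decoding:
# tokens/sec <= memory_bandwidth / bytes_of_weights (batch size 1).

def max_tokens_per_sec(mem_bw_bytes_per_s, weight_bytes):
    return mem_bw_bytes_per_s / weight_bytes

weights_fp16 = 70e9 * 2   # 70B parameters at 2 bytes each = 140 GB
hbm_bandwidth = 3.35e12   # assumed ~3.35 TB/s (H100-class HBM)

print(f"{max_tokens_per_sec(hbm_bandwidth, weights_fp16):.0f} tokens/sec")  # ~24
```

Keeping the weights in on-chip SRAM removes that per-token off-chip round trip, which is the core of Groq's speed claim.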
2. Deterministic Execution
On a GPU, hardware schedulers decide dynamically which core does what. This introduces “jitter” and unpredictability. The Groq compiler knows exactly—down to the nanosecond—what every part of the chip will be doing at every moment. There is no scheduler. It’s like a choreographed dance.
This determinism extends across chips: because the compiler also schedules chip-to-chip communication, hundreds of LPUs can be networked together without the usual synchronization and arbitration overhead.
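A toy analogy for compiler-scheduled execution (this is not Groq's actual instruction set, just an illustration): the compiler emits a fixed cycle-by-cycle plan, and the hardware simply replays it, with no runtime scheduler making decisions:

```python
# A static schedule: every (cycle, unit, operation) is fixed at compile
# time, so execution is perfectly predictable: no jitter, no arbitration.
schedule = [
    (0, "mem",    "read w0"),
    (1, "matmul", "w0 @ x"),
    (2, "mem",    "read w1"),
    (3, "matmul", "w1 @ h"),
]

def run(schedule):
    trace = []
    for cycle, unit, op in schedule:  # the hardware just replays the plan
        trace.append(f"cycle {cycle}: {unit} {op}")
    return trace

for line in run(schedule):
    print(line)
```

Because transfers between chips are planned the same way, data arrives exactly when the receiving chip expects it.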
The Trade-Offs
If LPUs are so fast, why does NVIDIA still exist?
1. Capacity
One Groq chip has only ~230 MB of memory. A Llama-3-70B model requires roughly 40 GB to 140 GB of weights (depending on quantization). To run one instance, Groq has to shard the model across hundreds of chips (Groq's widely cited 70B deployment used 576 chips). An H100 system can run the same model on just 2-4 cards.
This means Groq is hardware-intensive. You need a warehouse of chips to serve models.
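The chip count follows from simple division. The sketch below is only a lower bound for holding fp16 weights; it ignores activations, KV cache, and replication, and the 576-chip figure cited above reflects a particular deployment and precision:

```python
import math

SRAM_PER_CHIP = 230e6      # ~230 MB of on-chip SRAM per LPU
weights_fp16 = 70e9 * 2    # 70B parameters at 2 bytes each = 140 GB

chips = math.ceil(weights_fp16 / SRAM_PER_CHIP)
print(chips)  # -> 609 chips just to hold the fp16 weights
```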
2. Inference Only
LPUs are not designed for training. You still need NVIDIA GPUs to create the model. Groq is purely for running it.
Why Speed Matters
Is 500 tokens/sec just a gimmick? No. Speed changes how we use AI.
- Real-time Voice: With standard latency, talking to AI feels like a walkie-talkie conversation (lag). With Groq, it feels like a phone call. The AI can interrupt you or respond instantly.
- Agentic Workflows: If an AI agent needs to “think” 50 times to solve a coding bug, waiting 10 seconds per step adds up to more than eight minutes. With Groq, those 50 steps happen in seconds.
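The arithmetic behind the agent example, with illustrative per-step latencies (assumed values, not measured benchmarks):

```python
def total_wait(steps, seconds_per_step):
    # Sequential agent steps cannot overlap, so latency adds up linearly.
    return steps * seconds_per_step

print(total_wait(50, 10.0))  # slow backend: 500.0 s, over eight minutes
print(total_wait(50, 0.5))   # fast backend: 25.0 s
```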
Conclusion
Groq proved that general-purpose hardware (GPUs) isn’t the endgame. As AI models stabilize, we will see more ASICs (Application-Specific Integrated Circuits) like the LPU designed to run specific architectures at blinding speeds.
NVIDIA is the general contractor; Groq is the Formula 1 specialist.