Temperature, Top-K, Top-P: Controlling AI Creativity

When an LLM generates text, it doesn’t just pick “the best” word every time. If it did, it would be boring and repetitive. Instead, it predicts a probability distribution for the next token and then samples from it.

You can control this sampling process with three main dials: Temperature, Top-K, and Top-P.

How the Model Predicts

Imagine the model has finished the sentence: “The cat sat on the…”

It assigns probabilities to the next possible token:

  1. mat (50%)
  2. floor (20%)
  3. couch (10%)
  4. bed (10%)
  5. universe (0.001%)
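This prediction-then-sampling step can be sketched in a few lines of Python. The distribution below is the illustrative one above, not real model output:

```python
import random

# Illustrative next-token distribution for "The cat sat on the..."
next_token_probs = {"mat": 0.5, "floor": 0.2, "couch": 0.1, "bed": 0.1, "universe": 0.00001}

# The model samples in proportion to probability rather than always taking the top token
token = random.choices(list(next_token_probs), weights=list(next_token_probs.values()))[0]
print(token)  # usually "mat", sometimes "floor", "couch", or "bed"
```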

1. Temperature (The Chaos Dial)

Temperature controls the randomness of the selection.

  • Low Temperature (0.0 - 0.3): The model becomes conservative. It almost always picks the most likely token (mat).
    • Use case: Coding, math, factual retrieval where accuracy matters.
  • High Temperature (0.7 - 1.5): The model flattens the probability curve. floor and couch get a higher chance of being picked. universe becomes possible.
    • Use case: Creative writing, brainstorming, poetry.

Math Note: Temperature divides the “logits” (the raw scores) before the softmax function: $ \text{softmax}(x_i / T) = \frac{e^{x_i / T}}{\sum_j e^{x_j / T}} $. As $T \to 0$, the highest logit dominates (its probability approaches 100%). As $T \to \infty$, all probabilities become equal.
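Here is a minimal sketch of temperature scaling. The logits are made-up illustrative values, not real model output:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide each logit by T, then apply softmax."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -5.0]  # e.g. mat, floor, couch, universe (illustrative)
print(softmax_with_temperature(logits, 0.1))   # near one-hot: "mat" dominates
print(softmax_with_temperature(logits, 10.0))  # much flatter: everything is in play
```

Note that $T$ never changes the *ranking* of tokens, only how peaked or flat the distribution is around that ranking.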

2. Top-K Sampling (The VIP List)

Top-K truncates the list of possibilities. It says: “Only consider the top K most likely words. Zero out the rest.”

  • If Top-K = 3: The model considers only [mat, floor, couch].
  • It renormalizes their probabilities to sum to 100% and samples one.
  • It completely ignores universe (the weird long-tail words).

This prevents the model from going completely off the rails by picking a nonsense word that had a 0.001% chance.
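A sketch of Top-K filtering over the illustrative distribution from earlier (in practice this runs over the whole vocabulary, tens of thousands of tokens):

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

probs = {"mat": 0.5, "floor": 0.2, "couch": 0.1, "bed": 0.1, "universe": 0.00001}
print(top_k_filter(probs, 3))
# {'mat': 0.625, 'floor': 0.25, 'couch': 0.125} -- "universe" is gone entirely
```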

3. Top-P Sampling (Nucleus Sampling)

Top-P is more adaptive than Top-K. Instead of a fixed number of words (K), it keeps a dynamic set of words whose cumulative probability reaches P.

  • If Top-P = 0.9: The model goes down the list:
    1. mat (0.5) - Sum is 0.5. Keep going.
    2. floor (0.2) - Sum is 0.7. Keep going.
    3. couch (0.1) - Sum is 0.8. Keep going.
    4. bed (0.1) - Sum is 0.9. Stop.

The pool is now [mat, floor, couch, bed].

Why is this better?

  • In a clear sentence (“The capital of France is…”), the top word Paris might have 99% probability. The Top-P pool contains only Paris.
  • In a vague sentence (“The meaning of life is…”), the probability is spread thinly across many plausible words. The Top-P pool expands to include dozens of them, allowing for variety.

Summary: How to Tune

  • Precision Task (Code/Facts): Temp = 0.0, Top-P = 1.0.
  • Creative Task (Story): Temp = 0.7-0.9, Top-P = 0.9.
  • Wild Brainstorming: Temp = 1.2+.

Most users only need to tweak Temperature. Leave Top-P alone unless you know what you’re doing.