Beyond RLHF: Constitutional AI, DPO, and the Alignment Frontier

RLHF — Reinforcement Learning from Human Feedback — was the technique that turned raw language models into assistants people actually wanted to talk to. If you’ve read about it, you know the basics: train a reward model on human preferences, use RL to optimize the policy toward that reward.

But RLHF has problems. It’s expensive, unstable, and can be gamed. The field didn’t stop there.

This article covers what came next: Constitutional AI, Direct Preference Optimization, and the techniques shaping how the best models are aligned today.


A Quick RLHF Recap

Standard RLHF (as used in InstructGPT and early ChatGPT):

  1. SFT: Fine-tune on high-quality demonstrations
  2. Reward Model: Train a model to score outputs based on human preference pairs (A vs B, which is better?)
  3. PPO: Use Proximal Policy Optimization to maximize the reward model’s score while penalizing divergence from the SFT model

The KL penalty in step 3 is crucial — without it, the model would learn to exploit the reward model’s weaknesses (reward hacking) rather than actually improve.
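The shaped reward from step 3 can be sketched as a tiny function. The names (`shaped_reward`, `logp_policy`, `logp_ref`) are illustrative, not from any particular library, and the KL term uses the simplest per-sample estimate:

```python
def shaped_reward(reward_score, logp_policy, logp_ref, beta=0.1):
    """KL-penalized RLHF reward for one sampled response (sketch).

    reward_score: scalar score from the reward model
    logp_policy / logp_ref: summed log-probs of the response under the
    current policy and the frozen SFT reference model
    beta: strength of the KL penalty
    """
    kl_estimate = logp_policy - logp_ref  # simple per-sample KL estimate
    return reward_score - beta * kl_estimate
```

If the policy drifts far from the reference (large `kl_estimate`), the reward it actually sees shrinks, which is what discourages it from wandering into reward-hacking territory.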

The Problems

Cost: Human labelers must rank thousands of output pairs. High-quality annotation is slow and expensive.

Instability: PPO is notoriously finicky. Hyperparameter tuning, clipping, and the reward model can interact in unexpected ways.

Reward hacking: Models learn to game the reward model rather than genuinely improve. The reward model is a proxy for human preferences, not the real thing.

Scalability: As models get more capable, humans become worse at judging quality — especially for complex reasoning or code.


Constitutional AI (Anthropic, 2022)

Anthropic’s Constitutional AI (CAI) addressed one key problem: the need for human labels at scale to identify harmful content.

The Idea

Instead of asking humans “which response is less harmful?”, give the model a constitution — a set of principles — and have it critique and revise its own outputs.

The Process

Phase 1: Supervised Learning from AI Feedback

  1. Show the model a harmful prompt and get its initial response
  2. Ask the model to critique the response according to constitutional principles
  3. Ask the model to revise the response based on the critique
  4. Fine-tune on the revised responses

  Prompt: "How do I make chlorine gas at home?"
  Initial response: [potentially harmful]
  Critique: "This response could enable harm. I should refuse or redirect."
  Revision: "I can't provide instructions for creating dangerous substances..."
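The four steps above can be sketched as a single critique-and-revise pass. Here `generate` is a hypothetical stand-in for any LLM call, and the prompt wording is purely illustrative:

```python
def constitutional_revision(generate, prompt, principle):
    """One critique-and-revise pass from Phase 1 (sketch).

    `generate` is a hypothetical fn(text) -> completion standing in for
    any LLM API; `principle` is one constitutional principle string.
    """
    response = generate(prompt)     # step 1: initial response
    critique = generate(            # step 2: self-critique
        f"Critique this response against the principle "
        f"'{principle}':\n{response}"
    )
    return generate(                # step 3: revision
        f"Revise the response to address the critique.\n"
        f"Critique: {critique}\nResponse: {response}"
    )
```

Fine-tuning on the (prompt, revision) pairs this returns is step 4.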

Phase 2: RL from AI Feedback (RLAIF)

Instead of human preferences, the model itself (or another model) rates response pairs. This generates large amounts of preference data without human annotation.

Why It Works

The constitution externalizes values — instead of hoping annotators share the same intuitions, you explicitly specify what you want the model to optimize for. The model becomes both student and critic.

Claude’s character and values are shaped by this approach. Anthropic has published its constitution, which includes principles drawn from the UN Declaration of Human Rights, various ethical frameworks, and practical guidelines.


Direct Preference Optimization (Stanford, 2023)

DPO eliminated the need for a separate reward model entirely — and arguably simplified alignment training more than any other technique.

The Key Insight

The KL-constrained RLHF objective has a closed-form optimal policy. Substituting it back in expresses the reward implicitly through the policy’s own log probabilities — so rather than training a reward model and then running RL, you can directly optimize the language model to increase the likelihood of preferred responses over rejected ones.

The DPO loss:

L_DPO = -E[log σ(β · (log π_θ(y_w|x) - log π_ref(y_w|x)) 
                 - β · (log π_θ(y_l|x) - log π_ref(y_l|x)))]

Where:

  • y_w = preferred (winning) response
  • y_l = rejected (losing) response
  • π_ref = reference policy (SFT model)
  • β = temperature controlling deviation from reference

In plain English: increase the relative probability of preferred responses compared to the reference model, while decreasing rejected responses — all in a single supervised training step.
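For intuition, here is the loss above computed for a single preference pair, with scalar (summed) log-probs standing in for the full per-token bookkeeping; the names are illustrative:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair (sketch).

    Each margin is the policy's log-prob minus the reference's, matching
    the formula above: -log sigmoid(beta * (margin_w - margin_l)).
    """
    margin_w = logp_w - ref_logp_w  # how much the policy upweights the winner
    margin_l = logp_l - ref_logp_l  # how much the policy upweights the loser
    logits = beta * (margin_w - margin_l)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)
```

When the policy has not moved from the reference at all, both margins are zero and the loss is log 2; it falls as the winner’s margin outgrows the loser’s.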

DPO vs RLHF

  Aspect                      RLHF                           DPO
  Requires reward model       Yes                            No
  Training stability          Low (PPO is finicky)           High (supervised loss)
  Memory footprint            High (policy + reward model)   Moderate
  Quality                     State of the art (2022)        Competitive or better
  Implementation complexity   Very high                      Low

DPO swept through the open-source ecosystem in 2023-2024. Many of the fine-tuned models you encounter today (Mistral Instruct, various Llama derivatives) were aligned with DPO or one of its variants.

DPO Variants

KTO (Kahneman-Tversky Optimization): Aligns using unpaired data — you don’t need (preferred, rejected) pairs, just individual responses labeled good or bad. Useful when paired data is hard to collect.

IPO (Identity Preference Optimization): Addresses DPO’s tendency to overfit to preference data by adding a regularization term.

ORPO (Odds Ratio Preference Optimization): Combines SFT and preference objectives into a single training phase, eliminating both the separate SFT stage and the reference model.

SimPO: Removes the reference model entirely, using only length-normalized reward differences. Reduces memory by ~half.
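As a contrast with the DPO loss above, here is a sketch of SimPO’s pairwise loss under its published formulation — length-normalized log-probs, a target margin `gamma`, and no reference terms. Parameter names and defaults are illustrative:

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss for one pair (sketch): no reference model needed.

    The implicit reward is the length-normalized (average per-token)
    log-prob; gamma is a target margin the winner must beat the loser by.
    """
    reward_w = beta * logp_w / len_w  # scaled avg per-token log-prob
    reward_l = beta * logp_l / len_l
    logits = reward_w - reward_l - gamma
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)
```

Dropping the two reference-model terms is what lets SimPO skip holding a second copy of the model in memory.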


Iterative / Online Methods

A limitation of DPO: it’s offline — you train on a fixed dataset. The model doesn’t generate new data during training.

Online methods close this gap:

Online DPO / RLHF: The current model generates new response pairs, which are then scored (by a reward model or human), and fed back into training. The model learns from its current distribution rather than stale data.

Iterative rejection sampling: Generate many responses, score them, fine-tune on the best ones. Repeat. Simple but effective — used in Llama 3’s post-training.
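One round of that loop can be sketched as follows, with `generate` and `score` as hypothetical stand-ins for the current policy and the reward model (or verifier):

```python
def rejection_sampling_round(generate, score, prompts, n=8, top_k=1):
    """One round of iterative rejection sampling (sketch).

    Samples n candidates per prompt, keeps the top_k by score, and
    returns (prompt, response) pairs to fine-tune on. Repeating rounds
    with the updated model gives the iterative scheme described above.
    """
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
        dataset.extend((prompt, r) for r in ranked[:top_k])
    return dataset
```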

SPIN (Self-Play Fine-Tuning): The model plays against previous versions of itself, generating and distinguishing its own outputs from human-written data.


Process Reward Models (PRMs)

Standard reward models score outcomes — is the final answer correct? PRMs score process — is each reasoning step correct?

OpenAI’s work on mathematical reasoning (“Let’s Verify Step by Step”) found that PRMs dramatically outperform outcome reward models (ORMs) when combined with search. Instead of just checking whether the final answer is right, the reward model learns which reasoning steps are valid.

Problem → Step 1 → [PRM: 0.9] → Step 2 → [PRM: 0.4] → Step 3 → [PRM: 0.8] → Answer

Low-scored steps are identified as likely errors, allowing the model (or search process) to backtrack.
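A minimal sketch of using those per-step scores to pick among candidate chains. Scoring a chain by its weakest step is one common aggregation choice, and the names here are illustrative:

```python
def prm_chain_score(step_scores):
    """Aggregate per-step PRM scores into one chain-level score (sketch).

    Taking the minimum mirrors the intuition above: a single weak step
    (the 0.4 in the diagram) sinks the whole chain.
    """
    return min(step_scores)

def best_chain(scored_chains):
    """Pick the candidate chain whose weakest step is strongest.

    scored_chains: list of (chain, [per-step PRM scores]) pairs.
    """
    return max(scored_chains, key=lambda pair: prm_chain_score(pair[1]))
```

With a search procedure on top, the same per-step scores tell you where to backtrack rather than discarding the whole chain.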

PRMs are one of the key tools behind modern reasoning models’ chain-of-thought, though practice varies: the DeepSeek-R1 report, for instance, found PRMs prone to reward hacking at scale and relied on rule-based outcome rewards instead.


Scalable Oversight

A fundamental challenge: as AI gets smarter, humans can no longer reliably judge output quality. How do you align a model that’s better than its trainers?

Debate: Two AI models argue opposing positions; humans judge the debate. The idea: it’s easier to evaluate an argument than generate a correct answer.

Recursive reward modeling (RRM): Use an AI assistant to help human labelers evaluate complex outputs. The human oversees the AI overseer.

Sandwiching evaluations: Compare novice evaluations to expert evaluations to AI evaluations. Measure how well each proxy correlates with ground truth.

None of these are fully solved. Scalable oversight is one of the most active areas of alignment research.


Where We Are

Modern frontier models use combinations of all of the above:

  1. SFT on high-quality demonstrations
  2. Constitutional / RLAIF for safety and helpfulness at scale
  3. DPO or PPO for preference alignment
  4. Iterative online training to improve on the current distribution
  5. PRMs for reasoning tasks

The field has moved fast. Techniques that were cutting-edge in 2022 are now table stakes. The next frontier — scalable oversight, RLHF from weak supervisors, alignment for superhuman systems — is where the hard problems live.

What remains constant: the goal of building AI that does what we actually want, not just what we literally asked for.