RLHF — Reinforcement Learning from Human Feedback
Training a New Employee with Daily Reviews
8 min read
Imagine a new hire whose manager reviews and rates every email they draft. Week by week, the drafts get better.
The manager doesn't rewrite the emails. They just say 'this one is great, this one is tone-deaf, this one is too long.' The employee adjusts based on feedback. Over time, they write exactly the kind of emails the manager loves. RLHF is this process at scale — humans rate AI responses, and the AI learns to produce the kinds of responses humans prefer.
In Plain English
RLHF is how AI models like ChatGPT and Claude are trained to be helpful, harmless, and honest. Human raters compare AI responses and pick the better one, and the model is updated to produce more of what humans prefer.
The Technical Picture
RLHF involves three stages: supervised fine-tuning on demonstrations, training a reward model on human preference rankings, and using Proximal Policy Optimisation (PPO) or similar RL algorithms to update the language model to maximise the learned reward signal.
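The second stage, the reward model, is the heart of RLHF: it turns "humans picked response A over response B" into a scalar score an RL algorithm can maximise. Below is a minimal sketch of that idea using the standard pairwise (Bradley-Terry) preference loss. The tiny linear model and toy feature vectors are illustrative stand-ins for a real transformer and its embeddings, not anyone's production code.

```python
import math
import random

random.seed(0)

DIM = 4
weights = [0.0] * DIM  # reward model parameters (stand-in for a network)

def reward(features):
    """Scalar reward for a response; here just a linear score."""
    return sum(w * x for w, x in zip(weights, features))

def preference_loss(chosen, rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the model
    scores the human-preferred response higher than the rejected one."""
    margin = reward(chosen) - reward(rejected)
    return math.log(1 + math.exp(-margin))

def train_step(chosen, rejected, lr=0.1):
    """One gradient-descent step on the pairwise loss."""
    margin = reward(chosen) - reward(rejected)
    # d(loss)/d(margin) = -sigmoid(-margin)
    grad = -1.0 / (1.0 + math.exp(margin))
    for i in range(DIM):
        weights[i] -= lr * grad * (chosen[i] - rejected[i])

# Toy preference data: feature 0 and 2 happen to mark "chosen" responses.
pairs = [([1, 0, 1, 0], [0, 1, 0, 1]) for _ in range(50)]
for chosen, rejected in pairs:
    train_step(chosen, rejected)

# After training, the preferred response earns the higher reward.
assert reward([1, 0, 1, 0]) > reward([0, 1, 0, 1])
```

In stage three, PPO then updates the language model to maximise this learned reward, typically with a KL penalty that keeps the model from drifting too far from its supervised starting point.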
Real-World Examples
- ChatGPT's helpfulness and safety alignment were shaped primarily by RLHF
- Claude is trained with Constitutional AI, a related technique in which AI feedback guided by written principles supplements human ratings, to stay honest and harmless
- Human raters at Anthropic and OpenAI rank thousands of response pairs daily
RLHF is how AI models learn human preferences — not just correctness, but tone, safety, and helpfulness.