RLHF — Reinforcement Learning from Human Feedback
Training a New Employee with Daily Reviews
8 min read
Imagine a new hire whose manager reviews and rates every email they draft. Week by week, the drafts get better.
The manager doesn't rewrite the emails. They just say 'this one is great, this one is tone-deaf, this one is too long.' The employee adjusts based on feedback. Over time, they write exactly the kind of emails the manager loves. RLHF is this process at scale — humans rate AI responses, and the AI learns to produce the kinds of responses humans prefer.
In Plain English
RLHF is how AI models like ChatGPT and Claude are trained to be helpful, harmless, and honest. Human raters compare AI responses and pick the better one, and the model is updated to produce more of what humans prefer.
The Technical Picture
RLHF involves three stages: supervised fine-tuning on demonstrations, training a reward model on human preference rankings, and using Proximal Policy Optimisation (PPO) or similar RL algorithms to update the language model to maximise the learned reward signal.
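The second stage, the reward model, is the heart of RLHF: it turns "humans picked response A over response B" into a scalar score an RL algorithm can maximise. Below is a minimal sketch of that idea using the standard pairwise (Bradley-Terry) preference loss. The tiny linear model and toy feature vectors are illustrative stand-ins for a real transformer and its embeddings, not anyone's production code.

```python
import math
import random

random.seed(0)

DIM = 4
weights = [0.0] * DIM  # reward model parameters (stand-in for a network)

def reward(features):
    """Scalar reward for a response; here just a linear score."""
    return sum(w * x for w, x in zip(weights, features))

def preference_loss(chosen, rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the model
    scores the human-preferred response higher than the rejected one."""
    margin = reward(chosen) - reward(rejected)
    return math.log(1 + math.exp(-margin))

def train_step(chosen, rejected, lr=0.1):
    """One gradient-descent step on the pairwise loss."""
    margin = reward(chosen) - reward(rejected)
    # d(loss)/d(margin) = -sigmoid(-margin)
    grad = -1.0 / (1.0 + math.exp(margin))
    for i in range(DIM):
        weights[i] -= lr * grad * (chosen[i] - rejected[i])

# Toy preference data: feature 0 and 2 happen to mark "chosen" responses.
pairs = [([1, 0, 1, 0], [0, 1, 0, 1]) for _ in range(50)]
for chosen, rejected in pairs:
    train_step(chosen, rejected)

# After training, the preferred response earns the higher reward.
assert reward([1, 0, 1, 0]) > reward([0, 1, 0, 1])
```

In stage three, PPO then updates the language model to maximise this learned reward, typically with a KL penalty that keeps the model from drifting too far from its supervised starting point.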
Real-World Examples
- ChatGPT's helpfulness and safety alignment were shaped primarily by RLHF
- Claude is trained with Constitutional AI, a related technique in which AI feedback guided by written principles supplements human ratings, to stay honest and harmless
- Human raters at Anthropic and OpenAI rank thousands of response pairs daily
RLHF is how AI models learn human preferences — not just correctness, but tone, safety, and helpfulness.