
What is Reinforcement Learning from Human Feedback?

Reinforcement Learning from Human Feedback (RLHF) is a training technique where human evaluators rate AI outputs and those ratings are used to improve the model's behavior.

What is the core idea behind RLHF?

Humans teach AI what good looks like.

How does RLHF differ from related concepts?

Concept | Difference
RLHF vs Pre-training | Pre-training learns language. RLHF learns preferences.
RLHF vs Fine-tuning | Fine-tuning uses examples. RLHF uses comparative rankings.
RLHF vs Constitutional AI | RLHF uses human judges. Constitutional AI uses AI judges with rules.
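
To make "comparative rankings" concrete, here is a minimal sketch of one common way a pairwise human preference can be turned into a training signal for a reward model: a Bradley-Terry style loss. The function names and the scalar reward scores are illustrative assumptions, not taken from any particular lab's implementation.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Modelled probability that the human-preferred response wins.

    Assumption: a Bradley-Terry model, i.e. a sigmoid of the reward margin.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable sigmoid for both signs of the margin.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice: -log(sigmoid(margin))."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical reward-model scores for two responses to the same prompt.
    print(pairwise_loss(2.0, 0.5))  # reward model agrees with the human: lower loss
    print(pairwise_loss(0.5, 2.0))  # reward model disagrees: higher loss
```

The loss shrinks as the reward model scores the human-preferred response higher than the rejected one, which is how rankings (rather than labelled examples) steer training in this formulation.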

How does RLHF work?

What are the limitations of RLHF?

Why is RLHF important?

RLHF played a central role in turning raw pre-trained language models into useful, aligned assistants such as ChatGPT and Claude. It is one of the most widely used methods for making AI helpful, harmless, and honest.

How is RLHF used in practice?

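RLHF is used by OpenAI, Anthropic, Google, and most major AI labs. Variations include DPO (Direct Preference Optimization), which trains on preference pairs directly without a separate reward model, and RLAIF (Reinforcement Learning from AI Feedback), which replaces human raters with AI-generated feedback.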
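
As a rough illustration of the DPO variation, the sketch below computes the DPO loss for a single preference pair from summed log-probabilities. All names and the default `beta` value are assumptions chosen for illustration; real implementations operate on batches of token-level log-probabilities from a trainable policy and a frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed strength of the pull toward the reference model
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Inputs are log-probabilities of each full response under the policy being
    trained and under the frozen reference model.
    """
    # How much more (or less) likely each response became relative to the reference.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(logit)), computed as a numerically stable softplus(-logit).
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))


if __name__ == "__main__":
    # Hypothetical log-probabilities: the policy has moved toward the chosen response.
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

When the policy is identical to the reference model the loss equals log 2, and it falls as the policy raises the probability of the preferred response relative to the rejected one.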

Frequently Asked Questions

Why is RLHF necessary?
Without RLHF, pre-trained language models often produce outputs that are technically fluent but unhelpful, evasive, or potentially harmful. RLHF teaches the model what humans actually find useful and appropriate.
Are there alternatives to RLHF?
Yes. Direct Preference Optimization (DPO), Constitutional AI, and RLAIF are actively being developed as alternatives or complements to traditional RLHF.