Reinforcement Learning from Human Feedback (RLHF)

Anthony Sandesh
Large language models have made astounding progress in generating text, but deciding what makes a “good” output is often subjective and context-dependent. Reinforcement Learning from Human Feedback (RLHF) is a technique that uses human preferences as a guide for training models – essentially using reinforcement learning (RL) to directly optimize a model’s behavior based on what humans find desirable (huggingface.co). Instead of relying on proxy metrics (like BLEU or ROUGE scores), RLHF lets us use human feedback as the reward function for the model (huggingface.co). This approach has been critical in aligning AI systems with human values and intentions – for example, the impressive helpfulness and fluency of ChatGPT is largely attributed to RLHF fine-tuning on top of a base language model. In this post, we’ll explain how RLHF works in practical, technical terms, walk through the RLHF pipeline with a concrete example, explore real-world implementations at OpenAI, DeepMind, and Meta, and discuss tools and challenges for developers interested in RLHF.

What is RLHF?

At its core, RLHF is about treating a language model’s training process as a reinforcement learning problem where human preference is the measure of success. In standard RL, an agent learns by taking actions in an environment and receiving reward signals; in RLHF, the agent is the language model and the environment is implicitly defined by human feedback. We generate outputs from the model and ask humans (or a proxy model of human judgment) to rate or rank those outputs. These human-generated evaluations are then used as a reward signal to adjust the model’s parameters, so that it learns to produce more preferred outputs in the future (huggingface.co). In other words, we are optimizing the model to maximize a reward that represents human preferences for what good responses look like.
This approach has proven especially useful for aligning pre-trained language models (which may output anything that was seen in their diverse internet training data) with the kind of responses humans actually want. For example, a raw GPT-3 model might generate irrelevant or unsafe answers because it was trained only to predict the next token, not to follow instructions or behave helpfully. By applying RLHF, we can fine-tune such a model so that it learns to be more helpful, correct, and harmless, as determined by human raters. This method was pioneered in research and is now a key to training aligned conversational agents – OpenAI’s InstructGPT and ChatGPT models were fine-tuned using RLHF, which dramatically improved their ability to follow user instructions and produce preferred answers (openai.com, adaptive-ml.com).

Key Components of an RLHF Pipeline

RLHF involves a multi-stage training process with several moving parts. Let’s break down the main components involved in the typical RLHF pipeline:
  • Pretrained Base Model (Policy): RLHF starts with a language model that has been pretrained on a large text corpus. This is the initial model (often called the policy in RL terms) that we want to fine-tune. It’s usually a large transformer like GPT or LLaMA with strong general language capabilities. For instance, OpenAI began RLHF experiments with a smaller version of GPT-3 for InstructGPT (huggingface.co), DeepMind applied RLHF to their 280B-parameter Gopher model (huggingface.co), and Meta used their pretrained LLaMA models as the base for LLaMA-2 Chat. The base model is powerful but “untamed” – it wasn’t explicitly trained to align with human preferences or follow instructions out-of-the-box.
  • Human Annotators & Preference Data: Humans play a critical role by providing the feedback data that will guide the model. There are two main kinds of human-provided data in RLHF:
    • Demonstrations: In many setups, human annotators first provide high-quality example responses to various prompts. These demonstrations can be used to supervise the model (via supervised fine-tuning) to give it an initial behavior closer to the desired style.
    • Preference Comparisons: The core of RLHF is usually a dataset of comparisons, where human labelers are shown multiple model-generated outputs for the same prompt and asked which output is better. These comparisons provide training data for the reward model. For example, OpenAI labelers would look at two answers that a GPT-3 model produced for a given question and decide which answer they prefer (openai.com). The human judgments may consider criteria like relevance, correctness, clarity, and harmlessness. Each comparison effectively tells us: “Out of these outputs, this one is more aligned with what a human wants to see.”
    • Gathering this data is a non-trivial effort – it often involves hiring crowdworkers or domain experts and carefully defining guidelines for them. The quality and consistency of human annotations directly affect the outcome. In practice, tens of thousands of comparison samples might be collected (OpenAI’s InstructGPT used ~40k comparisons (huggingface.co), and Anthropic has released a dataset with >160k preference labels for Helpful/Harmless dialogue (huggingface.co)). This human feedback is expensive and can be slow to gather, which is one of the bottlenecks of RLHF (huggingface.co).
  • Reward Model (Preference Model): The next component is a reward model (RM) – a model that takes in a piece of text (often the prompt and a candidate response) and outputs a single scalar value indicating how desirable that response is. The reward model is trained on the human-provided preference data. Typically, it’s a neural network initialized from the same pretrained model (so it understands language) and then fine-tuned on the comparison dataset: the model learns to predict which output in a pair the humans preferred. Essentially, the reward model internalizes human preferences – after training, you can feed it a prompt and a candidate answer, and it will output a score (higher means “humans would like this answer better”) (huggingface.co). This turns the otherwise qualitative human feedback into a quantitative reward function that our RL algorithm can work with.
    • The reward model is crucial but imperfect – it’s only as good as the data and criteria the humans used. It might overlook subtleties or be vulnerable to exploits (models can learn to game the reward model in unintended ways, a phenomenon known as reward hacking). Nonetheless, it provides a workable approximation of human preferences that lets us automate the evaluation of the policy’s outputs. In some projects, separate reward models are trained for different aspects, e.g. one for helpfulness and one for safety (interconnects.ai), and their outputs might be combined.
  • Policy Optimization Algorithm: With a reward model in hand, we can finally fine-tune the base model using reinforcement learning. The base model, now viewed as a policy, generates outputs for given prompts, and we use the reward model’s score as the “reward signal” to adjust the policy weights. Training is often done with policy gradient methods from deep RL. A popular choice (used by OpenAI and others) is Proximal Policy Optimization (PPO) (openai.com), which is a stable and efficient RL algorithm. PPO iteratively updates the policy network to maximize the expected reward, while ensuring the updates are not too large (to avoid destabilizing the model’s language generation quality). In practice, PPO in the RLHF context involves several techniques:
    • The policy model is initialized from the pretrained (or SFT-fine-tuned) model and then gradually optimized to get higher reward model scores.
    • A value function (often a copy of the model with an extra scalar head) is trained alongside to predict the reward (this helps reduce variance in training, as PPO is an actor-critic method).
    • A reference model (usually a frozen copy of the original model) is used to compute a KL-divergence penalty – this term in the loss ensures the policy doesn’t stray too far from the original language distribution (adaptive-ml.com). This is important to prevent the fine-tuned model from over-optimizing against the reward model (which could lead to nonsensical outputs that trick the reward model). The KL penalty keeps the new policy’s answers close to what the base model might have produced, acting as a regularizer for naturalness. Most RLHF implementations include this KL term to maintain a balance between following the reward model and staying grounded in human-like text (cameronrwolfe.substack.com).
    • Other RL algorithms can be used in place of PPO – for instance, Meta’s LLaMA-2 Chat pipeline combined rejection sampling and PPO (more on this later), and research libraries have explored techniques like A2C, TRPO, or off-policy methods like Q-learning adapted to language (e.g. ILQL – Implicit Language Q-Learning) (huggingface.co). But PPO has become the de facto standard for online policy training in RLHF because of its proven stability and simplicity in the context of large models (interconnects.ai).
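The KL-penalized reward described above fits in a few lines of code. Here is a minimal, library-free sketch – the reward model score and the per-token log-probabilities are toy stand-ins for real model outputs, and `beta` is a hypothetical coefficient value:

```python
import math

def shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward used in PPO-based RLHF: the reward model's score minus a
    KL penalty that keeps the policy close to the reference model.

    policy_logprobs / ref_logprobs: per-token log-probabilities each model
    assigned to the sampled response (toy stand-ins here).
    """
    # The summed per-token log-ratio is a sample-based estimate of
    # KL(policy || reference) for this response.
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl

# The policy has drifted toward tokens it now assigns higher probability:
policy_lp = [-0.5, -0.7, -0.4]   # log P_policy(token)
ref_lp    = [-1.0, -1.2, -0.9]   # log P_ref(token)
r = shaped_reward(rm_score=2.0, policy_logprobs=policy_lp, ref_logprobs=ref_lp)
# KL estimate = 1.5, so r = 2.0 - 0.1 * 1.5 = 1.85
```

A larger `beta` trades reward-model score for staying closer to the reference model; tuning it is one of the key knobs in RLHF training.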
With these components – a base model, human feedback data, a reward model, and an RL optimizer – we have all the pieces needed to perform RLHF training.

RLHF in Practice: Fine-Tuning a Language Model (Step-by-Step)

To see how these components come together, let’s walk through a concrete example of using RLHF to fine-tune a large language model. We’ll use the case of OpenAI’s InstructGPT/ChatGPT training process as a running example, which follows a common three-step RLHF pipeline:
Illustration of the RLHF training pipeline in three phases: (1) start with a pretrained model, optionally fine-tune it on human-written demonstrations (supervised learning); (2) train a reward model by collecting human preference ratings on model outputs; (3) fine-tune the model with RL (e.g. PPO) using the reward model as the reward signal.
1. Supervised Fine-Tuning on Human Demonstrations (Optional Kick-start): First, OpenAI collected a dataset of human-written demonstration responses. These were prompts taken from the real world (user submissions) and answers written by human experts to reflect the desired style (helpful, accurate, polite, etc.). By fine-tuning the pretrained model on this dataset, the model learns to produce higher-quality responses before any RL is applied (openai.com). This step, often called SFT (Supervised Fine-Tuning), isn’t strictly required for RLHF, but it provides a good initialization. It teaches the model the general format of following instructions and makes the subsequent RL step more stable. In our analogy, this is like showing the model what we want in a few examples so it doesn’t start the RL phase completely clueless about our preferences. After SFT, the model is already better at following instructions than the original pretrain-only model, but it can still be improved.
2. Reward Model Training with Human Preferences: Next comes building the reward model. Using the partially fine-tuned model from step 1 (or the base model if SFT was skipped), we generate lots of candidate answers for a variety of prompts. For each prompt, we might sample a few different responses (by varying the model sampling or using different model checkpoints). These responses are then shown to human annotators who rank or choose the best output among the set (deepmind.google, openai.com). For example, given a prompt, a labeler might see Response A and Response B from the model and decide that A is better than B. From many such comparisons, we assemble a dataset where each data point might be “Prompt P: A > B” meaning "for prompt P, humans preferred output A over output B."
Now we train the reward model on this comparison data. Typically, the reward model is trained using a pairwise loss: the model adjusts its internal weights so that it outputs a higher score for the preferred answer and a lower score for the less-preferred answer, usually using a margin or logistic loss. After training, we have a reward model that can take a prompt-answer pair and predict a single scalar reward – ideally, a higher number if the answer is one that humans would like. In our example, OpenAI trained a reward model to predict which answer a human would prefer out of two (openai.com). This model now stands in for “the human evaluator” for the next phase.
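The pairwise objective is commonly a Bradley-Terry-style logistic loss: it is near zero when the reward model scores the human-preferred answer clearly above the rejected one, and large when the ranking is backwards. A framework-free sketch with made-up scores:

```python
import math

def pairwise_loss(score_chosen, score_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the reward model
    ranks the human-preferred answer higher, large when it is backwards."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model already ranks the pair correctly -> small loss
good = pairwise_loss(score_chosen=3.0, score_rejected=0.5)
# Reward model ranks the pair backwards -> large loss, big gradient
bad = pairwise_loss(score_chosen=0.5, score_rejected=3.0)
```

In a real implementation the two scores come from the same network applied to (prompt, chosen) and (prompt, rejected), and gradients through this loss push the scores apart.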
3. Policy Fine-Tuning with RL (PPO): With the reward model ready, we move to the reinforcement learning stage. The current policy model (from step 1) will now be optimized to maximize the reward model’s score. We use an RL algorithm (PPO) to perform this optimization in iterative rounds (openai.com):
  • We feed the policy model a variety of prompts (like questions or instructions) and have it generate responses.
  • For each generated response, we compute a reward by feeding the prompt+response into the reward model. This reward is effectively the proxy for “how good was this response according to human preferences.”
  • The RL algorithm then adjusts the policy weights to increase the probability of generating responses that lead to higher rewards. Concretely, PPO will calculate gradients that slightly increase the likelihood of the chosen words that produced a good outcome and decrease the likelihood of choices that led to poor outcomes, taking care to keep changes within a safe range (that’s PPO’s trust region aspect). A value function (critic) predicts the reward to help stabilize learning, and a penalty ensures the policy doesn’t drift too far from the original model’s distribution (avoiding gibberish or off-topic rambling) (adaptive-ml.com).
  • This loop repeats for many iterations (sampling prompts, generating responses, getting rewards, updating the model). Over time, the model learns to output answers that score better and better on the reward model – meaning they increasingly align with the preferences encoded by the human raters.
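The sample-score-update loop above can be reduced to a toy: a "policy" choosing between two canned responses, a stand-in reward model, and a plain policy-gradient (REINFORCE) update instead of full PPO. Everything here – the responses, scores, learning rate, and step count – is invented for illustration:

```python
import math
import random

random.seed(0)

# Toy setup: one prompt, two canned responses, and a "reward model"
# that scores them (stand-ins for real generations and a trained RM).
responses = ["helpful answer", "off-topic answer"]
rm_scores = {"helpful answer": 1.0, "off-topic answer": 0.0}

logits = [0.0, 0.0]        # policy "parameters": one logit per response
baseline, lr = 0.0, 0.1    # running reward baseline reduces variance

def probs(logits):
    """Softmax over the two response logits."""
    z = [math.exp(l) for l in logits]
    s = sum(z)
    return [x / s for x in z]

for step in range(500):
    p = probs(logits)
    i = random.choices(range(2), weights=p)[0]  # sample a response
    reward = rm_scores[responses[i]]            # score it with the RM
    advantage = reward - baseline
    baseline += 0.01 * (reward - baseline)
    # REINFORCE update: d log pi(i) / d logit_j = 1{j == i} - p[j]
    for j in range(2):
        grad = (1.0 if j == i else 0.0) - p[j]
        logits[j] += lr * advantage * grad

# After training, the policy strongly prefers the RM-favored response.
```

PPO adds a clipped objective, a learned critic, and the KL penalty on top of this basic gradient, but the direction of the update – push probability mass toward high-reward outputs – is the same.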
OpenAI’s InstructGPT work followed this process: starting with a GPT-3 model, fine-tuning on human demos, then using PPO with a learned reward model to significantly boost the model’s helpfulness and reduce undesirable outputs (openai.com). By the end of this training, the policy model has essentially absorbed the reward model’s judgment. The result was a model that humans overwhelmingly preferred over the original GPT-3 when given instruction-following tasks (openai.com). This RLHF-tuned model is what became ChatGPT after further refinements – a model that is much more aligned with user needs and values.

Real-World Applications of RLHF

RLHF has moved from research labs to real production systems, especially in the realm of large language models and conversational AI. Here we highlight how a few major AI organizations have applied RLHF in practice:

OpenAI: InstructGPT and ChatGPT

OpenAI has been at the forefront of RLHF for language models. Their January 2022 paper “Training Language Models to Follow Instructions with Human Feedback” introduced InstructGPT, which was a version of GPT-3 fine-tuned with RLHF (openai.com). They collected human demonstrations and comparison data via their API (prompts and model outputs labeled by humans) and trained both a reward model and the GPT-3 policy using PPO (openai.com). The resulting InstructGPT models were significantly preferred by users over the original GPT-3, following instructions much more reliably and producing fewer toxic or factually wrong outputs (openai.com). InstructGPT also showed gains in truthfulness (e.g., higher scores on TruthfulQA) and a reduction in harmful completions compared to GPT-3 (openai.com).
Building on that success, OpenAI applied the same RLHF recipe at a larger scale to create ChatGPT (first released in late 2022). ChatGPT can be seen as a direct descendant of InstructGPT, further refined for dialogue. It was fine-tuned with human feedback to produce answers in a conversational format, handle follow-up questions, refuse inappropriate requests, etc. According to OpenAI, RLHF was the key technique that “unlocked” ChatGPT’s capabilities to align with user intentions in dialogue (adaptive-ml.com). The model’s behavior – being helpful, detailed, and comparatively safe – is largely due to this alignment process. In fact, RLHF has been so effective that it’s now considered the go-to method for aligning large language models with human intent (openai.com). However, OpenAI also acknowledges that RLHF is not perfect; ChatGPT and its successors still have limitations (they can produce errors or be coaxed into problematic outputs), and research is ongoing to make alignment more robust.

Google DeepMind: Sparrow and Beyond

Google’s DeepMind has also explored RLHF to align dialogue agents with human preferences, particularly focusing on safety. A notable example is Sparrow, a dialogue agent described in a 2022 DeepMind research paper. Sparrow was designed to be more helpful, correct, and harmless by learning from human feedback (deepmind.google). In training, human participants were asked to chat with Sparrow and then provide preference judgments on Sparrow’s answers (for instance, preferring answers that were more accurate or followed certain rules). DeepMind used these preferences to train a reward model of answer “usefulness,” and then fine-tuned Sparrow’s dialogue policy with RL to maximize this reward (deepmind.google). They also introduced a set of rules (like “don’t give hateful or harassing responses” and “don’t claim to be a person”) and trained an additional classifier (a “rule model”) to penalize rule-breaking answers (deepmind.google). By combining RLHF with rule-based constraints, Sparrow achieved much safer behavior: it was significantly less likely to produce disallowed responses compared to a baseline model, while still giving helpful answers with supporting evidence in many cases (deepmind.google).
Beyond Sparrow, DeepMind and Google have used human feedback in other domains – for example, learning from human preferences to improve summary quality (as in OpenAI’s Learning to Summarize from Human Feedback project (openai.com), a line of work DeepMind has also built on) and to align their own subsequent conversational models. The general pattern is the same: start with a powerful pretrained model and reinforce its good behaviors via human-scored feedback. As models approach human-level competency on many tasks, feedback from humans (or human-trained models) will be an essential tool to ensure these AI systems remain helpful and safe.

Meta: LLaMA 2-Chat

Meta (Facebook) has embraced RLHF in the development of its open large language models. In July 2023, Meta released LLaMA 2, including fine-tuned chat variants of those models (7B, 13B, and 70B parameter versions) that were optimized for dialogue. The LLaMA-2-Chat models were trained with a combination of supervised fine-tuning and RLHF to align the base LLaMA models with human preferences for helpfulness and safety (huggingface.co). According to Meta’s documentation, they collected over one million human annotations (prompts and preferred responses) to build reward models for different axes like helpfulness and harmlessness (interconnects.ai). Uniquely, Meta applied RLHF in multiple stages: they first did several rounds of Rejection Sampling fine-tuning, and then a final round of PPO training (interconnects.ai). In the rejection sampling phase, for a given prompt, the current model would generate K possible responses; the reward model would score them, and the highest-scoring response was used as a pseudo-target to further fine-tune the model (this is like “best-of-N” optimization using the reward model as a filter). They repeated this process (iteratively improving the model and the reward model), and only after a few cycles did they apply PPO on top of the refined model for an extra boost in alignment (arxiv.org, interconnects.ai). This approach allowed Meta to utilize the reward model’s judgments in an offline way (reusing generated data) before doing the more expensive online RL. The end result, LLaMA-2-Chat, has demonstrated competitive performance with other AI chatbots and significantly improved safety behavior versus the base models (huggingface.co). Meta’s deployment of RLHF in an open model is notable because they also open-sourced the data (to an extent) – enabling the community to study and build upon their human feedback efforts.
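The "best-of-N" selection at the heart of the rejection sampling phase is simple to sketch: score K candidates with the reward model and keep the winner as a fine-tuning target. Below, both the candidate responses and the reward function are toy stand-ins invented for illustration (a real pipeline samples candidates from the policy and scores them with a trained RM):

```python
def toy_reward(prompt, response):
    """Stand-in reward model: here it just favors answers that
    mention the right entity, with a small bonus for detail."""
    return ("Paris" in response) + 0.01 * len(response)

def best_of_n(prompt, candidates, reward_fn):
    """Rejection sampling / best-of-N: keep the highest-scoring
    candidate as a pseudo-target for supervised fine-tuning."""
    return max(candidates, key=lambda r: reward_fn(prompt, r))

candidates = [
    "France? Great question!!!",
    "Paris is the capital of France.",
    "The capital of France is Paris, on the Seine.",
]
best = best_of_n("What is the capital of France?", candidates, toy_reward)
```

Fine-tuning on `best` (rather than on all K samples) is what lets the reward model's judgment be exploited offline, before any online RL step.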
Other examples: Many other organizations are exploring RLHF. Anthropic’s Claude assistant is trained with a variant of RLHF (they call it “Constitutional AI” – using an AI feedback model guided by written principles instead of direct human comparison for some stages). Microsoft has used RLHF techniques in aligning Bing Chat (which is built on OpenAI’s models). Even smaller projects and academia have tried applying RLHF to fine-tune open-source models (like OpenAssistant and Stanford’s Alpaca) using human-like preference data. The common theme is that RLHF provides a powerful lever to adjust model behavior in ways pure pretraining cannot, by explicitly optimizing for what humans want.

Tools, Libraries, and Resources for RLHF

For developers interested in experimenting with RLHF, there are a growing number of open-source tools and frameworks that simplify the process of reward modeling and RL fine-tuning:
  • OpenAI lm-human-preferences (2019): OpenAI released early code for RLHF in 2019 (in TensorFlow) as part of their research into fine-tuning language models from human preferences (huggingface.co). While not widely used today, it was one of the first reference implementations of PPO for language generation and helped inspire other projects.
  • Hugging Face TRL (Transformer RL): The Hugging Face team provides the trl library, which is a popular toolkit for RLHF built on PyTorch. TRL makes it easy to take a pretrained Transformer (from the Hugging Face Hub) and fine-tune it with PPO using a custom reward function (huggingface.co). It abstracts a lot of boilerplate – you define your reward model (or a reward function) and the PPOTrainer helps handle generating text, computing rewards, and backpropagating through the model. Developers have used TRL to replicate results like OpenAI’s summarization with human feedback and to fine-tune chat models on smaller scales.
  • TRLX (by CarperAI): trlx is an extended fork of TRL developed by CarperAI (an EleutherAI-affiliated research group) (huggingface.co). It was designed to handle larger models and more advanced algorithms. TRLX provides support for distributed training and can work with models up to tens of billions of parameters (they mention plans up to 200B) (huggingface.co). It also includes implementations of algorithms beyond PPO – for example, ILQL (Implicit Language Q-Learning), an offline RL method that can fine-tune a model using a static dataset of (prompt, response, reward) tuples (huggingface.co). TRLX is geared towards researchers/practitioners who want to try cutting-edge RLHF techniques at scale.
  • RL4LMs: Another library is RL4LMs (Reinforcement Learning for Language Models), which provides a flexible framework to plug in different RL algorithms and reward definitions for language tasks (huggingface.co). RL4LMs comes with support for multiple algorithms (PPO, A2C, DQN-style, etc.) and has been used to systematically study RLHF on various tasks (huggingface.co). It emphasizes evaluation and research insights, offering benchmarks to detect issues like reward hacking or to compare using human demonstrations versus reward modeling data (huggingface.co). This library is great if you want to experiment beyond PPO or test new ideas for reward functions in a research context.
  • DeepSpeed-Chat (Microsoft): In 2023, Microsoft open-sourced DeepSpeed-Chat, an end-to-end toolkit for RLHF that leverages the DeepSpeed library for efficient large-scale training (medium.com). DeepSpeed-Chat provides a highly optimized RLHF pipeline (with support for multi-node distributed training, memory optimization, etc.) so that even very large models (hundreds of billions of parameters) can be fine-tuned with relatively modest infrastructure (medium.com). Their goal is to democratize RLHF training, making it “one-click” to train your own ChatGPT-like model by handling the engineering heavy lifting. If you have access to some GPU hardware, DeepSpeed-Chat could significantly speed up RLHF experiments by using 8-bit optimizers, CPU offloading, and other tricks under the hood.
  • Human Feedback Datasets: Besides code, data is an important piece. Anthropic’s HH-RLHF dataset is a public dataset containing human preference comparisons for helpful and harmless dialogue responses (huggingface.co). It’s a valuable resource if you want to try RLHF on dialogue without collecting your own data – it includes prompts and several model replies with labels of which reply was preferred. OpenAI has also released a smaller dataset of human comparisons for summary tasks on Reddit (from their “Learning to Summarize” work) (huggingface.co). Furthermore, community projects like OpenAssistant have crowd-sourced preference data. These datasets let you train reward models or even try offline RLHF approaches. When using them, keep in mind they reflect the preferences of the annotators involved (Anthropic’s data, for example, is focused on English helpers with a specific style).
  • Guides and Miscellaneous: There are “Awesome RLHF” lists (github.com) that compile papers, blogs, and tools. The Hugging Face blog Illustrating RLHF (huggingface.co) and Chip Huyen’s RLHF Explained post (huyenchip.com) are excellent reads to deepen understanding. OpenAI’s and DeepMind’s research papers (like InstructGPT (openai.com), Deep RL from Human Preferences (huyenchip.com), etc.) are great primary sources. Many of the libraries above also have example scripts – for instance, the TRL repo shows how to fine-tune GPT-2 on a toy preference task. By leveraging these open-source resources, developers can start playing with RLHF on smaller models to get a feel for it, even if they can’t match the scale of OpenAI or Meta’s projects.

Limitations and Challenges of RLHF

While RLHF has enabled large leaps in aligning AI behavior with human wishes, it comes with several limitations and challenges that are important to understand:
  • Imperfect Alignment and Safety: Models fine-tuned with RLHF are better at following instructions and avoiding blatantly bad outputs, but they are far from fully safe or truthful. They can still produce factually incorrect statements (hallucinations) or biased/harmful content in some situations (huggingface.co). RLHF optimizes for the preferences of the human raters (and the reward model), but if those preferences or the training process don’t cover a scenario, the model may still fail. InstructGPT, for example, was found to occasionally follow user instructions too literally into unsafe territory – the so-called sycophancy or misuse vulnerability (the model will do what a user asks even if it’s harmful) (openai.com). This happens because the model was trained to make users happy (follow instructions), and refusing requests reliably is a separate challenge. Overall, RLHF does not guarantee correctness or morality; it just tilts the model towards the behaviors captured in the training data. It remains an open problem to create models that know when they don’t know or that can navigate ethical dilemmas – RLHF reduces the rate of bad outputs but doesn’t eliminate them (huggingface.co).
  • Data Bottlenecks and Cost: RLHF is heavily dependent on high-quality human annotations. Obtaining tens of thousands of careful comparison labels or written demonstrations can be expensive and time-consuming (huggingface.co). Unlike pretraining data (which can be collected en masse from the web), preference data requires humans in the loop for each sample. This inherently limits how far an academic or small team can go in applying RLHF – the big successes (ChatGPT, Sparrow, LLaMA-2) have been driven by industrial-scale annotation efforts. There’s also the issue of annotator consistency: different humans might disagree on what response is best, especially for subjective or complex queries. Indeed, preference datasets often have significant variance – what one labeler prefers, another might not (huggingface.co). This noise can hinder reward model training. It also raises the question: whose preferences are we aligning to? If the annotator pool isn’t diverse, the model might become aligned to a narrow set of values and perform poorly for other users or demographics. OpenAI found that aligning to their hired labelers generalized to some extent, but they caution that broader or different groups might not agree with all decisions (openai.com). Thus, RLHF currently faces a scalability issue in obtaining enough high-quality, representative feedback to cover the vast space of possible inputs.
  • Reward Model Limitations and Gaming: The reward model is a learned proxy for human judgment – and like any ML model, it can be flawed. If the reward model doesn’t perfectly capture human preferences, the policy might exploit loopholes. This is analogous to how a game AI might find an unintended strategy that scores points. In language, a policy might learn to output text that superficially looks good to the reward model but isn’t truly helpful (since the reward model might be misled by certain phrasing or might not detect subtle errors). This is known as reward hacking. Researchers have indeed observed signs of this: for instance, a policy might excessively pad its answers with certain polite phrases that the reward model associates with good answers, without actually improving the content. Mitigating this requires careful reward model training and often the KL regularization mentioned earlier to keep the model from devolving into bizarre but high-scoring outputs. Additionally, reward models tend to be brittle – they might assign high reward outside the range of scenarios they were trained on, which can lead the policy astray. Techniques like penalizing the KL divergence from the base model help, but they don’t solve the issue entirely. Ongoing research (and libraries like RL4LMs) has focused on detecting and addressing reward hacking and training instability (huggingface.co).
  • Stability of RL Training: Training large language models with RLHF can be finicky. Language models are high-dimensional and were originally trained with supervised learning; forcing them into an RL loop sometimes leads to instability (e.g. divergence, oscillations in behavior). PPO is fairly stable, but it introduces several hyperparameters (learning rate, reward normalization, KL coefficient, etc.) that need to be balanced. If the RL step is pushed too hard, the model’s grammar or coherence can break. Empirically, many teams found they had to do RLHF somewhat gently – for example, OpenAI mentioned using only a few epochs of PPO and mixing in some supervised learning data to avoid catastrophic forgetting of general language skills (openai.com). The alignment tax is a known phenomenon: focusing on the human-preference objective can sometimes reduce performance on other tasks or make the model’s outputs less diverse (openai.com). OpenAI mitigated this by mixing a bit of the original pretraining data during RLHF fine-tuning (to retain general abilities) (openai.com). Balancing multiple objectives is tricky and somewhat task-specific.
  • Generalization of Preferences: An RLHF-tuned model is aligned to the training annotators and the scenarios seen. That doesn’t guarantee it will behave as desired in novel situations. If a user asks something that wasn’t covered in the training distribution, the model might revert to unwanted behavior or simply make something up. Also, preferences can change: what if society’s norms evolve, or you want to deploy the model in a different culture or language? The model might then be misaligned for those new settings. Adapting or re-training the reward model for new preferences is another challenge (it would need more human data). There’s active research on how to allow users to personalize the values a model follows, or to condition the model on different “preference profiles,” but doing this robustly is unsolved.
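The reward hacking failure mode described above is easy to reproduce in miniature: give a proxy reward model a shallow heuristic (say, rewarding polite phrases) and a best-of-N selector will pick padded filler over a genuinely useful answer. Both the "reward model" and the candidate answers here are toy examples invented for illustration:

```python
def naive_rm(response):
    """Flawed proxy reward model: it counts polite markers,
    a surface feature correlated with good answers in training,
    instead of measuring actual usefulness."""
    return response.count("please") + response.count("thank you")

candidates = [
    "The square root of 144 is 12.",                 # actually helpful
    "thank you thank you please please thank you!",  # reward-model bait
]
# Optimizing hard against the proxy selects the padded filler:
hacked = max(candidates, key=naive_rm)
```

This is why RLHF pipelines pair the reward model with a KL penalty toward the base model and keep refreshing the reward model with new human labels: a policy given unlimited optimization pressure will find and exploit whatever surface features the proxy rewards.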
In summary, RLHF is not a silver bullet – it introduces its own set of issues even as it solves others. It shifts the problem from “designing a perfect reward function” to “collecting good data and training a good reward model,” which is hard but arguably more tractable for complex tasks like language. Many of the challenges above are topics of ongoing research in the AI alignment community.

Future Directions and Emerging Ideas

RLHF is a young, rapidly evolving field, and there are several promising directions that aim to address its current shortcomings and extend its capabilities:
  • More Efficient and Robust RL Algorithms: Since PPO was borrowed from the RL community without being designed specifically for language generation, researchers are exploring alternatives that might be more sample-efficient or stable for RLHF. One example is ILQL (Implicit Language Q-Learning), an offline RL algorithm that learns a Q-value (or advantage) function on fixed datasets of (prompt, response, reward) tuples (huggingface.co). ILQL can utilize all the accumulated (prompt, output, reward) data without needing to constantly query the live model for new samples, which can make training cheaper (no need to run the huge model for every step). Early work by CarperAI showed ILQL can achieve similar results to PPO for smaller-scale RLHF tasks, and it’s supported in the TRLX library (huggingface.co). Other algorithms being tried include A2C (Advantage Actor-Critic) and variations of policy gradient with model-based critics. Moreover, researchers are revisiting how to better integrate uncertainty estimation in the reward model – if the reward model is unsure, the policy could be penalized for overconfidently optimizing that area. We might also see algorithms that explicitly handle the exploration problem in language (ensuring the model tries a variety of response styles instead of prematurely converging). As noted in a 2022 analysis, many RLHF design choices are not fully explored yet, and there’s room for improved optimizers beyond PPO (huggingface.co).
  • Scaling Human Feedback with AI and Automation: One way to overcome the data bottleneck is to use AI to assist or replace human feedback in some stages. Anthropic’s Constitutional AI is one approach: they generate feedback by using the model itself (or another AI) to critique outputs against a set of written principles, thus creating a sort of AI feedback loop guided by a "constitution" of rules. OpenAI has also begun integrating rule-based feedback: for instance, their 2024 research on Rule-Based Rewards (RBR) uses a set of human-written rules (like “the response should contain an apology if it’s a refusal”) to automatically judge certain aspects of output, which can supplement or replace human labelers for routine safety enforcement (openai.com). These rules or AI evaluators can be plugged into the RLHF pipeline as additional reward signals. The advantage is that rules can be updated instantly (if policy changes) without recollecting human data (openai.com). In practice, OpenAI reported using RBR in combination with traditional RLHF for training safer behaviors in GPT-4 (openai.com). We expect to see hybrid approaches where human feedback is used for subtle, hard-to-formalize judgments, and automated rewards (from heuristic rules or other models) are used for straightforward criteria. This can make alignment training more scalable and updatable.
  • Continuous and Adaptive Feedback (Online RLHF): So far, most RLHF training is done offline: collect data once, train the model, deploy it. An intriguing direction is making the model learn on the fly from user interactions. Imagine if a deployed chatbot could ask users for feedback or notice implicit signals (like whether the user rephrased a question, indicating the first answer wasn’t good) and update itself continuously. This iterative online RLHF could use techniques like online reward model updates or bandit algorithms to fine-tune with a live feedback stream. Anthropic and others have discussed the idea of ELO-style rating systems, where models continuously get compared and ranked as they chat with users (huggingface.co). However, doing this safely is challenging: learning on the fly risks destabilizing the model, and there’s a danger of the model being influenced by a small subset of users or drifting from its initial safety alignment. The dynamics of a model that is updating itself based on user input create new complexities (the model’s behavior and the feedback it gets are interdependent) (huggingface.co). In the near term, we might see more controlled versions of this, such as periodic retraining using logs of real user conversations (with crowdworkers labeling those after the fact). Over the long term, reinforcement learning at deployment could allow models to personalize and improve, if we can ensure they do so without catastrophically forgetting or misbehaving.
  • Understanding and Improving Reward Models: Since reward models are central to RLHF, another future direction is improving how we train and use them. One idea is to make reward models that factor in uncertainty or that can say “I’m not sure” when comparisons are ambiguous. Another idea is to use larger language models as reward models – for example, using GPT-4 to judge the outputs of a smaller model. This was actually tested in the LLaMA-2 paper, where they found a GPT-4 based reward model was a strong baseline (though a dedicated reward model fine-tuned on human data could outperform it in their setting) (interconnects.ai). Using powerful models to evaluate others (sometimes called AI feedback, as opposed to human feedback) is promising, especially if the powerful model has been aligned with human values. We might also see multi-objective reward models (combining helpfulness, correctness, stylistic preferences, etc.) or methods to calibrate reward models so their scores more directly translate to human satisfaction measures.
  • Transparency and Interpretability: Going forward, there’s interest in making RLHF-trained models more interpretable. Because the policy is being influenced by the reward model, researchers are analyzing what exactly the model learns during RLHF. Some studies have looked at whether RLHF mainly affects the style vs. substance of responses. Others have tried to extract rules the model seems to follow after RLHF (for example, “always apologize when refusing” as a learned behavior). By understanding this, we can potentially shape reward functions more intelligently. Also, techniques like mechanistic interpretability (opening up the model’s neurons) could identify if RLHF is causing any undesired biases to strengthen or if certain circuits are overly optimized for the reward at the expense of truth. This is still very much an open research area, but as RLHF becomes a staple of alignment, tools to audit and verify the aligned models will be crucial.
  • Broader/Conditional Preference Alignment: Currently, a model like ChatGPT is aligned to be a generally helpful AI for a broad user base. In the future, we might want models that can adapt to different ethical frameworks or personal preferences. There’s preliminary research into letting users set their own “AI Constitution” or sliders for the assistant’s behavior. RLHF could be extended by having multiple reward models (e.g., one reflecting one set of values, another reflecting a different set) and somehow allowing the model to switch or interpolate between them based on context. Achieving this without retraining from scratch each time is tricky, but it could involve conditional training or meta-learning. It ties into the question of how to align AI with plurality of human values rather than a single aggregated preference. OpenAI’s work noted that their labelers’ preferences might not represent all users (openai.com), implying a need to diversify feedback sources or provide customization.
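To illustrate why the offline route described above is attractive, the toy below fits value estimates purely from logged (prompt, response, reward) tuples, with no sampling from a live model. It captures only that flavor of ILQL-style training: the real algorithm learns token-level Q-values with a neural network, and everything here (the data, the learning rate) is made up.

```python
from collections import defaultdict

def fit_values(logged: list[tuple[str, str, float]],
               lr: float = 0.1, epochs: int = 200) -> dict:
    """Running-average value estimates from logged (prompt, response, reward)
    tuples; no queries to the live policy are needed during training."""
    q: dict = defaultdict(float)
    for _ in range(epochs):
        for prompt, response, reward in logged:
            key = (prompt, response)
            q[key] += lr * (reward - q[key])  # nudge estimate toward reward
    return q

# Toy log: response "A" to prompt "Q1" was rewarded, "B" was not.
logged = [("Q1", "A", 1.0), ("Q1", "B", 0.0), ("Q1", "A", 0.8)]
q = fit_values(logged)
# At inference time, prefer the response with the higher learned value.
best = max(["A", "B"], key=lambda resp: q[("Q1", resp)])
```

The key property is that the whole loop runs over a fixed dataset, so the expensive generation step of PPO-style training disappears.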
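A rule-based reward like the one described above can be as simple as a scoring function over hand-written checks. The rules below (refusals should apologize and stay brief) are invented for illustration and are not OpenAI’s actual RBR rule set:

```python
import re

def rule_reward(response: str) -> float:
    """Toy rule-based reward: hand-written checks on a response.
    The rules here are illustrative, not OpenAI's actual RBR set."""
    text = response.lower()
    score = 0.0
    is_refusal = text.startswith(("i can't", "i cannot", "i won't"))
    if is_refusal:
        # Rule: a refusal should contain an apology.
        score += 1.0 if re.search(r"\b(sorry|apologi[sz]e)\b", text) else -1.0
        # Rule: a refusal should be brief, not a lecture.
        score += 0.5 if len(response.split()) <= 40 else -0.5
    return score

def total_reward(rm_score: float, response: str, weight: float = 0.3) -> float:
    # The rule signal supplements, rather than replaces, the learned
    # reward model's score.
    return rm_score + weight * rule_reward(response)
```

Because the rules are just code, updating a policy (say, requiring a pointer to help resources in refusals) means editing a function rather than recollecting human comparison data.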
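For context on how the reward models discussed above are fit in the first place: the common recipe, used by InstructGPT and the LLaMA-2 reward models alike, is a Bradley–Terry-style pairwise ranking loss over human comparisons. A plain-Python sketch with made-up scores:

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry ranking loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward model to score the human-preferred
    response higher than the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Equal scores give log(2); a clear margin drives the loss toward zero.
tie = pairwise_loss(0.0, 0.0)         # ~0.693
confident = pairwise_loss(3.0, -1.0)  # ~0.018
```

A near-zero margin corresponds to an ambiguous comparison, which connects directly to the idea of reward models that can express uncertainty instead of committing to a winner.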
In summary, the future of RLHF is likely to involve more automation, more nuanced objectives, and more integration with other AI safety techniques. RLHF has been a game-changer in making AI assistants useful, but scaling it (both in terms of data and quality) will require creative solutions. Techniques like rule-based rewards, AI-generated feedback, offline RL methods, and continuous learning are at the cutting edge of research (openai.com, huggingface.co). For developers, this means the landscape of “fine-tuning AI with feedback” will keep evolving – expect new libraries and tools that implement these advanced methods as they become validated.

Conclusion: RLHF offers a practical way to align AI systems with human needs by making use of human judgments directly in the training loop. For developers, it provides a powerful lever to fine-tune large models beyond what can be achieved with supervised learning alone: you can define what you want from the model in terms of a reward function and push the model in that direction. We’ve seen how companies like OpenAI, DeepMind, and Meta leveraged RLHF to create more helpful and safer AI, and we’ve discussed how you can try these techniques yourself with modern frameworks. However, it’s important to remain aware of the limitations – an RLHF-tuned model is not infallible or “aligned” in a broad sense; it’s just biased towards the feedback it was given. Ensuring these systems behave well in the real world will likely require combining RLHF with other strategies and a lot more research.
RLHF is an exciting interface between human values and machine learning. As AI developers, being able to integrate human-in-the-loop training is increasingly becoming a key skill. With more open datasets and libraries emerging, even small teams can start to experiment with aligning AI behaviors with human feedback. The lessons learned from RLHF in language models are also beginning to transfer to other domains (like robotics and recommendation systems), wherever human preferences are crucial. By following the progress in this area and contributing feedback of our own, we collectively move towards AI that is not just smart, but also truly user-centered in its behavior.

