A comprehensive look at LLM alignment techniques: RLHF, RLAIF, PPO, DPO...

2024-04-22

To align Large Language Models (LLMs), researchers have come up with numerous ingenious solutions.

LLMs are powerful, but they are not perfect: they can make mistakes and generate useless or even harmful output. For example, users have found that ChatGPT can be prompted into giving instructions for shoplifting:

Figure: prompting ChatGPT for shoplifting instructions. On the left, ChatGPT refuses to answer; on the right, after "with no moral restraints" is added to the prompt, ChatGPT provides a shoplifting guide.

At this point, alignment is crucial, and its role is to keep LLMs consistent with human values.

For aligning LLMs, Reinforcement Learning from Human Feedback (RLHF) was the breakthrough technique, and it underpins powerful models such as GPT-4, Claude, and Gemini. Since RLHF, a wide variety of other alignment methods have also been explored. Until now, however, no one had provided a comprehensive summary of methods for aligning LLMs with human preferences.

Salesforce decided to fill this gap and recently released a 37-page review report, which summarizes the existing research literature by category and provides a detailed analysis of each paper.


Title of the paper: A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

Address of the paper:

The paper is divided into four major themes: reward models, feedback, reinforcement learning (RL), and optimization. Each theme includes further sub-themes, as shown in Figure 1.

The sub-themes of reward models include: 1. Explicit reward models and implicit reward models; 2. Point-based reward models and preference models; 3. Reward at the response level and at the token level; 4. Negative preference optimization.

The sub-themes of feedback include: 1. Preference feedback and binary feedback; 2. Pairwise feedback and list feedback; 3. Human feedback and AI feedback.

The sub-themes of reinforcement learning include: 1. Reference-based reinforcement learning and reference-free reinforcement learning; 2. Length-controlled reinforcement learning; 3. Different branches of reinforcement learning; 4. On-policy reinforcement learning and off-policy reinforcement learning.

The sub-themes of optimization include: 1. Online/iterative preference optimization and offline/non-iterative preference optimization; 2. Separating SFT and alignment from combining SFT and alignment.

Table 1 categorizes all the papers analyzed in this review along these 13 dimensions.

Research Papers

This section introduces the individual papers in enough detail that readers can grasp these important innovations without reading the originals. Machine Heart briefly organizes the research directions and lists representative papers.

1. RLHF/PPO

Pre-training Large Language Models (LLMs) requires enormous corpora from many different sources, and the quality of these datasets cannot be guaranteed. Moreover, the main pre-training objective, predicting the next token, is not the same as the goal of "following user instructions helpfully and safely." As a result, LLMs may output content that is untrue, harmful, or useless to users; essentially, these models are not aligned with user intent. The main goal of RLHF/PPO is to align the language model with user intent across a variety of tasks by fine-tuning it with human feedback. There is a large body of research on this topic.

InstructGPT

InstructGPT comes from OpenAI and is the foundation for training models such as ChatGPT and GPT-4. See the "GPT-4 Technical Report" as well as the Machine Heart reports "GPT-4 Shocking Release: Multimodal Large Model, Direct Upgrade of ChatGPT, Bing, Open API, Game Over?" and "Learn from Li Mu the Technology Behind ChatGPT: 67 Minutes to Understand the InstructGPT Paper."

By incorporating human preferences, InstructGPT addresses the problem of evaluating the responses generated by LLMs: traditional metrics such as BLEU, ROUGE, and BERTScore cannot guarantee consistency with human preferences. To solve this, researchers integrate human preferences directly into the LLM to improve its behavior. This process usually involves two main steps: reward-model learning and reinforcement-learning policy training.

In the reward-model learning phase, an explicit pointwise reward function is trained from prompts and pairwise response comparisons.
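Concretely, the reward model is usually trained with a pairwise ranking loss of the Bradley-Terry form: maximize the probability that the preferred response scores higher than the dispreferred one. Below is a minimal PyTorch sketch of that loss; the function and tensor names are illustrative, not code from the surveyed papers.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) ranking loss for reward-model training.

    chosen_rewards / rejected_rewards: scalar scores r(x, y) produced by the
    reward model for the preferred and dispreferred responses to the same
    prompt, shape (batch,).
    """
    # Maximize the log-probability that the chosen response outranks the
    # rejected one: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with random scores standing in for reward-model outputs.
chosen = torch.randn(8)
rejected = torch.randn(8)
loss = reward_model_loss(chosen, rejected)
```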

The reinforcement-learning policy training phase then begins; here, the LLM and the pre-trained reward model act as the agent and the environment, respectively, within a reinforcement learning framework.

To train InstructGPT, three datasets are used: 1. SFT dataset: annotator demonstrations used to train the SFT model. 2. RM (reward model) dataset: human rankings of model outputs, used to train the reward model. 3. PPO dataset: prompts used as inputs for RLHF fine-tuning.
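In the RL phase described above, the reward that PPO actually optimizes is typically the reward-model score penalized by the KL divergence between the current policy and the frozen SFT (reference) policy, which keeps the policy from drifting too far. A minimal sketch under that common setup; the tensor names and the default β value are illustrative.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float = 0.02) -> torch.Tensor:
    """Sequence-level reward used in RLHF-style PPO training.

    rm_score:        reward-model score for the full response, shape (batch,)
    policy_logprobs: per-token log-probs under the current policy, (batch, T)
    ref_logprobs:    per-token log-probs under the frozen SFT policy, (batch, T)
    """
    # Sampled estimate of KL(policy || reference), summed over response tokens.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Penalize deviation from the SFT policy with coefficient beta.
    return rm_score - beta * kl
```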

The trained InstructGPT is evaluated along three axes: helpfulness, truthfulness, and harmlessness.

In human evaluations, "outputs from the 1.3B-parameter InstructGPT model are preferred over outputs from the 175B GPT-3, despite InstructGPT having more than 100 times fewer parameters." Notably, InstructGPT also improves over GPT-3 on helpfulness and toxicity, which is crucial for alignment.

Anthropic's RLHF

Anthropic has also researched the same topic, with the paper titled "Training a helpful and harmless assistant with reinforcement learning from human feedback."

OpenAI found that RLHF is conducive to alignment but may also lead to a decline in model performance on some NLP benchmarks, a phenomenon known as the "alignment tax." The InstructGPT model developed by OpenAI has 1.3B parameters. In contrast, researchers at Anthropic evaluated seven different models ranging in size from 13M to 52B, with the model sizes increasing geometrically by a factor of 4.

They concluded that for smaller models, alignment incurs a "tax," but for larger models, alignment only brings benefits, especially for models with parameter counts between 13B and 52B.

Given this benefit of alignment, they also experimented with augmenting LLM capabilities using programming and technical datasets. OpenAI's RLHF recipe includes PPO and PPO-ptx, where PPO-ptx is designed to reduce the alignment tax on NLP benchmarks. Anthropic's RLHF study found that, as long as the model is large enough, PPO by itself yields alignment benefits on downstream NLP tasks. They also determined the optimal KL-divergence coefficient in RL policy training to be β = 0.001.
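As a rough illustration of the PPO-ptx idea mentioned above, the final loss mixes the PPO objective with a pretraining language-modeling term weighted by a coefficient (γ in InstructGPT). The sketch below only shows how the two terms are combined; `ppo_loss`, `pretrain_nll`, and the default γ are placeholders rather than values taken from the papers.

```python
import torch

def ppo_ptx_loss(ppo_loss: torch.Tensor,
                 pretrain_nll: torch.Tensor,
                 gamma: float = 27.8) -> torch.Tensor:
    """Combine the PPO objective with a pretraining LM term (PPO-ptx style).

    ppo_loss:     standard clipped PPO loss on RLHF prompts
    pretrain_nll: negative log-likelihood on a batch of pretraining text
    gamma:        pretraining-mix coefficient (value here is illustrative)
    """
    # Minimizing this keeps the RLHF gains while pulling the model back toward
    # its pretraining distribution, reducing regressions on NLP benchmarks.
    return ppo_loss + gamma * pretrain_nll
```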

Online / Iterative RLHF

Traditionally, RLHF for aligning LLMs has been an offline method, but offline approaches have drawbacks, such as difficulty handling out-of-distribution data.

This motivates continual fine-tuning of the LLM through iterative or online learning: an intermediate policy generates responses to prompts, an oracle provides preference feedback on these response pairs, and the feedback is fed back into the policy. In practice, iterative learning splits into two parts: preference-oracle learning and iterative policy optimization. See the paper "RLHF workflow: From reward modeling to online RLHF."
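The loop described above can be summarized schematically: sample responses from the current (intermediate) policy, query a preference oracle, and feed the newly labeled pairs back into policy training. The Python sketch below is pseudocode; `policy`, `preference_oracle`, and `update_policy` are hypothetical stand-ins for the components described in the cited paper.

```python
def online_rlhf(policy, prompts, preference_oracle, update_policy, num_rounds=3):
    """Schematic loop for iterative / online RLHF."""
    for round_idx in range(num_rounds):
        new_pairs = []
        for prompt in prompts:
            # 1. The intermediate policy proposes two candidate responses.
            response_a = policy.generate(prompt)
            response_b = policy.generate(prompt)
            # 2. A preference oracle (human or learned) labels the pair.
            chosen, rejected = preference_oracle(prompt, response_a, response_b)
            new_pairs.append((prompt, chosen, rejected))
        # 3. The freshly collected preferences are fed back into policy training
        #    (e.g., with PPO on a refreshed reward model, or with DPO).
        policy = update_policy(policy, new_pairs)
    return policy
```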

2. RLAIF

Obtaining human preference datasets is costly, which motivated Reinforcement Learning from AI Feedback (RLAIF). Moreover, as the capabilities of LLMs continue to improve, the quality of the AI preference data that can be collected improves with them, which in turn improves the alignment of LLMs.

Anthropic's RLAIF

Building on the foundational research of RLHF, Anthropic proposed a new method called RLAIF. See the paper "Constitutional ai: Harmlessness from ai feedback."

This method mainly includes two stages: 1. Supervised learning through Critiques and Revisions, guided by a constitution. 2. RLAIF.

Google's RLAIF

Based on the research results of Anthropic's RLAIF, a Google research team believes that previous studies cannot directly compare the effects of human feedback and AI feedback, and further research is warranted. In the process of collecting AI feedback, it is necessary to create a structured prompt, which includes: introduction, few-shot examples (optional), samples to be labeled, and conclusion.

Generating AI feedback requires a two-step evaluation: first, the four prompt components plus a chain of thought (CoT) are used to elicit a response from the labeler LLM. Then this response is sent back to the LLM with a suffix such as "preferred summary =" to produce preference probabilities, e.g., "summary 1 = 0.6, summary 2 = 0.4". To reduce positional bias, the two candidates are scored in both orders and their scores are averaged.
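The position-debiasing step amounts to scoring the candidate pair in both presentation orders and averaging the resulting preference probabilities. A minimal sketch, where `llm_preference_probs` is a hypothetical helper that returns the probabilities the labeler LLM assigns after a prompt ending in "preferred summary =".

```python
def debiased_preference(prompt: str, summary_1: str, summary_2: str,
                        llm_preference_probs) -> tuple[float, float]:
    """Average AI-feedback scores over both presentation orders.

    llm_preference_probs(prompt, first, second) is assumed to return
    (p_first, p_second), the preference probabilities the labeler LLM assigns
    to the first and second candidate it was shown.
    """
    # Order A: summary 1 shown first.
    p1_a, p2_a = llm_preference_probs(prompt, summary_1, summary_2)
    # Order B: summary 2 shown first; map the scores back to the original labels.
    p2_b, p1_b = llm_preference_probs(prompt, summary_2, summary_1)
    # Averaging over both orders cancels out positional bias in the labeler.
    return (p1_a + p1_b) / 2, (p2_a + p2_b) / 2
```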

The RLAIF process uses two strategies: 1. "Distilled RLAIF," which follows the traditional RLHF method, that is, using preferences to train a reward model, and then using it to train the LLM strategy; 2. "Direct RLAIF," which directly uses LLM feedback as a prompt to output evaluation scores, and then uses this score as a signal for training the reinforcement learning strategy.Finally, its evaluation process will utilize three key metrics: 1. AI - Annotator Alignment: The degree of consistency between AI and human annotators. 2. Win Rate: The likelihood of human annotators comparing two candidates and choosing one of them. 3. Harmlessness Rate: The proportion of responses deemed harmless by human evaluators.

For more details, please refer to the paper "RLAIF: Scaling reinforcement learning from human feedback with AI feedback."

Direct Human Preference Optimization

Traditional RLHF methods often involve optimizing a reward function derived from human preferences. While effective, this approach can also present some challenges, such as increasing computational complexity and the need to consider bias-variance trade-offs when estimating and optimizing rewards. See the paper "High-dimensional continuous control using generalized advantage estimation."
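For context, the cited paper introduces Generalized Advantage Estimation (GAE), which trades bias against variance through the parameter λ. A minimal sketch of the standard backward recursion, assuming per-step rewards and value estimates are already available; variable names are illustrative.

```python
import numpy as np

def gae_advantages(rewards: np.ndarray, values: np.ndarray,
                   gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized Advantage Estimation.

    rewards: per-step rewards r_t, shape (T,)
    values:  value estimates V(s_t), shape (T + 1,) including a bootstrap value
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```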

Recently, researchers have explored methods that directly optimize the LLM policy on human preference data, without relying on a scalar reward signal.

The goal of these methods is to simplify the alignment process, reduce computational costs, and achieve more robust optimization by using preference data more directly. By framing the problem as a preference optimization issue, rather than a reward estimation and maximization issue, these methods offer a different perspective on aligning language models with human judgments:

SliC-HF, sequence likelihood calibration with human feedback, see the paper "SliC-HF: Sequence likelihood calibration with human feedback."

RSO, Rejection Sampling Optimization, see the paper "Statistical rejection sampling improves preference optimization."

DPO, Direct Preference Optimization, see the paper "Direct preference optimization: Your language model is secretly a reward model"; a minimal sketch of the DPO loss appears after this list.

DPOP, DPO-positive, see the paper "Smaug: Fixing failure modes of preference optimization with DPO-positive."

β-DPO, DPO with a dynamic β, refer to the paper "Direct Preference Optimization with Dynamic Beta."

Identity Preference Optimization (IPO), refer to the paper "A general theoretical paradigm to understand learning from human preferences".

Stepwise DPO (sDPO), refer to the paper "sDPO: Don't use your data all at once".

Generalized Preference Optimization (GPO), refer to the paper "Generalized preference optimization: A unified approach to offline alignment".
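To make the "optimize the policy directly on preferences" idea concrete, the sketch below implements the standard DPO loss from the DPO paper listed above, assuming summed per-response log-probabilities under the trainable policy and a frozen reference model; tensor names and the default β are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model, shape (batch,).
    """
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```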

Token-level DPO

In DPO, the reward is assigned to the prompt and the response as a whole; in an MDP formulation, by contrast, a reward is assigned to each individual action. The following two papers develop DPO at the token level and extend its application to token-level analysis.

Research on token-level credit assignment with DPO can be found in the paper "From r to Q*: Your language model is secretly a Q-function", and the report "This is the mysterious Q* of OpenAI? Stanford: Language models are Q functions".

Token-level Direct Preference Optimization (TDPO), refer to the paper "Token-level direct preference optimization".

Iterative / Online DPO

When using DPO, all available preference datasets are used to align the Large Language Model (LLM). To continuously improve the LLM, iterative or online DPO should be implemented. This raises an interesting question: how to efficiently collect new preference datasets. The following two papers delve into this topic.

Self-Rewarding Language Models, refer to the paper "Self-rewarding Language Models."

CRINGE, refer to the paper "The Cringe Loss: Learning What Language Not to Model."

Binary Feedback

It turns out that collecting preference feedback is harder than collecting binary feedback such as likes and dislikes, so binary feedback can make it easier to scale up the alignment process. The KTO and DRO studies focus on using binary feedback to align Large Language Models (LLMs); a sketch of how preference pairs reduce to binary labels follows the list below.

KTO, Kahneman-Tversky Optimization, refer to the paper "KTO: Model Alignment as Prospect Theoretic Optimization."

DRO, Direct Reward Optimization, refer to the paper "Offline Regularised Reinforcement Learning for Large Language Models Alignment."
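As a concrete illustration of the binary-feedback format consumed by methods like KTO and DRO, the sketch below converts a pairwise preference dataset into independent "like" / "dislike" examples, as described above; the field names are illustrative assumptions.

```python
def preferences_to_binary(preference_data):
    """Turn pairwise preferences into binary (thumbs up / down) examples.

    preference_data: iterable of (prompt, chosen_response, rejected_response)
    returns: list of dicts, each with a single response and a binary label.
    """
    binary_examples = []
    for prompt, chosen, rejected in preference_data:
        # The preferred response becomes a positive ("like") example ...
        binary_examples.append({"prompt": prompt, "response": chosen, "label": 1})
        # ... and the dispreferred response a negative ("dislike") example.
        binary_examples.append({"prompt": prompt, "response": rejected, "label": 0})
    return binary_examples
```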

Integration of SFT and Alignment

Previous studies mostly performed Supervised Fine-Tuning (SFT) and alignment sequentially, but this has proven labor-intensive and can lead to catastrophic forgetting. Subsequent research takes two directions: one integrates the two processes into a single step; the other fine-tunes two models in parallel and then merges them.

ORPO, Odds Ratio Preference Optimization, refer to the paper "ORPO: Monolithic Preference Optimization without Reference Model."

PAFT, Parallel Fine-Tuning, refer to the paper "PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning."

Length-Controlled DPO and Reference-Free DPO

Previous studies have indicated that the outputs of Large Language Models (LLMs) tend to be excessively verbose. To address this issue, the focus of R-DPO and SimPO is to control the length of the response without compromising the generation performance.

Moreover, DPO requires a reference policy to ensure that the aligned model does not deviate significantly from it. In contrast, SimPO and RLOO propose ways to eliminate the reference model without hurting LLM performance; a sketch of SimPO's length-normalized reward appears after the list below.

R-DPO, Regularized Direct Preference Optimization, see the paper "Disentangling length from quality in direct preference optimization."

SimPO, Simple Preference Optimization, see the paper "SimPO: Simple preference optimization with a reference-free reward," and the report "Surpassing DPO comprehensively: Chen Danqi's team proposes Simple Preference Optimization SimPO and also refines the strongest 8B open-source model."

RLOO, REINFORCE Leave-One-Out, see the paper "Back to basics: Revisiting reinforce style optimization for learning from human feedback in LLMs."
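As one concrete example of a length-controlled, reference-free objective, the sketch below follows the common reading of SimPO: the implicit reward is a length-normalized average log-probability, and the loss enforces a target margin between the chosen and rejected responses. Hyperparameter defaults and tensor names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor, chosen_lens: torch.Tensor,
               rejected_logps: torch.Tensor, rejected_lens: torch.Tensor,
               beta: float = 2.0, gamma: float = 0.5) -> torch.Tensor:
    """SimPO-style length-normalized, reference-free preference loss.

    chosen_logps / rejected_logps: summed log-probs of each response, (batch,)
    chosen_lens / rejected_lens:   response lengths in tokens, (batch,)
    """
    # Implicit reward = average per-token log-probability, scaled by beta;
    # normalizing by length removes the incentive for verbose responses.
    reward_chosen = beta * chosen_logps / chosen_lens
    reward_rejected = beta * rejected_logps / rejected_lens
    # Require the chosen response to win by at least a margin gamma.
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()
```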

Listwise Preference Optimization

Previous work on PPO and DPO focused on pairwise preferences, while RLHF work collected listwise preferences to speed up data collection and then converted them into pairwise preferences. Nevertheless, to improve LLM performance, it is feasible to perform preference optimization directly on a listwise dataset. The following three papers discuss this approach.

LiPO, Listwise Preference Optimization, see the paper "LIPO: Listwise preference optimization through learning-to-rank."

RRHF, see the paper "RRHF: Rank responses to align language models with human feedback without tears."

PRO, Preference Ranking Optimization, refer to the paper "Preference ranking optimization for human alignment."

Negative Preference Optimization

These studies share a common premise: the current generation of Large Language Models (LLMs) already surpasses human performance on tasks such as translation and summarization. The output of an LLM can therefore be treated as the desired response, without relying on human-annotated data for the preferred responses, which has clear benefits. Conversely, undesired responses can still be used to align LLMs, a process known as negative preference optimization (NPO).

NN, Negating Negatives Method, refer to the paper "Negating negatives: Alignment without human positive samples via distributional dispreference optimization."

NPO, Negative Preference Optimization, refer to the paper "Negative preference optimization: From catastrophic collapse to effective unlearning."

CPO, Contrastive Preference Optimization, refer to the paper "Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation."

Nash Learning

Previous studies typically used pointwise rewards and Bradley-Terry (BT) models to obtain pairwise preferences. However, this approach is not as effective as modeling pairwise preferences directly and cannot resolve inconsistencies among pairwise preferences. To overcome these limitations, some studies have proposed Nash learning methods.

Nash Learning from Human Feedback, refer to the paper "Nash learning from human feedback."

SPPO, Self-Play Preference Optimization, refer to the paper "A minimaximalist approach to reinforcement learning from human feedback."

Direct Nash Optimization (DNO), refer to the paper "Direct Nash Optimization: Teaching Language Models to Self-improve with General Preferences."

Comparison of Different Methods

Some studies are aimed at comparing these different methods. Such studies can elucidate the strengths and weaknesses of each method.

Evaluating DPO and its Variants

The paper "Insights into alignment: Evaluating DPO and its variants across multiple tasks" comprehensively evaluates implicit reward models, that is, reinforcement learning algorithm-free approaches, including DPO, KTO, IPO, and CPO, across various tasks such as reasoning, mathematical problem-solving, credibility, question-answering, and multi-task understanding. These evaluations involve three different scenarios: 1) fine-tuning supervised fine-tuning (SFT) models, 2) fine-tuning pre-trained models, 3) fine-tuning instruction models.

The study found that KTO outperforms other alignment methods on most benchmarks. In addition, the study indicates that alignment does not significantly improve the model's reasoning and question-answering performance, but it does greatly enhance the model's mathematical problem-solving ability. The study also noted the importance of data volume, with alignment methods performing best on smaller data subsets. Furthermore, the study found that KTO and CPO can effectively bypass the SFT stage, entering the alignment stage directly without affecting performance. In contrast, when bypassing the SFT stage and entering the alignment stage directly, DPO and IPO exhibit a significant performance decline.

Is DPO a Better LLM Alignment Method than PPO?

The paper "Is DPO superior to PPO for LLM alignment? A comprehensive study" suggests that DPO may have inherent limitations, may produce biased answers, and may suffer performance degradation due to distribution shifts.

They found that policies trained by DPO tend to favor unseen responses, especially out-of-distribution samples. Iterative/online DPO can mitigate this issue by extensively exploring the response space and continuously updating the reference model. In contrast, RLHF/PPO addresses these challenges through advantage normalization, large batch sizes, and the use of exponential moving averages for the reference model. Ultimately, these findings indicate that PPO is superior to iterative/online DPO, which is further superior to standard DPO.

For more details, refer to the column article on Machine Heart "ICML 2024 Oral | Is DPO More Suitable for LLM than PPO? The Latest Revelations from Tsinghua University's Wu Yi Team."

Future Directions

By analyzing past papers, the team has identified several research questions that warrant further exploration.

General Tasks for Alignment Assessment

Different papers use different tasks to evaluate these methods. However, some tasks, such as GSM8K, focus on reasoning and may not be well suited to assessing alignment. Tasks such as TruthfulQA, or benchmarks focused on toxicity, should instead be prioritized for evaluating the truthfulness and toxicity of fine-tuned large language models (LLMs). Efforts should be made to combine these tasks into a unified leaderboard for assessing alignment.

Using Implicit Reward Models, Listwise Preferences, and Nash Learning for Larger Language Models

Currently, the largest model using implicit reward models has only 70 billion parameters. If these methods can be extended to larger models, such as those the size of GPT-4 and Claude-3, it should help us better understand their relative effectiveness compared to RLHF/PPO.

Similarly, listwise preference models are worth further investigation. With RLHF, listwise preferences are collected to build the preference dataset and are then converted into multiple pairs of pairwise preference data. The potential issues of applying listwise preference models at scale still need to be addressed.

Lastly, Nash learning can address inconsistencies among human annotators. If Nash learning models can be integrated into larger LLMs, it could demonstrate their ability to capture the complexity of human nature.

Experiments on Binary Feedback

Both KTO and DRO use binary feedback mechanisms such as "likes" and "dislikes" rather than pairwise preferences. These binary labels are currently derived from preference datasets, with the desired responses marked as positive examples and the undesired responses as negative examples. Further research on real-world binary datasets is needed. Moreover, binary datasets are easier to collect than preference data, so larger binary feedback datasets could be used for alignment. However, the noise in binary feedback may be more pronounced than the noise in preference datasets, so how to effectively filter out noisy data is also a very interesting research direction.

Experiments on Helpful AI Feedback

Current AI feedback mainly consists of harmlessness feedback in RLAIF and feedback ranking in iterative DPO. When using RLAIF, however, helpfulness feedback is still provided by human annotators. This is reasonable, because generating helpful responses is much harder than identifying harmful content. An interesting future research direction is to use LLMs to generate helpfulness feedback, allowing LLMs to self-improve.

Accelerating Nash Learning

Nash learning methods can effectively model pairwise preferences and resolve inconsistencies between human annotators. However, they require multiple iterations to converge to the optimal policy. Although the authors do not state the time required for alignment, it is presumably much slower than implicit reward model methods such as DPO. Accelerating the Nash learning process is therefore also a research direction worth attention.

Termination of Iterative / Online Learning

When using iterative / online training, determining the time to terminate the iteration is crucial. Previous studies have found that iterative learning sometimes reduces the performance of LLMs on certain tasks, which may be a sign of overfitting. However, no researchers have yet explored how to determine a reasonable epoch for terminating iteration.

Simplifying SFT + Alignment

Current methods usually implement SFT and alignment sequentially, which often leads to catastrophic forgetting and makes the whole training process more laborious. The PAFT method mitigates catastrophic forgetting by fine-tuning SFT and alignment separately in parallel and then merging the resulting models, at the cost of added complexity. ORPO, in contrast, integrates both processes in a single pass, but this leads to a drop in performance. How to combine SFT and alignment effectively, achieving high performance while remaining efficient, is still an open challenge.

For more details, please refer to the original paper.
