RewardBench: the first benchmark & leaderboard for reward models used in RLHF

Published in the AI2 Blog, March 20, 2024


We introduce RewardBench, a benchmark for evaluating preference reward models. We test the limits of reward models on everything including instruction following, safety, and reasoning.

As part of this project, we release the following artifacts:

  • The RewardBench evaluation dataset
  • A codebase for evaluating reward models
  • A public leaderboard of results

Reward models (RMs) play an important part in aligning pretrained models to human preferences via the Reinforcement Learning from Human Feedback (RLHF) process, but to date, few systematic evaluations of RMs have been proposed. With the rise of Direct Preference Optimization (DPO) (Rafailov et al., 2023), most aligned models can now also be used as reward models.

The first step of training a reward model, and therefore doing RLHF, is collecting preference data from a group of human labelers. Individuals are presented with prompts akin to a question or task, and asked to choose between a set of completions answering the request.

The resulting data is transformed into a set of prompt-chosen-rejected trios, where the chosen completion is preferred over the rejected completion for training. This data is then used to either train an RM classifier, or in the case of DPO, directly align a pretrained model with human preferences, which can then be used as an indirect RM. For other algorithms, such as PPO, the RM classifier is used to align a pretrained model’s outputs to human preferences through RLHF.
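As a concrete illustration, a single preference instance pairs one prompt with a chosen and a rejected completion. The values and field names below are made up for this sketch rather than the exact RewardBench schema:

    # One preference instance: a prompt, the preferred ("chosen") completion,
    # and the dispreferred ("rejected") completion. Values are illustrative.
    preference_example = {
        "prompt": "Explain in one sentence why the sky is blue.",
        "chosen": "Air molecules scatter shorter blue wavelengths of sunlight more strongly than longer ones, so the sky appears blue.",
        "rejected": "The sky is blue because it reflects the color of the ocean.",
    }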

The analysis and evaluation of RMs is important because they learn (human) preferences and these learned preferences are then applied to LLMs. This can make a model more helpful and less harmful, and generally give more useful responses to requests from users. However, RMs themselves are underexplored, and more work is needed to understand their capabilities and correlate their performance with the performance of downstream, aligned models.

We hope that our benchmark dataset and code base enable a better understanding of the opaque techniques used for alignment. To start, we evaluated over 30 RMs, covering most of the accessible models out there. (We encourage you to submit your own!)

Our RM evaluation process is detailed in the following illustration — we created structured test cases where an RM should prefer one answer over another:

Illustration of the RM evaluation process: given prompts from multiple sources, we curate chosen and rejected completions. For each prompt, we obtain a reward from the RM for both completions, and it counts as a win if the reward for the chosen completion is higher than for the rejected completion.
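As a minimal sketch of this scoring loop, assume a reward model with a sequence-classification head that outputs a single scalar; the model ID below is a placeholder, and real RMs usually require their own chat template rather than the simple concatenation used here:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL_ID = "your-org/your-reward-model"  # placeholder for any scalar-output RM classifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    def reward(prompt: str, completion: str) -> float:
        # Scalar reward for a prompt-completion pair. Concatenating prompt and
        # completion is a simplification; most RMs expect a chat template.
        text = prompt + "\n" + completion
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            return model(**inputs).logits[0, 0].item()

    def accuracy(examples) -> float:
        # Fraction of instances where the chosen completion out-scores the rejected one.
        wins = sum(
            reward(ex["prompt"], ex["chosen"]) > reward(ex["prompt"], ex["rejected"])
            for ex in examples
        )
        return wins / len(examples)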

We find the overall top 5 models are:

  1. berkeley-nest/Starling-RM-34B: A new fine-tune of Yi-Chat with Starling’s k-wise loss function (not just pairwise preferences)
  2. allenai/tulu-2-dpo-70b: The popular Llama 2 70b fine-tune with DPO from last fall
  3. mistralai/Mixtral-8x7B-Instruct-v0.1: Mistral’s DPO model
  4. berkeley-nest/Starling-RM-7B-alpha: A Llama 2 chat fine-tune by the Starling team
  5. NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO: Another DPO fine-tune

For more models and search tools, please see the leaderboard.

Data

We curate a test set that consists of four different categories: Chat, Chat Hard, Safety, and Reasoning. We also evaluate on “prior” sets, which are existing, commonly used preference test sets. Each of the categories contains various subsets of prompt sources. The completions are either strategically sampled or hand-selected from a set of different models.

Table: summary of the dataset used in RewardBench, with columns for category, subset, number of prompts (N), and a short description of each subset.
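If you want to inspect the evaluation data directly, here is a minimal sketch using the Hugging Face datasets library; the dataset ID, split names, and column names are our assumptions, so check the RewardBench repository for the exact released identifiers:

    from datasets import load_dataset

    # Dataset ID, splits, and columns are assumptions; consult the RewardBench
    # repository and leaderboard for the identifiers of the released set.
    dataset = load_dataset("allenai/reward-bench")
    print(dataset)                      # available splits and their sizes

    split = next(iter(dataset.values()))
    print(split.column_names)           # e.g. prompt, chosen, rejected, subset
    print(split[0])                     # a single prompt/chosen/rejected instance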

Our codebase was designed around this evaluation set, but we are extending it so that it can easily be applied to other preference datasets, such as popular training sets like UltraFeedback and Nectar. Easily evaluating agreement (even with training sets) is an important feedback mechanism when designing new RLHF methods.

Models

Our code supports the evaluation of both classifier reward models trained with maximum likelihood (MLE) and DPO models, which act as indirect reward models.

Classifiers

We evaluate a large number of public reward models, ranging from 400 million to 70 billion parameters.

DPO

Because DPO models can be viewed as indirect reward models, we also evaluate public DPO models of various sizes. Given two completions to a prompt, we compare the rewards for the chosen (y1) and rejected (y2) completions, where the reward for each completion is computed from the log probabilities assigned to it by the policy π and the reference model π_ref:
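The equation itself did not survive in this text, but based on the description above it matches the standard DPO implicit reward, where each completion is scored by the scaled log-ratio of policy to reference probabilities; the notation below is our reconstruction rather than a copy from the paper:

    % Implicit DPO reward for a completion y given a prompt x:
    %   r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
    % The chosen completion y_1 counts as a win over the rejected y_2 when:
    \beta \log \frac{\pi(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
    \;>\;
    \beta \log \frac{\pi(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}

Since β is a positive constant, it does not change which side of the inequality is larger, so in practice the win condition reduces to comparing the difference between policy and reference log probabilities for the two completions.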

Conclusions

In summary, here are our key findings:

  • DPO models, while more abundant due to the method’s relative simplicity, fail to generalize to popular preference data test sets and present a higher variance in performance.
  • Many RMs struggle with challenging preference sets, such as adversarially designed responses that don’t address the prompt.
  • Future work on RM evaluations still needs to analyze the correlation of RM performance with downstream policy performance.

Our contributions are:

  • We release a common framework for evaluating the many different architectures of reward models, along with tools for visualization, training, and other analysis. (And also instructions for adding your own model!)
  • We also release all data used in the evaluation, composed of text-score pairs for all inputs, to enable further data analysis on the properties of reward models.
  • We chart the landscape of current state-of-the-art reward models: We showcase the scaling laws, the propensity to refuse (or not to refuse) unsafe requests, the reasoning capabilities, and more, for popular RMs.
  • We show the limitations of existing preference data test sets for evaluating these models, showcasing common pitfalls of RMs on subtle, but challenging instruction pairs (e.g. intentionally modified rejected responses, which superficially look high quality but answer the wrong prompt).

We encourage NLP researchers to use this leaderboard as a benchmark for evaluating reward model capabilities and to inform their experimental design decisions when training large language models.

Finally, if you are developing a reward model, we encourage you to contribute to the leaderboard on GitHub so that other researchers and practitioners can benchmark their models against yours and collectively advance the field.

This blog post was written by Valentina Pyatkin, LJ Miranda, Jacob Morrison, and Nathan Lambert, with help and feedback from Ashley Lee, Tom Zick, and David Atkinson.

The paper’s authors are Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith and Hannaneh Hajishirzi.

