The Unreasonable Effectiveness of Easy Training Data

Peter Hase · Published in AI2 Blog · 3 min read · Jan 16, 2024


We typically train AI systems to answer questions in specific domains (like STEM) by finetuning a model on example questions and answers. But what happens when it’s hard to collect examples to train the model on, because only experts can answer the questions or the questions are so hard that experts often get them wrong?

In a new paper, we present results showing that language models can perform well on hard, domain-specific questions when trained only on easy questions. Below, we see that a language model’s exam scores on college-level STEM questions are almost as good when it’s trained on 3rd grade questions as when it’s trained on college questions!

A model trained on easy data (e.g., 3rd Grade problems) does almost as well on college test problems as a model trained on college problems (Mixtral-8x7B prompted with k = 10 examples). Random accuracy is 25%.

In fact, we do equally well on college STEM questions whether we train the model on college-level or high-school-level questions. (Here, “training” can mean in-context learning, fitting a linear classifier head, or finetuning with QLoRA.)
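
To make the in-context learning variant concrete, here is a minimal sketch of how a k = 10 prompt could be assembled from easy demonstrations and scored on a hard multiple-choice question, using the Hugging Face transformers API. The dataset field names and the answer-letter scoring are illustrative assumptions, not the paper’s exact evaluation harness.

```python
# A minimal sketch of k-shot in-context learning for easy-to-hard generalization.
# Dataset field names (question/choices/answer) and the answer-letter scoring are
# illustrative assumptions, not the paper's exact evaluation harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def format_example(q, include_answer=True):
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", q["choices"]))
    prompt = f"Question: {q['question']}\n{options}\nAnswer:"
    if include_answer:
        prompt += f" {q['answer']}\n\n"
    return prompt

def build_prompt(easy_demos, hard_question, k=10):
    # k easy (e.g., 3rd grade) demonstrations followed by one hard (college) question.
    demos = "".join(format_example(q) for q in easy_demos[:k])
    return demos + format_example(hard_question, include_answer=False)

@torch.no_grad()
def predict_letter(model, tokenizer, prompt):
    # Compare next-token logits for " A" through " D" and return the best-scoring choice.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]
    letter_ids = [tokenizer(f" {c}", add_special_tokens=False).input_ids[-1] for c in "ABCD"]
    return "ABCD"[int(torch.stack([logits[i] for i in letter_ids]).argmax())]

# Hypothetical usage (any causal LM on the Hub works; Mixtral shown for illustration):
# tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", device_map="auto")
# prompt = build_prompt(easy_train_questions, hard_test_question, k=10)
# print(predict_letter(model, tokenizer, prompt))
```

The linear classifier head and QLoRA variants use the same easy-vs-hard split of the training data; only the “training” mechanism changes.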

Why would this matter in the real world? Because gathering data in specialized domains like medicine and law is expensive, and even experts can give noisy answers to hard questions. If we can perform well on hard questions by training on easier data, then we might be able to save a lot of time and effort while still producing reliable models that are useful for people. This problem has been termed the scalable oversight problem, to describe situations where it is difficult to properly train (oversee) models to accurately answer questions in domains of increasing (scaling) complexity.

Our findings imply that easy training data can be better than hard training data in practice, since hard data is generally noisier and costlier to collect:

Easy data can be better training data than hard data when hard data labels are noisier. When both training sets are cleanly labeled, training on hard data beats training on easy data; but once there is enough label noise in both sets (assuming the hard data is twice as noisy as the easy data), training on easy data wins. Results shown for college-level STEM questions from MMLU, using a linear classifier head on Llama-2-70b fit to 160 points.
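
To make the noise comparison concrete, here is an illustrative simulation of the setup: a linear classifier head (here, a logistic-regression probe over frozen hidden-state features) is fit on easy or hard training data with some fraction of labels flipped, then evaluated on hard test data. The get_hidden_features helper and the specific noise rates are hypothetical placeholders, not the paper’s protocol.

```python
# Illustrative simulation of training a linear classifier head on noisy labels.
# get_hidden_features() is a hypothetical stand-in for extracting frozen
# Llama-2-70b representations; the noise rates below are examples, not the paper's.
import numpy as np
from sklearn.linear_model import LogisticRegression

def flip_labels(y, noise_rate, n_classes=4, seed=0):
    # Corrupt a fraction of labels uniformly at random to simulate annotator noise.
    rng = np.random.default_rng(seed)
    y = np.array(y).copy()
    flip = rng.random(len(y)) < noise_rate
    y[flip] = rng.integers(0, n_classes, size=flip.sum())
    return y

def probe_accuracy(X_train, y_train, X_test, y_test, noise_rate):
    # Fit the linear head on (possibly noisy) labels and score on clean hard test data.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, flip_labels(y_train, noise_rate))
    return clf.score(X_test, y_test)

# Hypothetical usage, with hard labels assumed twice as noisy as easy labels:
# X_easy, y_easy = get_hidden_features(easy_train)   # e.g., 160 training points
# X_hard, y_hard = get_hidden_features(hard_train)   # e.g., 160 training points
# X_test, y_test = get_hidden_features(hard_test)
# acc_from_easy = probe_accuracy(X_easy, y_easy, X_test, y_test, noise_rate=0.2)
# acc_from_hard = probe_accuracy(X_hard, y_hard, X_test, y_test, noise_rate=0.4)
```

In this sketch, the probe trained on easy data simply sees cleaner labels, which is the trade-off the figure above illustrates.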

In the paper, we demonstrate the same conclusion when hard data is costlier to collect than easy data: you can do better on hard test data by training on easy data! These are common scenarios, too. Some kinds of “hard” data can easily cost twice as much to collect and be twice as noisy as “easy” data in the same domain.

Will these results hold up as language models continue to improve? Interestingly, we observe good easy-to-hard generalization across model sizes between 7b and 70b parameters:

Easy-to-hard generalization is very similar across model sizes: easy (high school) training data is as good as hard (college) training data for performance on college-level STEM questions, whether the model has 7 billion, 13 billion, or 70 billion parameters. We use ICL with k = 10 examples here.

This kind of scaling result is important because it suggests that as models become more capable over time, easy-to-hard generalization will continue to be about as good as hard-to-hard generalization in settings like these. This means that we can do well on hard test questions without having to label hard training questions!

Conclusion. It appears that easy-to-hard generalization in LMs is often surprisingly strong, suggesting that the scalable oversight problem may be easier than previously thought. That said, we encourage future research into easy-to-hard generalization, to help models perform well on even harder tasks.

Our experiments use open models up to 70b in size and four publicly available question-answering datasets with questions ranging in difficulty from 3rd grade science questions to college-level STEM questions and general-knowledge trivia. Check out our public code here!
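
If you want to reproduce the general setup, the easy and hard ends of the difficulty range are available on the Hugging Face Hub. The snippet below loads grade-school science questions (ARC) and college-level STEM questions (MMLU) as one possible starting point; these dataset IDs and configs are the public Hub copies, chosen for illustration, and may not match the paper’s exact datasets, splits, or preprocessing.

```python
# One way to pull easy and hard ends of the difficulty range from the Hugging Face
# Hub. Dataset IDs/configs are public Hub copies chosen for illustration and may
# not match the paper's exact datasets, splits, or preprocessing.
from datasets import load_dataset

easy = load_dataset("allenai/ai2_arc", "ARC-Easy", split="train")   # grade-school science
hard = load_dataset("cais/mmlu", "college_physics", split="test")   # college-level STEM

print(easy[0]["question"], easy[0]["choices"]["text"], easy[0]["answerKey"])
print(hard[0]["question"], hard[0]["choices"], hard[0]["answer"])
```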

And see the paper here!

Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
