OLMo 1.7–7B: A 24 point improvement on MMLU

AI2 · Published in AI2 Blog · Apr 17, 2024

Today, we’ve released an updated version of our 7 billion parameter Open Language Model, OLMo 1.7–7B. This model scores 52 on MMLU, sitting above Llama 2–7B and approaching Llama 2–13B, and outperforms Llama 2–13B on GSM8K (see below).

OLMo 1.7–7B, created on the path towards our upcoming 70 billion parameter model, features a longer context length, extended from 2048 to 4096 tokens. It achieves higher benchmark performance thanks to a combination of improved data quality, a new two-stage training procedure, and architectural improvements. The model is available on Hugging Face under the Apache 2.0 license. The training data, Dolma 1.7, is licensed under ODC-BY, as recently announced.

Since the release of OLMo 1.0, we’ve been focusing on improving a few key evaluation metrics, such as MMLU. Below is a plot showing the approximate compute used to train several language models with open weights, calculated as the product of model parameter count and training dataset size in tokens. OLMo 1.7–7B delivers substantially better performance per unit of training compute than peers such as the Llama suite and the Mosaic Pretrained Transformer (MPT).

This graph plots MMLU score against the approximate compute used for training (x-axis): OLMo 1.7–7B’s score of 52 outperforms Llama 2–7B and approaches Llama 2–13B.
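To make the x-axis concrete, here is a back-of-the-envelope version of that compute proxy in Python. The OLMo 1.7–7B token count follows from the training recipe described later in this post (2T first-stage tokens plus 50B second-stage tokens); the Llama 2 entries use their publicly reported ~2T training tokens and should be treated as approximations.

```python
# Rough sketch of the compute proxy used for the plot: parameters × training tokens.
# Peer-model token counts are approximate, publicly reported figures.

def compute_proxy(params: float, tokens: float) -> float:
    """Parameter count times training tokens, a crude stand-in for training FLOPs."""
    return params * tokens

models = {
    "OLMo 1.7-7B": compute_proxy(7e9, 2.05e12),  # 2T (stage 1) + 50B (stage 2)
    "Llama 2-7B": compute_proxy(7e9, 2.0e12),    # ~2T reported training tokens
    "Llama 2-13B": compute_proxy(13e9, 2.0e12),  # ~2T reported training tokens
}

for name, proxy in models.items():
    print(f"{name}: {proxy:.2e} parameter-tokens")
```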

We compare this OLMo model to its peer models with known training dataset sizes.

Table comparing OLMo 1.7–7B against six state-of-the-art 7B and 13B models. Across 12 tasks, OLMo 1.7–7B achieves an average of 63.8, outperforming MPT 7B (59.3), OLMo 1–7B (59.8), Llama 7B (60.3), StableLM 7B (60.4), and Llama 2–7B (62.1), and approaching Llama 2–13B (66.2). Notably, it achieves 52.0 on MMLU (5-shot MC), compared to 28.3 for OLMo 1–7B.

Dolma 1.7: New data for OLMo

Alongside OLMo 1.7–7B, we are releasing an updated version of our dataset, Dolma 1.7 — in which we focused on (a) exploring more diverse sources of data, and (b) more precise filtering of web sources. The sources are included below, sorted by number of tokens.

Table with the data composition for the first stage of pretraining. OLMo 1.7–7B is trained on a 1.7-trillion-token sample of the 2.3 trillion tokens in Dolma 1.7. Dolma is derived from 15 sources (token counts in billions): Common Crawl (1,195), RefinedWeb (456), StarCoder (264), C4 (138), Reddit (80), Semantic Scholar (57), arXiv (28), Stack Exchange (20), Flan (16), News subset of Common Crawl (14), OpenWebMath (13), Algebraic Stack (13), Project Gutenberg books (5), MegaWika (5), and Wikipedia/Wikibooks (4).

*The full Dolma 1.7 collection is 2.3 trillion tokens summing across all sources (see Tokens column). When pretraining, we also need to determine up/downsampling of specific sources to produce our final data mixture; we indicate our preferred mixture in the Sample Proportion column, which yields 1.7 trillion tokens.
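As a rough illustration of how the Sample Proportion column produces the final mixture, the sketch below applies per-source sampling rates to the token counts from the table above. The rates shown are placeholders chosen only to land near 1.7 trillion tokens; they are not the actual Dolma 1.7 proportions.

```python
# Illustrative only: the sampling rates below are placeholders, not the real
# Sample Proportion column. Token counts (billions) come from the table above.

SOURCE_TOKENS_B = {
    "common_crawl": 1195, "refined_web": 456, "starcoder": 264, "c4": 138,
    "reddit": 80, "semantic_scholar": 57, "arxiv": 28, "stack_exchange": 20,
    "flan": 16, "cc_news": 14, "open_web_math": 13, "algebraic_stack": 13,
    "gutenberg": 5, "megawika": 5, "wikipedia_wikibooks": 4,
}

# Hypothetical rates: keep every source in full except a downsampled Common Crawl.
SAMPLE_RATE = {source: 1.0 for source in SOURCE_TOKENS_B}
SAMPLE_RATE["common_crawl"] = 0.5  # a rate > 1.0 would mean upsampling (repeating) a source

def mixture_size_b(tokens_b: dict, rates: dict) -> float:
    """Total tokens (in billions) after applying per-source up/downsampling."""
    return sum(tokens_b[s] * rates[s] for s in tokens_b)

print(f"mixture size: ~{mixture_size_b(SOURCE_TOKENS_B, SAMPLE_RATE):.0f}B tokens")
```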

The original OLMo 7B released in February of this year was trained on Dolma 1.5. The main differences between Dolma v1.5 and v1.7 are as follows:

  • More sources: Dolma 1.5 mostly contains web data. In Dolma 1.7, we focused on diversifying the sources in the pretraining corpus. 10.4% of the new version of the dataset consists of content specifically sourced to improve model performance on tasks requiring specialized knowledge (e.g. arXiv, Stack Exchange) and complex reasoning (e.g. OpenWebMath, Flan). Further, 15.4% of the dataset is code data from StarCoder, replacing the code collection used in Dolma 1.5.
  • Better deduplication: Using the current Dolma pipeline, which already performs exact URL and content deduplication with Bloom filters, we performed another round of document-level deduplication, targeting excessively short documents as well as documents with a high occurrence of repeated n-grams. This version of the dataset was released as Dolma 1.6.
    For Dolma 1.7, we further apply fuzzy deduplication (see the sketch after this list). We remove whole documents whose document-level duplication score exceeds a threshold α, calculated as the length-normalized average of the paragraph-level duplication scores. Paragraph-level duplication scores are calculated as the fraction of n-grams (n=13) that are repeated across Dolma CC, RefinedWeb, and C4. Finally, after this document-level fuzzy deduplication, we additionally remove any paragraph whose paragraph-level score exceeds a threshold β.
    We tuned these filters through extensive manual validation of thresholds; ultimately, we set α=0.3 and β=0.8. The filter removes 48% of tokens in Dolma CC, 10% in C4, and 12% in RefinedWeb.
  • Quality filtering: We filter documents using a FastText classifier trained to distinguish high-quality text (that is, well-formatted text covering the range of useful domains LMs are trained on) from low-quality text; an illustrative sketch of such a classifier appears below.
    High-quality subset: Wikipedia, web pages cited in Wikipedia (through MegaWika), Small Web RSS feeds (through Kagi), OpenHermes 2.5, Semantic Scholar, Project Gutenberg, OpenWebMath.
    Low-quality subset: a random sample of raw CommonCrawl data, plus adult entertainment and fake news websites (URLs from the StevenBlack/hosts project).

In total, the classifier is trained on about 25GB of uncompressed text data.
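To make the fuzzy deduplication step above concrete, here is a minimal sketch of the two-threshold scoring. It assumes paragraphs are already tokenized and uses a plain set, `seen_ngrams`, as a stand-in for the corpus-wide repeated-n-gram lookup (a Bloom filter in the actual Dolma pipeline); it is an illustration, not the production implementation.

```python
# Sketch of document- and paragraph-level fuzzy deduplication as described above.
# `seen_ngrams` stands in for the repeated-n-gram lookup built over Dolma CC,
# RefinedWeb, and C4 (a Bloom filter in practice); its construction is omitted.

N = 13        # n-gram size used for duplication scoring
ALPHA = 0.3   # document-level removal threshold
BETA = 0.8    # paragraph-level removal threshold

def ngrams(tokens: list[str], n: int = N) -> list[tuple]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def paragraph_score(tokens: list[str], seen_ngrams: set) -> float:
    """Fraction of the paragraph's 13-grams that are repeated elsewhere in the corpus."""
    grams = ngrams(tokens)
    if not grams:
        return 0.0
    return sum(g in seen_ngrams for g in grams) / len(grams)

def document_score(paragraphs: list[list[str]], seen_ngrams: set) -> float:
    """Length-normalized average of the paragraph-level duplication scores."""
    total_length = sum(len(p) for p in paragraphs) or 1
    return sum(len(p) * paragraph_score(p, seen_ngrams) for p in paragraphs) / total_length

def fuzzy_dedup(paragraphs: list[list[str]], seen_ngrams: set):
    """Drop the whole document above ALPHA; otherwise drop paragraphs above BETA."""
    if document_score(paragraphs, seen_ngrams) > ALPHA:
        return None  # remove the entire document
    return [p for p in paragraphs if paragraph_score(p, seen_ngrams) <= BETA]
```

Read this way, α=0.3 means a document is dropped once about 30% of its length-weighted 13-grams are duplicates, and β=0.8 means surviving documents lose any paragraph that is more than 80% duplicated.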

Dolma 1.7 is available today on the Hugging Face Hub. We will openly release all tools used in curating Dolma 1.7, such as the model-based quality filter, in the coming days.
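In the meantime, here is a minimal sketch of how a fastText quality classifier of this kind can be trained and applied. The file path, label names, hyperparameters, and decision threshold are illustrative assumptions, not the settings used for Dolma 1.7.

```python
# Illustrative fastText quality classifier: documents from the high-quality sources
# are labeled positive, random raw CommonCrawl and blocklisted sites negative.
# Labels, hyperparameters, and paths here are assumptions, not AI2's settings.
import fasttext

# train.txt holds one document per line, prefixed with its label, e.g.:
#   __label__hq Wikipedia-style prose about a well-covered topic ...
#   __label__lq spammy or boilerplate web text ...
model = fasttext.train_supervised(
    input="train.txt",  # ~25GB of uncompressed labeled text in the real setup
    lr=0.1,
    epoch=3,
    wordNgrams=2,       # use bigram features in addition to unigrams
)

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Keep a document if the classifier scores it as high quality."""
    labels, probs = model.predict(text.replace("\n", " "))  # fastText rejects newlines
    return labels[0] == "__label__hq" and probs[0] >= threshold

model.save_model("quality_classifier.bin")
```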

Staged training data and learning rate

In contrast to OLMo 1.0, we trained OLMo 1.7 with a two-stage curriculum:

  • In the first stage, we train the model from scratch on the Dolma 1.7 dataset. We set a cosine learning rate schedule with a warmup of 2500 steps, a peak learning rate of 3e-4, and a cosine decay to 3e-5 after 3T tokens. We cut off this stage after 2T tokens, when the learning rate is still high.
  • At this point we switch to the second stage, in which we train on a curated subset of Dolma 1.7 for another 50B tokens while linearly decaying the learning rate to 0. We curate this high-quality subset by (1) using all available Wikipedia, OpenWebMath, and Flan data, (2) removing Dolma CC, CC News, and MegaWika, and (3) rebalancing the remaining sources to achieve approximately equal proportions of each. The exact token counts and relative proportions of this second-stage mix are shown below, followed by a sketch of the full learning-rate schedule.
Table showing the composition of data used during the second training stage. The model is annealed on 50 billion tokens from the following datasets: Flan (16%), OpenWebMath (12.4%), Stack Exchange (11.2%), StarCoder (11.2%), Dolma Reddit (9.2%), RefinedWeb (8.8%), C4 (8.8%), Algebraic Stack (7%), Semantic Scholar papers (6.6%), Project Gutenberg Books (5.2%), and Wikipedia & Wikibooks data (3.6%).
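The sketch below expresses the full two-stage schedule as a function of tokens seen. The conversion of the 2,500-step warmup into tokens (and the use of tokens rather than optimizer steps throughout) is a simplifying assumption for illustration, not the trainer's actual bookkeeping.

```python
# Sketch of the two-stage learning-rate schedule: cosine decay (peak 3e-4, floor
# 3e-5 over a 3T-token horizon) cut off at 2T tokens, then a linear decay to 0
# over the 50B-token curated second stage. The token-based warmup is an assumption.
import math

PEAK_LR = 3e-4
FLOOR_LR = 3e-5            # cosine floor, only reached at the 3T-token horizon
COSINE_HORIZON = 3e12      # cosine schedule is defined over 3T tokens...
STAGE1_TOKENS = 2e12       # ...but stage 1 is cut off at 2T, with the LR still high
STAGE2_TOKENS = 50e9       # stage 2: linear decay to 0 over 50B curated tokens
WARMUP_TOKENS = 1e10       # rough token equivalent of the 2500-step warmup (assumed)

def learning_rate(tokens_seen: float) -> float:
    """Learning rate after `tokens_seen` pretraining tokens."""
    if tokens_seen < WARMUP_TOKENS:                       # linear warmup to the peak
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    if tokens_seen <= STAGE1_TOKENS:                      # stage 1: cosine decay
        progress = (tokens_seen - WARMUP_TOKENS) / (COSINE_HORIZON - WARMUP_TOKENS)
        return FLOOR_LR + 0.5 * (PEAK_LR - FLOOR_LR) * (1 + math.cos(math.pi * progress))
    lr_at_cutoff = learning_rate(STAGE1_TOKENS)           # stage 2: linear decay to 0
    remaining = max(0.0, 1 - (tokens_seen - STAGE1_TOKENS) / STAGE2_TOKENS)
    return lr_at_cutoff * remaining

print(f"LR at the 2T cutoff: {learning_rate(STAGE1_TOKENS):.2e}")  # ≈1e-4, still high
```

Because stage 1 stops at 2T tokens of a 3T-token cosine horizon, the model enters stage 2 at roughly a third of the peak learning rate rather than at the 3e-5 floor.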

To learn more about OLMo, visit the homepage.

OLMo 1.7–7B was trained on Databricks and would not have been possible without the collaborative effort of the following partners: Databricks, AMD, the CSC LUMI supercomputer, and the University of Washington.

Follow @allen_ai on Twitter/X, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.

