OLMo: Open Language Model

A State-Of-The-Art, Truly Open LLM and Framework

Published in AI2 Blog, Feb 1, 2024

AI2 is opening its framework for training and experimenting with large language models on Hugging Face and GitHub with the launch of our first Open Language Model (OLMo). The AI2 LLM framework is intentionally designed to provide access to the data, training code, models, and evaluation code needed to advance AI through open research, and to empower academics and researchers to study the science of language models collectively. This approach lets the AI community explore a broader range of research questions, such as understanding the impact of specific subsets of pretraining data on downstream performance, investigating new pretraining methods, and understanding instabilities.

This effort's first batch of models includes four final variants of our language model at the 7B scale corresponding to different architectures, optimizers, and training hardware, and one model at the 1B scale, all trained on at least 2T tokens. This is the first step in a long series of planned releases, continuing with larger models, instruction-tuned models, and more variants down the line.

Each model comes with the following:

Full training data used for these models from AI2’s Dolma, including the code that produces the training data, and WIMBD for analyzing pretraining data.
Full model weights, training code, training logs, training metrics in the form of Weights & Biases logs, and inference code.
500+ checkpoints per base model, from every 1000 steps during the training process, available as revisions on HuggingFace.
Adapted versions of the 7 billion parameter model, OLMo-7B-Instruct, and an intermediate checkpoint before preference-tuning, OLMo-7B-SFT.
Evaluation code under the umbrella of AI2’s Catwalk and Paloma.
Fine-tuning code and adapted models with Open Instruct.
All code, weights, and intermediate checkpoints are released under the Apache 2.0 License.
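Because the intermediate checkpoints are published as revisions, one of them can be requested through the `revision` argument of `from_pretrained` in `transformers`. The sketch below is illustrative only: `checkpoint_steps` and `load_checkpoint` are hypothetical helpers, the `total_steps` value is a toy number, and the revision string is passed through unchanged, so the real names should be looked up on the model's Hugging Face page.

```python
from typing import List


def checkpoint_steps(total_steps: int, interval: int = 1000) -> List[int]:
    """Return the training steps at which a checkpoint would be saved."""
    return list(range(interval, total_steps + 1, interval))


def load_checkpoint(revision: str, model_id: str = "allenai/OLMo-7B"):
    """Load one intermediate checkpoint by its Hugging Face revision name.

    The revision string is an assumption supplied by the caller; this
    sketch does not guess how the published revisions are named.
    """
    # Imported lazily so checkpoint_steps works without these packages.
    import hf_olmo  # noqa: F401  (the OLMo custom modeling code)
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_id, revision=revision)


if __name__ == "__main__":
    # Toy run length, not the real OLMo schedule.
    print(checkpoint_steps(total_steps=5000))  # [1000, 2000, 3000, 4000, 5000]
```

With one checkpoint every 1000 steps, 500 checkpoints correspond to 500,000 training steps, which is why `len(checkpoint_steps(500_000))` is 500.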

A technical report is available here.

In creating strong open models, we learned from many other open and partially open models and used them as competitive baselines for OLMo throughout the project: EleutherAI’s Pythia Suite, MosaicML’s MPT models, TII’s Falcon models, and Meta’s Llama series of models. We believe the OLMo 7B model is a compelling and strong alternative to popular models like Llama 2, with different strengths and weaknesses.

The evaluation of OLMo 7B against its peer models is shown below. The top nine tasks are our current internal evaluations of choice for pretrained models, and the bottom three are included to round out those on HuggingFace’s Open LLM leaderboard. Note that some of the evaluations in the bottom section were run with different methodologies, so not all the numbers are precisely comparable.

The core evaluation results for OLMo 7B in comparison to its peer models.

OLMo 7B is on par with Llama 2 on many generative and reading comprehension tasks (such as TruthfulQA), but is slightly behind on popular question-answering tasks such as MMLU or BIG-Bench Hard.*

And for the 1B OLMo model:

The core evaluation results for OLMo 1B in comparison to its peer models, which shows that OLMo is in line with them.

Using AI2’s Paloma and available checkpoints (code available on GitHub), we analyze the relationship between how well models predict language and factors of model scale, such as how many tokens they have been trained on. Paloma attempts to represent the many domains in which one would use an LLM more evenly by sampling from each equally. This provides a different view than evaluating on popular web-scraped datasets like C4 that are curated from Common Crawl, where widely varying amounts of each domain are confounded together. As we can see below, where lower bits per byte is better, OLMo 7B is right in line with popular models, with the Llama models falling along OLMo’s training trajectory.

A detailed figure showing the performance of OLMo and peer models on the Paloma evaluation suite. It roughly shows the per-token efficiency of the models.
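For readers unfamiliar with the metric, bits per byte can be sketched as the total negative log-likelihood of a text, converted to bits and divided by the text's length in UTF-8 bytes. This is the standard definition rather than Paloma's exact implementation; the function name and the toy inputs are assumptions for illustration.

```python
import math
from typing import Sequence


def bits_per_byte(token_logprobs: Sequence[float], text: str) -> float:
    """Bits per byte of `text` given natural-log token probabilities.

    Sums the negative log-likelihood over tokens, converts nats to bits,
    and divides by the UTF-8 byte length, so models with different
    tokenizers are compared per byte rather than per token.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    total_nats = -sum(token_logprobs)
    return total_nats / (math.log(2) * n_bytes)


if __name__ == "__main__":
    # Toy input: 4 tokens, each assigned probability 1/4 (2 bits each),
    # covering an 8-byte string, so roughly 1 bit per byte.
    logprobs = [math.log(0.25)] * 4
    print(bits_per_byte(logprobs, "abcdefgh"))
```

Normalizing by bytes instead of tokens keeps the comparison fair when two models split the same text into different numbers of tokens.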

We performed many experiments with the architecture, data, and everything in between to arrive at this first version. The model architecture follows many trends in recent literature: no bias terms (for stability, as in PaLM), the SwiGLU activation function used by PaLM and Llama, rotary positional embeddings (RoPE), and a modified version of the BPE-based tokenizer from GPT-NeoX-20B designed to reduce personally identifiable information. For details on what went wrong, the model architectures we considered, and how to train great LLMs in the future, we recommend you read the OLMo 7B technical report.
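To make two of these components concrete, here is a minimal pure-Python sketch of a bias-free SwiGLU feed-forward layer and of rotary positional embeddings. It is not OLMo's training code. The helper names, the choice to rotate consecutive dimension pairs (some implementations pair dimensions differently), and the `base` of 10000 are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: matrix[i][j]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Compute x @ w for a row vector x and a matrix w (no bias term)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swish(z: float) -> float:
    """Swish / SiLU activation: z * sigmoid(z). Fine for toy-sized inputs."""
    return z / (1.0 + math.exp(-z))


def swiglu_ffn(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    """Bias-free SwiGLU feed-forward: (swish(x W_gate) * (x W_up)) W_down."""
    gate = [swish(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)


def rope(x: Vector, position: int, base: float = 10000.0) -> Vector:
    """Rotate consecutive pairs of x by position-dependent angles (RoPE)."""
    d = len(x)
    if d % 2:
        raise ValueError("RoPE needs an even number of dimensions")
    out: Vector = []
    for i in range(0, d, 2):
        # Pair k = i / 2 uses frequency base ** (-2k / d) == base ** (-i / d).
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

Two properties are easy to check with this sketch: without biases, a zero input to `swiglu_ffn` yields a zero output, and the dot product of two `rope`-rotated vectors depends only on the difference between their positions.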

Using our Open Instruct and Tulu 2, we adapt OLMo to acquire different capabilities and safety measures through fine-tuning and Direct Preference Optimization (DPO). The adapted models demonstrate quick improvement on popular reasoning tasks such as MMLU and TruthfulQA and on safety datasets such as ToxicGen. The SFT checkpoint is the result of supervised fine-tuning of the OLMo 7B model with Open Instruct, the Tulu 2 dataset, and slightly different hyperparameters than were used with the Llama 2 base model versions (labeled AI2’s Tulu below). The Instruct version includes additional training with DPO.
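As a rough illustration of the preference-tuning step, the DPO objective for a single preference pair can be sketched as follows. This is the published DPO loss written in plain Python, not the Open Instruct implementation. The function name and the `beta` value of 0.1 are assumptions and do not reflect the hyperparameters used for OLMo-7B-Instruct.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, from summed sequence log-probs.

    Each argument is the log-probability of a full response under either
    the model being trained (policy) or the frozen reference model.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one, compared with the reference.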

This release is just the beginning for OLMo and the framework. Work is already underway on different model sizes, modalities, datasets, safety measures, and evaluations for the OLMo family. Our goal is to collaboratively build the best open language model in the world, and today we have taken the first step.

Getting Started

OLMo models are easy to use everywhere you’d expect to find and use an LLM today. To use the weights, you must install the OLMo custom modeling code.

pip install ai2-olmo

The weights are available on HuggingFace and can be used with standard inference code:

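# hf_olmo is the OLMo custom modeling code installed by ai2-olmo.
import hf_olmo
from transformers import pipeline

# Build a text-generation pipeline backed by the OLMo 7B weights.
olmo_pipe = pipeline("text-generation", model="allenai/OLMo-7B")
print(olmo_pipe("Language modeling is "))
# Generated text begins: Language modeling is the task of …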

For fine-tuning and more advanced use cases, see the GitHub repository.

Coming soon to the AI2 framework: instruction-tuned variants of OLMo models, full training logs, wandb reports, and more.

Contact us: For questions or feedback you can reach us at olmo at allenai dot org or open an issue on GitHub!

For more on this release, please see the press release with information on the open framework we’re building and comments from our partners.

Acknowledgments

OLMo is built on the shoulders of many great efforts before ours. We took particular inspiration from BigScience’s BLOOM and EleutherAI’s Pythia project, two efforts leading the way in the open language modeling ecosystem.

Thank you to our awesome partners!

OLMo would not be possible without the collaborative effort from AMD, CSC — IT Center for Science (Finland), Mosaic/Databricks, Kempner Institute at Harvard University, and the Allen School at the University of Washington. Additional thanks to EleutherAI, Meta, Stanford CRFM, TogetherAI and HuggingFace.

*In an earlier version of this post, the Llama and Llama 2 numbers were slightly lower, due to a tokenizer spacing issue which introduced an extra space character before each answer being scored.

Written by AI2