Making a switch — Dolma moves to ODC-BY

We’re moving the Dolma dataset to the ODC-BY license. Here’s why.

AI2
AI2 Blog

Dolma’s official logo. It’s dolma written in yellow, round lowercase letters over a blue background.

When we released the Dolma dataset in August 2023, we were excited to lead the way in sharing assets that weren’t normally associated with language model releases. Our goal continues to be open access to AI models and their component parts so that we can all learn how this new technology works.

We initially released Dolma as the pretraining corpus for OLMo, AI2’s open language model; however, since its first release, the research community has used Dolma in unexpected ways: predicting which data closed language models might have been trained on, training multilingual models, or experimenting with more computationally efficient training techniques. Based on feedback from the research community, we recognized that such creative uses of Dolma require a more flexible license.

As of today, Dolma is available under the ODC-BY license. Given how our users have been accessing and using Dolma, we believe this change will make your work significantly easier going forward. We have used the ODC-BY license for several of our previous dataset releases, such as the Tulu-2 instructions mix, our replication of the C4 dataset, and PeS2o, and we believe it gives users the right level of flexibility to pursue open AI research.

ODC-BY is an Open Data Commons license that grants the public permission to use Dolma: users may copy, reproduce, and distribute it, and may also modify it and create derivative works using all or a substantial portion of the dataset. Any redistribution or derivative work must carry an attribution notice stating that it was obtained from Dolma, but the work itself does not have to be licensed under ODC-BY.

If you downloaded Dolma while it was licensed under the ImpACT license, you do not have to redownload the dataset in order to use it under the terms of ODC-BY. Additionally, if you’ve made a derivative of Dolma, you are not required to change the license of that derivative. If you applied the ImpACT license or a similar variation to your derivative, you may keep it or switch to a different license.

We continue to stand by our core values of transparency and responsible AI development, with the goal of facilitating safer AI models. With Dolma, our researchers took precautions such as removing PII and hateful content using the tools and resources available today. That technology is not perfect, however, and efforts to understand the potential harms of AI should continue.

With this license change, we hope to support the needs of our users and enable the community to improve AI through collaboration and open research.

Follow @allen_ai on Twitter/X, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.

--

Our mission is to contribute to humanity through high-impact AI research and engineering. We are a Seattle-based non-profit founded in 2014 by Paul G. Allen.