Camels in a Changing Climate: Enhancing LM Adaptation With Tulu 2

Hamish Ivison
Published in AI2 Blog
Dec 8, 2023

Logo for Tulu 2 — Open Instruction & RLHF models: a camel with a decorative turquoise saddle.

Earlier this year, we released the Tulu suite of models, where we explored the space of instruction fine-tuning in light of the many instruction datasets and models being released (come see our NeurIPS poster on Tuesday @ 10:45 am for more details!). But the community didn’t stop there: since the release of Tulu, better base models have been released, better datasets have been developed, and new adaptation methods (e.g., DPO) have shown promise. With Tulu 2, we have tested and incorporated these recent advancements into a new set of models, testing how far we can push the limits of open-source models and providing the community with an open-source ChatGPT equivalent. Our new Tulu 2 models achieve at or near state-of-the-art performance on AlpacaEval and LMSYS’s Chatbot Arena for openly released models, and at the time of release were state-of-the-art on MT-Bench among all open models.

Screenshot from the LMSYS blog. Tulu-2-DPO-70B achieves the highest Elo rating among open-weight models, just behind Claude-Instant-1.

Alongside our improved models, we have also been hard at work improving our evaluation setup, adding new tasks and improving its speed! By making use of vLLM, we can get results across a varied set of benchmarks in under 30 minutes (for 7B-size models). Our evaluation benchmark now includes MMLU, GSM8k, TydiQA, HumanEval (which we call ‘Codex-Eval’), AlpacaEval, ToxiGen, and TruthfulQA. We report averages across these benchmarks in the post below, but we encourage readers to read our preprint for per-task results!
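To give a concrete sense of how vLLM fits into this, here is a minimal sketch of batched evaluation generation; the model name, prompt format, and sampling settings are illustrative rather than the exact configuration in our evaluation code.

```python
# Minimal sketch of batched generation with vLLM for benchmark evaluation.
# Model name, prompt format, and sampling settings are illustrative and not
# the exact configuration used in the open-instruct evaluation scripts.
from vllm import LLM, SamplingParams

llm = LLM(model="allenai/tulu-2-dpo-7b")  # any HF-format checkpoint works
sampling = SamplingParams(temperature=0.0, max_tokens=512)

prompts = [
    "<|user|>\nWhat is 17 * 23?\n<|assistant|>\n",
    "<|user|>\nSummarize the DPO objective in one sentence.\n<|assistant|>\n",
]

# vLLM batches and schedules requests internally, which is what makes running
# a full benchmark suite fast compared to decoding prompts one at a time.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```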

Let’s briefly go over what’s changed since Tulu 1. The two biggest additions in Tulu 2 are a new dataset mixture and DPO training on preference data. We also switched to Llama 2 as the base model since our original Tulu release, which provides a large boost in performance on its own.

1 — New dataset mixture

Since the release of Tulu 1, the community has doubled down on distilled datasets, with methods like Evol-Instruct and Orca being used to improve the quality of data distilled from existing strong models.

Additionally, recent work (e.g., LIMA, LIMIT) has suggested that “a few high-quality samples are all you need”. Inspired by this, and wanting to reduce compute costs, we downsample the larger components of our original mixture, such as FLAN. Our Tulu 2 mixture contains 100k fewer samples than our original mixture while performing significantly better! We suspect further, more in-depth data curation may lead to additional gains and shrink the mixture even more.

Average performance on our benchmark suite for Llama 2 models of various sizes finetuned on our V1 and V2 mixes. Our V2 data mixture performs better on average across all sizes.
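As a rough illustration of what this kind of downsampling looks like in practice, here is a minimal sketch using the Hugging Face datasets library; the dataset paths, subsample size, and the assumption of a shared column schema are all illustrative, not our exact recipe.

```python
# Minimal sketch of downsampling one large component (e.g., FLAN) before
# mixing instruction datasets. Dataset paths, the subsample size, and the
# shared column schema are illustrative assumptions, not the Tulu 2 recipe.
from datasets import load_dataset, concatenate_datasets

flan = load_dataset("my-org/flan-formatted", split="train")    # hypothetical path
other = load_dataset("my-org/other-sft-data", split="train")   # hypothetical path

# Keep a fixed random subsample of the large FLAN component.
flan_small = flan.shuffle(seed=42).select(range(50_000))

# Assumes both datasets share the same columns (e.g., "messages").
mixture = concatenate_datasets([flan_small, other]).shuffle(seed=42)
print(len(mixture))
```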

2 — DPO Training

Inspired by the success of Zephyr-beta, we applied and scaled their DPO recipe to Llama 2 models of all sizes. Surprisingly, we found that this worked straight away and led to significant improvements on open-ended generation benchmarks like AlpacaEval and MT-Bench. Interestingly, DPO finetuning did not significantly degrade performance on most capabilities, apart from TydiQA (likely due to the lack of multilingual data in our finetuning and DPO training data!).

AlpacaEval performance for Tulu 2 models trained with supervised finetuning (SFT) only or SFT and direct preference optimization (DPO). DPO training greatly improves AlpacaEval performance across all sizes.
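For readers unfamiliar with DPO, the heart of the method is a simple contrastive loss over log-probabilities from the policy and a frozen reference model. Below is a minimal sketch of that loss in isolation, assuming the per-sequence log-probabilities have already been computed; it is a simplified illustration, not our training code, which lives in open-instruct.

```python
# Minimal sketch of the DPO loss, assuming summed per-sequence
# log-probabilities for the chosen and rejected responses have already been
# computed under both the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023)."""
    # Implicit rewards: scaled log-ratio of policy vs. reference per response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```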

In our original Tulu paper, we also noted a strong correlation between model output length and performance on model-based benchmarks like AlpacaEval. We found that this trend held for more recent AlpacaEval results, and that while DPO improved AlpacaEval performance significantly, it also increased model output lengths:

Average output length against AlpacaEval win rate for models on the AlpacaEval leaderboard. Tulu models are green stars. We draw an arrow between each model pre- and post-DPO training. DPO training increases both win rate and verbosity, although our best model is not as verbose as other high-performing models.

However, our models are still significantly less verbose than most other models with similar scores.

While this all suggests DPO training is very useful for chat performance, we note that training on GPT-distilled outputs and evaluating with GPT-4-based metrics may result in inflated scores. To test our model more rigorously, the folks at LMSYS added Tulu 2+DPO 70B, our largest DPO-trained model, to Chatbot Arena, where real-world users could compare our model to others using their own prompts. We find that Tulu 2+DPO 70B achieves the same rating as GPT-3.5-turbo-0314 and is the best overall open model tested!

Additional Experiments

We also ran several other ablations and experiments that are likely to be of interest to anyone working on LLM finetuning! I’ll highlight two key ones here: QLoRA training and CodeLlama training.

1 — QLoRA Training

We initially explored QLoRA training as a way to further reduce compute costs, allowing us to fit a 70B model on a single 80GB A100. We started by exploring whether QLoRA training with various hyperparameters could match the original Alpaca model (i.e., when training only on Alpaca data). We found that while it was easier to match performance on classification tasks like MMLU, performance on open-ended generation tasks like AlpacaEval tended to fall behind full finetuning.

As such, we eventually decided against using QLoRA. However, since LoRA modules may prove useful on their own, we have trained and released QLoRA-trained Llama 2 models on the Tulu 2 mixture and benchmarked them against our fully finetuned models. These results mirror our earlier experiments: while QLoRA improves significantly over the base model, it doesn’t match full finetuning performance.

Average performance on our evaluation suite for models fully finetuned and models trained using QLoRA. QLoRA training underperforms full finetuning on average across model sizes.
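For reference, a QLoRA setup along these lines looks roughly like the sketch below; the model name, 4-bit quantization settings, and LoRA hyperparameters are illustrative and not the exact configuration behind our released QLoRA-trained models.

```python
# Minimal sketch of 4-bit QLoRA finetuning setup with transformers + peft.
# Model name, quantization settings, and LoRA hyperparameters are
# illustrative; they are not the exact configuration used for the released
# QLoRA-trained Tulu 2 models.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```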

2 — CodeLlama Training

One of the biggest drawbacks of using Llama 2 is its poor code performance, due to its pretraining mixture. We explored remedying this by using Llama 2 models further pretrained on code (i.e., CodeLlama) as the base model instead of Llama 2; we call these models CodeTulu 2. We found that while coding performance improved significantly, performance on other tasks tended to lag behind the base Llama 2 models (compare the average performance on non-coding tasks between the two model families). However, these models may prove more useful than our base Tulu 2 models for coding or structured output tasks!

Left: Average performance on our benchmark suite, minus Codex-Eval, for Tulu 2 and CodeTulu 2 models. Right: Codex-Eval performance for Tulu 2 and CodeTulu 2 models. CodeTulu 2 models greatly outperform Tulu 2 models at Codex-Eval, but underperform on average across other tasks.

Wrapping it all up

As part of our commitment to open science, we have released all aspects of this project publicly:

Our models and datasets are available here: https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101

Our training and evaluation codebase is here: https://github.com/allenai/open-instruct

Our Jax codebase (used for large model training) is here: https://github.com/hamishivi/EasyLM

We hope that our models, data, and code make studying LLM training easier and provide a solid foundation for further improvements and investigations into LLM adaptation! We also encourage interested readers to read our paper for more details and for a more nuanced view of which skills different finetuning methods help and hinder.

Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
