Thoughts on Llama 3

Matthias Plappert, Durk Kingma, Max Chen, Cage Zhong, and Penny Deng

Meta has announced the third version of its open large language model, Llama 3. In this blog post we dive into the details and share our thoughts on how Llama 3 will shape the industry.

Key Findings

  • Llama 3 is the latest model released by Meta and currently comes in 8B and 70B variants with a 400B variant currently in training.
  • Llama 3 was pre-trained on 15T tokens of mostly English text and code; about 5% of the training data is non-English, in preparation for future multilingual models.
  • Meta has collected 10M manually annotated data points that went through multiple rounds of quality control for their instruction tuning dataset.
  • Overall data is key: Meta appears to have gone to extraordinary lengths to curate a very large pre-training dataset and has spent significant resources on the post-training instruction tuning dataset.
  • The resulting models are very strong: Llama 3 70B is competitive with some frontier models and ranks #2 after GPT-4 Turbo on real-world tests in English conducted by LMSYS. It is likely that Llama 3 400B will match or outperform most frontier models on English-language tasks once it finishes training.
  • By training on 15T tokens, Llama 3, and especially its smaller variants, is trained for significantly longer than earlier models. This makes sense since the additional training compute is amortized during inference, which again highlights the importance of inference as these models get broadly deployed and moved on-device. For example, if an 8B model is now good enough for tasks that previously required a 70B model, the cost per 1M tokens drops by a factor of 4.5x, from $0.90 to $0.20 (based on together.ai's pricing; see the short calculation after this list).
  • We estimate that Llama 3 400B will use a compute budget of about 5.4e25 FLOPS, which is on the same order of magnitude as GPT-4 training and just below the threshold of the Biden Administration’s executive order. Training is expected to take about 97 days on one of Meta’s 16k Nvidia H100 clusters (this assumes smooth sailing, so the real training time could be significantly longer).
  • Llama 3 400B is pushing into frontier model territory, which will further increase the competitive pressure on companies like OpenAI, Anthropic, Mistral and Google. Once Llama 3 400B is released, the open source community will, for the first time, have access to a frontier model. While this will surely enable interesting applications, we suspect that the barriers to working with such a large model remain high, and we therefore expect the 8B model to stay the most popular option for the open source community.
  • The Llama 3 model family still has a few important shortcomings: lack of long context, missing support for multilingual use cases, and weaker reasoning and problem-solving skills (as quantified by the MATH benchmark).
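
To make the cost argument from the list above concrete, here is a minimal back-of-the-envelope sketch. The per-1M-token prices are the together.ai figures quoted above; the monthly token volume is a purely illustrative assumption.

```python
# Back-of-the-envelope serving cost: an over-trained 8B model vs. a 70B model,
# assuming the 8B is good enough for the task. Prices are the together.ai figures
# quoted above (USD per 1M tokens); the workload size is an illustrative assumption.
PRICE_PER_1M_TOKENS = {"llama-3-8b": 0.20, "llama-3-70b": 0.90}

def serving_cost_usd(model: str, total_tokens: float) -> float:
    """Cost of processing `total_tokens` tokens at the given per-1M-token price."""
    return PRICE_PER_1M_TOKENS[model] * total_tokens / 1e6

workload_tokens = 10e9  # hypothetical workload: 10B tokens per month
cost_8b = serving_cost_usd("llama-3-8b", workload_tokens)    # $2,000
cost_70b = serving_cost_usd("llama-3-70b", workload_tokens)  # $9,000
print(f"8B: ${cost_8b:,.0f}  70B: ${cost_70b:,.0f}  savings: {cost_70b / cost_8b:.1f}x")  # 4.5x
```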

Background

Meta has been training LLMs for a while now: it published Llama 1 in February 2023, followed by Llama 2 in July 2023. Llama 2 in particular has been hugely influential: it has been used extensively by the open-source community, its 7B model alone has been downloaded from HuggingFace more than 1 million times, and it remains one of the go-to models to build on top of.

Technical Details of Llama 3

While the paper is not out yet, the blog post and model card contain lots of information on the technical details of Llama 3.

As before, Llama 3 is a dense Transformer model that was pre-trained on a very large amount of publicly available data, about 15T tokens (we’ll discuss the dataset separately, as it is one of the most interesting aspects of this release). Also as before, Llama 3 is a model family that comes in a variety of sizes: the smallest model now has 8B parameters and there is still a 70B model as well. Meta is also pushing the limits, though, and is currently training a 400B parameter model, which is a significant step up and would make Llama 3 400B the largest open source model ever released.

Llama 3 increases the context window to 8k, which is a modest improvement over Llama 2’s context window of 4k. However, compared to other models, this is quite limited: OpenAI’s GPT-4 Turbo model has up to 128k context, Anthropic’s Claude 3 model supports up to 200k tokens, Google’s Gemini model supports up to 1M tokens and Mistral’s Large model supports up to 32k tokens. So the context size of Llama 3 is a bit of a disappointment.

As mentioned before, the pre-training stage used a dataset of 15T tokens, which is a massive amount. This is perhaps the most interesting and surprising aspect of this model, and we’ll discuss the size of the pre-training dataset separately. Other than its size, the pre-training data was collected from “publicly available sources” (as is tradition in the industry these days, Meta doesn’t elaborate on the details). The majority of the data is English text, but the amount of code is 4x what was used for Llama 2 (which significantly improves Llama 3’s coding abilities over Llama 2). Meta also decided to include about 5% non-English text in preparation for a future multilingual version. Still, Llama 3 is mostly an English language model, and we expect OpenAI’s, Google’s and Mistral’s models to do much better on non-English tasks.

Each Llama 3 model also comes in two variants: a snapshot right after pre-training and a model that was further instruction-tuned. Instruction-tuning is a very important step to make these models useful, and it appears that Meta went to extraordinary lengths to really nail it. They use a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO) and direct preference optimization (DPO). A key factor for the success of this step is the dataset used, since instruction-tuning requires a collection of prompts, model responses, and labels that indicate which model response was preferred. Meta appears to have collected more than 10M such data points that went through multiple rounds of human quality control, which is a very serious undertaking and was likely both time-consuming and expensive.
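
To give a flavor of how such preference data is used, here is a minimal sketch of the DPO objective, one of the post-training methods Meta mentions. This is a generic textbook-style implementation, not Meta's actual training code; the β value and the toy batch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss: each argument holds the summed log-probabilities of the
    preferred ("chosen") or dispreferred ("rejected") responses under the policy
    being tuned or under a frozen reference model."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage the policy to rank the human-preferred response above the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
batch = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*batch))
```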

Discussion

Performance

The Llama 3 family of models performs extremely well in benchmarks. For the smaller models (the 8B and 70B), performance is significantly better than for other recently released models like Google’s Gemma and Mistral’s 7B models. Llama 3 70B is even competitive with Google’s Gemini Pro 1.5 model, which is quite an achievement.

Performance of Llama 3 compared against similarly sized models. Image taken from the Llama 3 blog post.

An interesting comparison is between Llama 3 8B, Llama 3 70B and Anthropic’s excellent Claude 3 Haiku model. While Llama 3 8B performs worse across all benchmarks, it is more affordable than Claude 3 Haiku: as of this writing, Llama 3 8B costs between $0.05 and $0.20 per 1M input tokens and between $0.20 and $0.25 per 1M output tokens, depending on the provider. In contrast, Anthropic charges $0.25 per 1M input and $1.25 per 1M output tokens. Llama 3 70B is stronger than Claude 3 Haiku (except on MATH) but also significantly more expensive.
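
To illustrate what these price differences mean in practice, here is a small cost comparison for a hypothetical workload. The per-token prices are the ones quoted above (we use the high end of the Llama 3 8B range); the token volumes and the 1:3 input-to-output split are arbitrary assumptions about the workload.

```python
# Illustrative monthly cost for a hypothetical chat workload, using the prices
# quoted above (USD per 1M tokens, April 2024). The token volumes and the 1:3
# input-to-output split are assumptions, not measured figures.
prices = {
    "llama-3-8b (high end)": {"input": 0.20, "output": 0.25},
    "claude-3-haiku":        {"input": 0.25, "output": 1.25},
}
input_tokens, output_tokens = 100e6, 300e6  # hypothetical monthly volumes

for model, p in prices.items():
    cost = p["input"] * input_tokens / 1e6 + p["output"] * output_tokens / 1e6
    print(f"{model:24s} ${cost:,.0f}/month")
# llama-3-8b (high end)    $95/month
# claude-3-haiku           $400/month
```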

Comparison of Llama 3 8B, Llama 3 70B and Claude 3 Haiku in terms of performance and cost. Figures taken from the Llama 3 blog post and Claude 3 blog post. Prices taken from Together, Perplexity, Replicate and Anthropic on April 25, 2024.

We also compare Llama 3 70B against Databricks’ DBRX, which was released only about a month earlier and was positioned as a competitor to medium-sized models. DBRX is a 132B MoE model that was trained on 12T tokens. Clearly Llama 3 70B is a substantial improvement and significantly outperforms the Databricks model across the board. However, it should be noted that DBRX supports up to 32k tokens of context and activates only 36B parameters during batch-size-1 inference, so it still has some advantages over Llama 3.
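
To see why the lower active parameter count matters, here is the common ~2 · N_active FLOPs-per-token approximation for inference applied to both models. This is a generic rule of thumb that ignores attention over the KV cache and other overheads, not a vendor-reported figure.

```python
# Rough inference cost per generated token using the ~2 * N_active approximation.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

llama3_70b = flops_per_token(70e9)  # dense: all 70B parameters are active -> ~1.4e11
dbrx = flops_per_token(36e9)        # MoE: only ~36B parameters active     -> ~7.2e10
print(f"DBRX needs roughly {llama3_70b / dbrx:.1f}x fewer FLOPs per token")  # ~1.9x
```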

Comparison of Meta’s Llama 3 70B and Databricks’ DBRX 132B MoE model. Numbers for Databricks DBRX are taken from the DBRX blog post.

Finally, the still unfinished Llama 3 400B model is even more impressive: it can compete with, and sometimes even overtake, current frontier models from OpenAI, Anthropic, and Google even though it has not yet finished training. This is quite an important point: very soon we’ll have an open-source model that is comparable to, and in part outperforms, proprietary frontier models.

Preliminary performance of the (still unfinished and unreleased) Llama 3 400B model benchmarked against current frontier models: GPT-4 Turbo (2024-04-09), Claude 3 Opus, and Gemini Pro 1.5. Data for Llama 3 400B taken from the Llama 3 blog post, and data for the others taken from OpenAI’s very recent benchmark results. Note: Llama 3 performance is evaluated on GSM-8K, whereas all other models are evaluated on MGSM, which contains the same math problems translated into different languages.

However, there is one noticeable gap: the MATH dataset. GPT-4 Turbo is on another level and Llama 3 400B falls behind all other frontier models on this benchmark. We believe that this matters since MATH is an especially challenging dataset that tests a model’s reasoning abilities.

It is also important to note that Llama 3 will only excel on English benchmarks. This is apparent from the fact that Llama 3 is evaluated on GSM-8K, whereas the other frontier models are evaluated on the similar but multilingual MGSM dataset. This lack of support for other languages is another severe limitation of Llama 3, which makes it unsuitable for non-English applications. As mentioned before, Meta is actively working on this, though, and we believe this is a transient issue that will be resolved quickly.

Finally, we like to look at the LMSYS Chatbot Arena benchmark to gauge the real-world performance of new models. In brief, Chatbot Arena pits two models against each other and the user picks their preferred answer without knowing which model generated it. Over time, this makes it possible to compute an Elo ranking of models.
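
For intuition, here is a minimal sketch of how an Elo-style rating can be computed from pairwise preference votes like the ones Chatbot Arena collects. The K-factor and the toy votes are illustrative assumptions, and LMSYS's actual methodology differs in its details (it fits ratings statistically rather than updating them one vote at a time).

```python
# Minimal online Elo sketch over pairwise "which answer do you prefer?" votes.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the first model wins, given the two current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Move the winner's rating up and the loser's down by the surprise of the outcome."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
toy_votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for winner, loser in toy_votes:
    update(ratings, winner, loser)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```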

Ranking of Llama 3 70B and 8B on the English-only category on the LMSYS Chatbot Arena Leaderboard as of 2024-04-24.

At the time of this writing, Llama 3 70B Instruct ranks just below GPT-4 Turbo, Claude 3 and Gemini 1.5 Pro overall, and in the English-only category it comes in 2nd place, right after GPT-4 Turbo. Again, this is quite an achievement: a 70B model is competitive with, and even outperforms, much larger proprietary models. We also believe Meta’s effort to nail post-training instruction tuning pays off here and makes these models really shine.


Dataset Size and Training vs. Inference Compute

Perhaps the most surprising part of the Llama 3 release was the amount of data that Meta chose to train these models on: 15T tokens of text. This is very far from the so-called Chinchilla optimum, which indicates that a compute-optimal training budget for an 8B model would be ~200B tokens; in other words, Llama 3 8B used a token budget roughly 75x larger than compute-optimal.
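
A quick sanity check of the ~75x figure, using the ~200B-token compute-optimal budget cited above (equivalent to roughly 25 tokens per parameter; the exact Chinchilla coefficient depends on which fit you use):

```python
# Sanity check of the over-training factor relative to the Chinchilla-optimal budget.
# The ~25 tokens/parameter coefficient matches the ~200B-token figure cited above;
# other sources quote ~20 tokens/parameter, which gives a similar order of magnitude.
params = 8e9
tokens_per_param_chinchilla = 25                            # assumption consistent with the text
chinchilla_tokens = params * tokens_per_param_chinchilla    # ~2e11 = 200B tokens
actual_tokens = 15e12
print(f"over-training factor: {actual_tokens / chinchilla_tokens:.0f}x")  # 75x
```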

This might seem confusing at first. But an increasingly important consideration for these models is the cost of serving them at scale during inference (we discussed this in our recent Sora blog post as well).

Basically, the Chinchilla-optimal criterion only considers the optimal compute budget for training, but in reality these models get deployed, increasingly broadly and to many users. It therefore makes perfect sense to overtrain models, since training compute is paid once, whereas inference compute is incurred continuously and is a direct function of how many users access the model. The inefficiency of over-training a smaller model easily gets amortized over the inference lifetime of the model (since the model is smaller, each request incurs much less compute). In other words, what Meta chose to do makes perfect sense if you optimize for a model that you plan to deploy very broadly.
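
A rough sketch of the amortization argument, combining the standard C ≈ 6·N·D training approximation with a ~2·N FLOPs-per-token inference approximation. The lifetime token volume and the hypothetical 2T-token budget for the larger model are illustrative assumptions, chosen only to show how the comparison works.

```python
# Rough total-compute comparison (training + inference) between an over-trained 8B
# model and a larger 70B model, assuming the 8B is good enough for the workload.
# Uses C_train ~= 6 * N * D and C_infer ~= 2 * N per token served.
def total_compute(params: float, train_tokens: float, inference_tokens: float) -> float:
    return 6 * params * train_tokens + 2 * params * inference_tokens

lifetime_tokens = 1e13  # assumption: 10T tokens served over the model's deployment lifetime
c_8b = total_compute(8e9, 15e12, lifetime_tokens)    # over-trained small model
c_70b = total_compute(70e9, 2e12, lifetime_tokens)   # hypothetical less-trained large model
print(f"8B over-trained: {c_8b:.1e} FLOPs")   # ~8.8e23
print(f"70B            : {c_70b:.1e} FLOPs")  # ~2.2e24
```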

An interesting side note here is that figuring out how long your model keeps improving used to be fairly difficult, because most training runs use a learning rate schedule that requires deciding on the total token budget a priori. We speculate that Meta’s recently announced schedule-free optimizer research was one of the ingredients that made it easy for Meta to adopt a wait-and-see mindset, where they did not have to decide on a token budget in advance and could instead train for as long as they saw improvements, which perhaps turned out to be longer than anticipated.
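
To see why the token budget traditionally has to be fixed up front, here is a generic cosine-with-warmup learning rate schedule: the decay is a function of the total number of training steps, so deciding mid-run to train longer changes every remaining learning rate. This is a textbook schedule shown for illustration, not Meta's exact recipe, and our link to the schedule-free optimizer is speculation on our part.

```python
import math

def cosine_lr_with_warmup(step: int, total_steps: int, peak_lr: float,
                          warmup_steps: int = 2000, min_lr_ratio: float = 0.1) -> float:
    """Generic cosine schedule: `total_steps` (i.e. the token budget) must be known
    before training starts, because the decay is anchored to it."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)

# Deciding after the fact to train 2x longer changes the LR at every remaining step.
print(cosine_lr_with_warmup(step=50_000, total_steps=100_000, peak_lr=3e-4))
print(cosine_lr_with_warmup(step=50_000, total_steps=200_000, peak_lr=3e-4))
```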

We believe that this finding about very long training is a big deal: it will further decrease the cost of running these models at scale, since inference compute is directly proportional to model size and Llama 3 demonstrates that you can squeeze much more performance out of smaller models than previously thought.

Training Compute

We like to look at compute budgets at Factorial Funds (see also our recent blog post on Sora’s compute estimate, for example). For Llama 3, the calculations are fortunately quite straightforward thanks to Meta releasing a model card with some details.
 

The total compute spent by Llama 3 8B and 70B for pre-training. Figure taken from the Llama 3 model card.

Meta states that they trained Llama 3 on 16k Nvidia H100 GPUs and achieved about 400 TFLOPS of throughput per GPU. Combining the GPU hours reported in the model card (1.3M for the 8B model and 6.4M for the 70B model) with this throughput, we can compute the total pre-training compute budget for both models:

  • Llama 3 8B:     1.3e6 GPU hours * 400e12 FLOPS * 3600 s/hour = 1.872e24 FLOPS
  • Llama 3 70B:    6.4e6 GPU hours * 400e12 FLOPS * 3600 s/hour = 9.216e24 FLOPS
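
The same arithmetic as a short script, multiplying GPU hours by sustained FLOPS per GPU and by seconds per hour:

```python
# Pre-training compute from the GPU hours reported in the model card and the
# ~400 TFLOPS/GPU sustained throughput Meta quotes.
SECONDS_PER_HOUR = 3600
FLOPS_PER_GPU = 400e12

gpu_hours = {"llama-3-8b": 1.3e6, "llama-3-70b": 6.4e6}
for model, hours in gpu_hours.items():
    total_flop = hours * SECONDS_PER_HOUR * FLOPS_PER_GPU
    print(f"{model}: {total_flop:.3e} FLOPs")  # 1.872e24 and 9.216e24
```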

We can also estimate the expected compute using the shorthand formula C = 6 * N * D, where N is the number of model parameters and D is the number of training tokens. We can use this to also extrapolate to the 400B model, for which Meta has not yet released any compute figures:

  • Llama 3 8B:        6 * 8e9 * 15e12 = 7.2e23 FLOPS
  • Llama 3 70B:      6 * 70e9 * 15e12 = 6.3e24 FLOPS
  • Llama 3 400B:    6 * 400e9 * 15e12 = 3.6e25 FLOPS
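
And the corresponding script for the C = 6 * N * D shorthand, using D = 15T tokens:

```python
# Second estimate via the C ~= 6 * N * D approximation, with D = 15T tokens.
TRAIN_TOKENS = 15e12
param_counts = {"llama-3-8b": 8e9, "llama-3-70b": 70e9, "llama-3-400b": 400e9}
for model, params in param_counts.items():
    print(f"{model}: {6 * params * TRAIN_TOKENS:.1e} FLOPs")  # 7.2e23, 6.3e24, 3.6e25
```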

Obviously these estimates disagree, and we believe the first estimate is more accurate (since it is based on figures directly published by Meta, and because Meta says Llama 3 was trained on “15T+ tokens”). If we assume that the actual compute for the 400B model is about 50% more than our second estimate, we arrive at a total compute budget of 5.4e25 FLOPS. This is quite a lot: for reference, the Biden Administration’s executive order has reporting requirements for model training runs that utilize more than 1e26 FLOPS of compute, and Our World in Data estimates that GPT-4 used 2.1e25 FLOPS for training (with large error bars, though).

In terms of wall-clock time, using the first estimate (and our extrapolation to the 400B model assuming 5.4e25 FLOPS), on 16k Nvidia H100 GPUs with an average throughput of 400 TFLOPS per GPU, the total training time comes out to:

  • Llama 3 8B:         ~3 days
  • Llama 3 70B:       ~17 days
  • Llama 3 400B:     ~97 days
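
These wall-clock figures follow from dividing each compute budget by the cluster's aggregate sustained throughput (16k GPUs × 400 TFLOPS each ≈ 6.4e18 FLOPS). The 400B figure uses our 5.4e25 extrapolation from above and assumes uninterrupted training at full throughput.

```python
# Wall-clock training time: compute budget / (num GPUs * sustained FLOPS per GPU).
NUM_GPUS, FLOPS_PER_GPU = 16_000, 400e12
cluster_flops = NUM_GPUS * FLOPS_PER_GPU  # ~6.4e18 FLOPs/s aggregate

budgets = {"llama-3-8b": 1.872e24, "llama-3-70b": 9.216e24, "llama-3-400b": 5.4e25}
for model, flop in budgets.items():
    days = flop / cluster_flops / 86_400
    print(f"{model}: ~{days:.1f} days")  # ~3.4, ~16.7, and ~97.7 days
```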
     