Llama 2 and Llama 3 are two generations of Meta's large language model, Llama. Both are open source and built on a standard transformer architecture, but their capabilities differ considerably: Llama 3 was trained on far more data and its largest model has far more parameters, leading to greater capabilities and more emergent behaviors.
Llama 2:
Released in July 2023.
Trained on smaller datasets.
Available models include 70B, 13B, and 7B.
Context length of 4,096 tokens.
Primarily a text-only LLM.
Open-source.
Llama 3:
Released in April 2024.
Trained on much larger datasets.
Much larger context length: 128,000 tokens as of Llama 3.1.
Available models include 405B (added with Llama 3.1), 70B, and 8B.
Supports up to 30 languages.
Designed to be multi-modal eventually.
Open-source.
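To make the context-length difference concrete, here is a minimal sketch that checks whether a document would fit in each model's context window, using the figures listed above. The `ModelSpec` class, the `fits` helper, the output reserve, and the four-characters-per-token heuristic are all assumptions made for illustration; a real check would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    context_tokens: int


# Context lengths as listed above.
LLAMA_2 = ModelSpec("Llama 2", 4_096)
LLAMA_3 = ModelSpec("Llama 3", 128_000)

# Assumption: roughly 4 characters per token for English text.
# This is a crude heuristic, not either model's real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate using ceiling division on character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits(spec: ModelSpec, text: str, reserve_for_output: int = 512) -> bool:
    """True if the estimated prompt plus an output reserve fits the window."""
    return estimate_tokens(text) + reserve_for_output <= spec.context_tokens


document = "word " * 20_000  # 100,000 characters, about 25,000 estimated tokens

print(fits(LLAMA_2, document))  # False
print(fits(LLAMA_3, document))  # True
print(LLAMA_3.context_tokens / LLAMA_2.context_tokens)  # 31.25
```

On these numbers the larger window holds roughly 31 times as much text, which is the difference between a few pages and a short book.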
Llama 2 launched in July 2023 and was, at the time, Meta's most capable large language model. Llama 3 arrived about nine months later, is built on much more training data, and is considerably more capable. It outperforms Llama 2 across the board: it's faster; has a much larger context window; is designed to eventually handle images, video, and audio as both input and output; and supports a wide range of languages.
In comparison, Llama 2 is quite limited. It focuses mainly on English over other languages, and its training set was far smaller. Its largest model has only a fraction of the parameters of the largest Llama 3 models, including those in the latest version, 3.1.
Llama 2:
Took 22,000 petaflop/s-days of compute to train.
Trained on two trillion tokens of data.
Trained on older hardware.
Trained on data up to 2023.
Mostly trained on English data.
Llama 3:
Expensive to train: over 440,000 petaflop/s-days of compute.
Trained on 15 trillion tokens -- around seven times that of Llama 2.
Used so much hardware time that Meta had to limit model training.
Used millions of tokens of human input for fine-tuning.
Trained on data up to 2024.
More than 5% of the training data was not in English.
The main advantage of Llama 3 is that it was trained on more data: over 15 trillion tokens in pre-training, followed by extensive fine-tuning with human input. Its top model, 405B, is so named because it has 405 billion parameters, the learned weights that encode what the model picked up from that training data.
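As a rough illustration of how token and parameter counts translate into training compute, the sketch below applies the common rule of thumb that training a dense transformer takes about 6 × parameters × tokens floating-point operations, then converts the result to petaflop/s-days (one petaflop per second sustained for a day). The rule of thumb and the helper names are assumptions for illustration, not Meta's own accounting, so the results are ballpark figures and need not match officially reported numbers.

```python
SECONDS_PER_DAY = 86_400
PETA = 1e15


def training_flops(parameters: float, tokens: float) -> float:
    """Approximate training compute: 6 * N * D (rule of thumb, an assumption)."""
    return 6 * parameters * tokens


def petaflop_s_days(flops: float) -> float:
    """Convert total FLOPs to petaflop/s-days."""
    return flops / (PETA * SECONDS_PER_DAY)


llama2_tokens = 2e12   # two trillion tokens
llama3_tokens = 15e12  # 15 trillion tokens

# Ratio of training data between the two generations.
print(llama3_tokens / llama2_tokens)  # 7.5

# Estimated compute for the 405B model on 15 trillion tokens.
flops_405b = training_flops(405e9, llama3_tokens)
print(f"{flops_405b:.3e}")                 # 3.645e+25
print(round(petaflop_s_days(flops_405b)))  # 421875
```

The estimate for the 405B model lands in the hundreds of thousands of petaflop/s-days, which gives a sense of why hardware time became a constraint.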
Meta introduced new training practices for Llama 3 to optimize the process, including automated error detection and the use of newer hardware. Llama 3 was trained on tens of thousands of Nvidia H100 GPUs, and Meta specifically limited how long the 70B model was trained because the hardware time was needed elsewhere.
Llama 3 was much more expensive to train, though. Its use of newer hardware, and the demands placed on that hardware, meant it cost Meta a lot of money to train.