Meta's Llama 4 'herd' controversy and AI contamination, explained

April 9, 2025

Meta introduced the fourth generation of its wildly popular Llama generative artificial intelligence program, called the Llama 4 "herd," over the weekend. Almost immediately, a debate ensued that was somewhat uncharacteristic of previous releases.

The herd is a collection of three models, dubbed Behemoth, Scout, and Maverick. Meta says that Behemoth, which is still in development, will be "one of the smartest LLMs in the world" when it's done. It uses a total of two trillion neural "weights," or parameters, which would be the largest count publicly disclosed by any research group.

Behemoth was then used to create the two smaller models, Scout and Maverick, via an increasingly popular approach known as "distillation," in which the heavy compute work invested in the larger model is passed down to the smaller ones, maximizing the return on that investment.
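
Meta has not published its exact distillation recipe, but the core idea is to train the small model to match the big model's output distribution. A minimal PyTorch sketch of the standard loss (the temperature value and shapes here are illustrative, not Meta's actual setup):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Train the student to match the teacher's softened output distribution.

    Both tensors have shape (batch, vocab); a temperature > 1 smooths the
    teacher's probabilities so the student also learns relative rankings.
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2
```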

Scout can run on a single Nvidia GPU chip and handle an extraordinarily large "context window" (the amount of input data it can juggle in memory) of 10 million "tokens," meaning words, characters, or multimedia data points. 
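
For a rough sense of scale, a token is typically a word or word fragment, and can be counted with any off-the-shelf tokenizer. A quick sketch using Hugging Face's transformers library (the GPT-2 tokenizer here is just an ungated stand-in; Llama 4's own tokenizer will count somewhat differently):

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer is freely downloadable and fine for a ballpark count.
tok = AutoTokenizer.from_pretrained("gpt2")
text = "Meta introduced the fourth generation of its Llama models."
print(len(tok(text)["input_ids"]))  # around a dozen tokens for this sentence
# At roughly 1.3 tokens per English word, a 10-million-token context window
# holds on the order of 7.5 million words -- about 75 long novels at once.
```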

Maverick is a somewhat larger model than Scout and is meant to be run in a "distributed" fashion across multiple computers, resulting in what Meta says is its most efficient model to date in terms of the cost per million input and output tokens.

All of that sounds like pretty standard fare for a new large language model. However, controversy erupted almost immediately after Saturday's debut, over what Meta is claiming for the models and how the company arrived at those claims.

A rumor circulating on X and Reddit over the weekend, as related by TechCrunch's Kyle Wiggers, cited what various postings purported to be remarks from someone on Meta's AI staff describing the model's struggles to deliver performance. No evidence has been provided to substantiate the rumor's authenticity, but it spread rapidly.

The anonymous post in question, written in Chinese, was translated in the re-postings. In the post, the supposed Meta staff member claims Llama 4 struggled to reach "state-of-the-art performance" and that the company tried a variety of measures to boost the apparent performance of Llama 4. The anonymous post claims that the individual has resigned in protest from Meta.

Noted AI scholar and critic Gary Marcus picked up the post on his Substack on Monday, writing that the plausible-sounding rumor suggests Meta has struggled with the same issue as OpenAI and others, namely, diminishing returns from "scaling up" AI models with ever more compute.

The issue Marcus raises, and that others have raised on X while pointing to the anonymous post, is that in adjusting Llama 4, Meta may have effectively cheated on benchmark tests of AI performance.

Technically, the issue, which has been mentioned increasingly in the field, is known as "contamination," where neural networks are trained on data that ends up being used as test questions, like a student who has access to the answers ahead of an exam.
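
Labs typically screen for contamination by checking n-gram overlap between training documents and benchmark items. A toy version of that general technique (this illustrates the idea only, and is not a description of Meta's pipeline):

```python
def ngrams(text: str, n: int = 8) -> set:
    """All n-word shingles in the text, lowercased (8-13 grams are typical)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, test_item: str, n: int = 8) -> bool:
    """Flag a benchmark item if it shares any n-gram with a training doc."""
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))

# Real pipelines hash the shingles and use Bloom filters or suffix arrays
# to scan billions of documents; the set intersection above is just the idea.
```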

In a post on X Monday, Meta's vice president in charge of generative AI, Ahmad Al-Dahle, addressed what he called "claims that we trained on test sets," saying, "that's simply not true and we would never do that."

The rumor, and the contamination allegations, seem to have had their basis, at least in part, in competitive claims made by Meta about Llama's achievements.

Meta notes in its press release that the Llama 4 models are competitive with top models such as OpenAI's GPT-4o "across a broad range of widely reported benchmarks." 

One of those benchmark achievements was Llama 4 Maverick's score on LMArena, a site where humans compare chatbots' responses head-to-head. Meta prominently mentioned the victory. 

And on X, when Llama 4 was first released on Saturday, LMArena representatives anointed Llama 4 Maverick the "number one open model, surpassing DeepSeek," referring to DeepSeek AI's "R1" large language model. 

The response to LMArena's endorsement on X was mixed, with some claiming they had seen far less impressive results on their own. 

The Llama 4 models, like their predecessors, are what is known as "open-weighted," meaning the neural nets' parameters can be downloaded, installed, and run by anyone. The two smaller models, Scout and Maverick, are currently available on Llama dot com and HuggingFace, while Behemoth is still in development and not yet available.
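
In practice, "open-weighted" means anyone can pull the checkpoints and run them locally, for example via Hugging Face's transformers library. A sketch of the usual pattern (the model ID below matches how Scout was listed on HuggingFace at release, but access is gated behind Meta's license, and a natively multimodal model may require newer model classes than the generic ones shown):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated repo: requires accepting Meta's license and a Hugging Face token.
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "In one sentence, what is model distillation?"
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=60)[0]))
```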

In his post on X, Al-Dahle conceded that since Llama 4's release on Saturday, the reports from those using the model have been mixed. 

"Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations," wrote Al-Dahle. (You can try out Llama 4's Maverick model on Meta AI's website.)

In a follow-up post on X on Monday evening, LMArena representatives wrote, "We've seen questions from the community about the latest release of Llama-4 on Arena," and offered further validation. 

Again, several responses were fairly harsh toward both Meta and LMArena, with some posts accusing LMArena of not knowing its own benchmarks.

This is a developing situation. It does seem a surprising black mark for Meta, whose reputation had improved with each new Llama release before this one.

The furor may say as much about the current environment for generative AI as about Meta itself. As Meta relates in its press release, Llama 4 was developed using several techniques that are suddenly important in the field. They include "mixture of experts," where only some parts of the large language model are active at any one time, making the model more efficient.
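
Concretely, a small router network picks just a few "expert" sub-networks to run for each token, so most of the model's parameters sit idle on any given forward pass. A minimal top-k routing sketch (the dimensions, expert count, and k are illustrative, not Llama 4's actual configuration):

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: a router picks k of E expert MLPs
    per token, so most parameters are unused on any single forward pass."""
    def __init__(self, dim=512, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        ) for _ in range(num_experts))
        self.k = k

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)       # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # run only the selected experts
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```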

While Meta does not attribute that technique to any one party, it is part of a broad category of AI approaches known collectively as "sparsity." The sparsity approach suddenly became popular with the spectacular response to DeepSeek AI's R1. 

This is a surprising shift, as the prior Llama release, 3.1, showed the Meta team leading the field in advancing the engineering of AI models.

And so, the controversy over performance may also indicate an emerging practice of taking sides in the "open-weight" realm of AI, where Meta, DeepSeek, and others are suddenly competing to see which group will be awarded the crown for highly technical, even abstruse, achievements.

[Screenshots: model descriptions on the Llama dot com homepage and the Llama 4 model card overview (Meta Platforms)]

One interesting detail is what seems to be a rather rare slip in Meta's product branding. On the Llama dot com site's main page, the company appears to have misdescribed its own models. Maverick is described as a "natively multimodal model that offers single H100 GPU efficiency and a 10M context window" and Scout as a "natively multimodal model for image and text understanding and fast responses at a low cost."

However, the developer documentation lists the two as exactly the opposite: Scout as a single-GPU-capable model with a context window of 10 million tokens, and Maverick as a non-single-GPU model.
