Leaderboard illusion: How big tech skewed AI rankings on Chatbot Arena
Friday, May 2, 2025, 12:51 PM, from ComputerWorld
A handful of dominant AI companies have been quietly manipulating one of the most influential public leaderboards for chatbot models, potentially distorting perceptions of model performance and undermining open competition, according to a new study.
The research, titled “The Leaderboard Illusion,” was published by a team of researchers from Cohere Labs, Stanford University, Princeton University, and other institutions. It scrutinized the operations of Chatbot Arena, a widely used public platform that lets users compare generative AI models by voting on pairs of model responses to the same prompt.

The study revealed that major tech firms, including Meta, Google, and OpenAI, were given privileged access to test multiple versions of their AI models privately on Chatbot Arena. By selectively publishing only the highest-performing versions, these companies were able to boost their rankings, the study found. “Chatbot Arena currently permits a small group of preferred providers to test multiple models privately and only submit the score of the final preferred version,” the study said.

Chatbot Arena, Google, Meta, and OpenAI did not respond to requests for comment on the study.

Private testing privilege skews rankings

Chatbot Arena, launched in 2023, has rapidly become the go-to public benchmark for evaluating generative AI models through pairwise human comparisons. However, the new study reveals systemic flaws that undermine its integrity, most notably the ability of select developers to conduct undisclosed private testing. Meta reportedly tested 27 separate large language model variants in a single month in the lead-up to its Llama 4 release, and Google and Amazon also submitted multiple hidden variants. In contrast, most smaller firms and academic labs submitted just one or two public models, unaware that such behind-the-scenes evaluation was possible.

This “best-of-N” submission strategy, the researchers argue, violates the statistical assumptions of the Bradley-Terry model, the algorithm Chatbot Arena uses to turn head-to-head votes into a ranking. In effect, the published score becomes the maximum of several noisy estimates rather than an unbiased measure of a single model’s strength.

To demonstrate the effect, the researchers ran their own experiments on Chatbot Arena. In one case, they submitted two identical checkpoints of the same model under different aliases. Despite being functionally the same, the two versions received significantly different scores, a discrepancy of 17 points on the leaderboard. In another case, they submitted two slightly different versions of the same model; the variant marginally better aligned with Chatbot Arena’s feedback dynamics outscored its sibling by nearly 40 points, with nine other models ranked between them.
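To make the selection effect concrete, the following is a minimal simulation sketch, illustrative only and not taken from the study. It assumes every privately tested variant is equally strong (a true 50% win rate against the rest of the arena), that each variant’s score is the Elo-style rating implied by its observed win rate over a fixed number of battles, and that the provider publishes only its best-scoring variant; the function names, battle count, and trial count are illustrative assumptions.

# Illustrative simulation (not the study's code): why publishing only the best of
# N privately tested variants inflates a Bradley-Terry/Elo-style leaderboard score
# even when every variant is equally good.
import math
import random
from statistics import mean

def observed_rating(true_win_prob, n_battles, rng):
    # Estimate a rating gap from a finite sample of pairwise battles.
    wins = sum(rng.random() < true_win_prob for _ in range(n_battles))
    p_hat = min(max(wins / n_battles, 1e-3), 1 - 1e-3)  # guard against 0% or 100%
    # Standard Elo/Bradley-Terry relation between win rate and rating difference.
    return 400 * math.log10(p_hat / (1 - p_hat))

def published_score(n_private_variants, n_battles, rng):
    # Test several equally strong variants privately; report only the best score.
    return max(observed_rating(0.5, n_battles, rng) for _ in range(n_private_variants))

rng = random.Random(0)
for n_variants in (1, 5, 27):  # 27 echoes the variant count reported for Meta
    scores = [published_score(n_variants, 300, rng) for _ in range(2000)]
    print(f"{n_variants:>2} private variants -> average published rating gap: {mean(scores):+.1f}")

Under these assumptions, a single honest submission averages a rating gap near zero, while the best of 27 equally strong variants comes out tens of rating points ahead purely through selection on noise, the same order of magnitude as the 17- and 40-point discrepancies the researchers observed.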
Disproportionate access to data

The leaderboard distortion isn’t just about testing privileges. The study also highlights stark imbalances in data access. Chatbot Arena collects user interactions and feedback during every model comparison, data that can be crucial for training and fine-tuning models, and proprietary LLM providers such as OpenAI and Google received a disproportionately large share of it. According to the study, OpenAI and Google received an estimated 19.2% and 20.4% of all Arena data, respectively. In contrast, 83 open-weight models together shared only 29.7% of the data (roughly 0.36% per model on average), and fully open-source models, which include many from academic and nonprofit organizations, collectively received just 8.8% of the total. This uneven distribution stems from preferential sampling rates, where proprietary models are shown to users more frequently, and from opaque deprecation practices.

The study also found that 205 out of 243 public models had been silently deprecated, meaning they were removed or sidelined from the platform without notification, and that open-source models were disproportionately affected. “Deprecation disproportionately impacts open-weight and open-source models, creating large asymmetries in data access over time,” the study stated. These dynamics not only favor the largest companies but also make it harder for new or smaller entrants to gather enough feedback data to improve or compete fairly.

Leaderboard scores don’t always reflect real-world capability

One of the study’s key findings is that access to Arena-specific data can significantly boost a model’s performance, but only within the confines of the leaderboard itself. In controlled experiments, the researchers trained models on mixtures containing different proportions of Chatbot Arena data. When 70% of the training data came from the Arena, the model’s win rate on ArenaHard, a benchmark set that mirrors the Arena’s prompt distribution, more than doubled, rising from 23.5% to 49.9%. However, this boost did not translate into gains on broader academic benchmarks such as Massive Multitask Language Understanding (MMLU), which measures knowledge acquired during pretraining across a wide range of subjects. In fact, MMLU results slightly declined, suggesting the models were being tuned narrowly to the Arena environment. “Leaderboard improvements driven by selective data and testing do not necessarily reflect broader advancements in model quality,” the study warned.

Call for transparency and reform

The study’s authors said these findings highlight a pressing need for reform in how public AI benchmarks are managed. They called for greater transparency, urging Chatbot Arena’s organizers to prohibit score retraction, limit the number of private variants tested, and ensure fair sampling rates across providers. They also recommend that the leaderboard maintain and publish a comprehensive log of deprecated models for clarity and accountability. “There is no reasonable scientific justification for allowing a handful of preferred providers to selectively disclose results,” the study added. “This skews Arena scores upwards and allows a handful of preferred providers to game the leaderboard.”

The researchers acknowledge that Chatbot Arena was launched with the best of intentions: to provide a dynamic, community-driven benchmark during a time of rapid AI development. But they argue that successive policy choices and growing pressure from commercial interests have compromised its neutrality. While Chatbot Arena’s organizers have previously acknowledged the need for better governance, including in a blog post published in late 2024, the study suggests that current efforts fall short of addressing the systemic bias.

What does it mean for the AI industry?

The revelations come at a time when generative AI models play an increasingly central role in business, government, and society. Organizations evaluating AI systems for deployment, from chatbots and customer support to code generation and document analysis, often rely on public benchmarks to guide purchasing and adoption decisions. If those benchmarks are compromised, so too is the decision-making that depends on them.
The researchers warn that the perception of model superiority based on Arena rankings may be misleading, especially when top placements are influenced more by internal access and tactical disclosure than actual innovation. “A distorted scoreboard doesn’t just mislead developers,” the study noted. “It misleads everyone betting on the future of AI.”
https://www.computerworld.com/article/3976355/leaderboard-illusion-how-big-tech-skewed-ai-rankings-o...