Hugging Face Upgrades Open LLM Leaderboard v2 for Enhanced AI Model Comparison

Oct 10, 2024 3 min read
by Vinod Goje
Hugging Face has recently released Open LLM Leaderboard v2, an upgraded version of their popular benchmarking platform for large language models.
Hugging Face created the Open LLM Leaderboard to provide a standardized evaluation setup for reference models, ensuring reproducible and comparable results.
The leaderboard serves multiple purposes for the AI community. It helps researchers and practitioners identify state-of-the-art open-source releases by providing reproducible scores that separate marketing claims from actual progress. It allows teams to evaluate their work, whether in pre-training or fine-tuning, by comparing methods openly against the best existing models. Additionally, it provides a platform for earning public recognition for advancements in LLM development.
The Open LLM Leaderboard has become a widely used resource in the machine learning community since its inception a year ago. According to Hugging Face, the leaderboard has been visited by over 2 million unique users in the past 10 months, with around 300,000 community members actively collaborating on it monthly.
Open LLM Leaderboard v2 addresses limitations in the original version and keeps pace with rapid advancements in the open-source LLM field.
InfoQ spoke to Alina Lozovskaia, one of the Leaderboard maintainers at Hugging Face, to learn more about the motivation behind this update and its implications for the AI community.
InfoQ: You've changed the model ranking to use normalized scores, where random performance is 0 points and max score is 100 points, before averaging. How does this normalization method impact the relative weighting of each benchmark in the final score compared to just averaging raw scores?
Alina Lozovskaia: By normalizing each benchmark's scores to a scale where random performance is 0 and perfect performance is 100 before averaging, the relative weighting of each benchmark in the final score is adjusted based on how much a model's performance exceeds random chance. This method gives more weight to benchmarks where models perform close to random (harder benchmarks), highlighting small improvements over chance.
Conversely, benchmarks where models already score high in raw terms contribute proportionally less after normalization. As a result, the normalized averaging ensures that each benchmark influences the final score according to how much a model's performance surpasses mere guessing, leading to a fairer and more balanced overall ranking compared to simply averaging raw scores.
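The normalization described in this answer can be illustrated with a short Python sketch. It is not the Leaderboard's actual code: the function names, the two benchmark names, and their random baselines (0.25 for a four-choice task, 0.0 for a free-form task) are hypothetical, and clamping below-baseline scores to 0 is an assumption made here.

```python
def normalize_score(raw, random_baseline, max_score=1.0):
    """Rescale a raw score so random_baseline maps to 0 and max_score to 100."""
    if max_score <= random_baseline:
        raise ValueError("max_score must exceed random_baseline")
    scaled = (raw - random_baseline) / (max_score - random_baseline) * 100.0
    # Assumption: a score below random chance counts as 0, not as negative.
    return max(0.0, scaled)


def normalized_average(raw_scores, baselines):
    """Average the per-benchmark normalized scores."""
    normalized = [
        normalize_score(raw_scores[name], baselines[name]) for name in raw_scores
    ]
    return sum(normalized) / len(normalized)


# Hypothetical benchmarks: a four-choice task and a free-form task.
baselines = {"four_choice": 0.25, "free_form": 0.0}
raw_scores = {"four_choice": 0.40, "free_form": 0.40}

raw_average = 100.0 * sum(raw_scores.values()) / len(raw_scores)
print(round(raw_average, 2))                                # raw average
print(round(normalized_average(raw_scores, baselines), 2))  # normalized average
```

With these made-up inputs, the same raw score of 0.40 is worth about 20 points on the four-choice task and about 40 points on the free-form task, so the normalized average falls below the raw average.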
InfoQ: Benchmark data contamination has been an issue, with some models accidentally trained on data from TruthfulQA or GSM8K. What technical approaches are you taking to mitigate this for the new benchmarks? For example, are there ways to algorithmically detect potential contamination in model outputs?
Lozovskaia: In general, contamination detection is an active, but very recent research area: for example, the first workshop specifically on this topic occurred only this year, at ACL 2024 (it was the CONDA workshop that we sponsored). Since the field is very new, no algorithmic method is well established yet. We're therefore exploring emerging techniques (such as analyzing the likelihood of model outputs against uncontaminated references) though no method is without strong limitations at the moment.
We're also internally testing hypotheses to detect contamination specific to our Leaderboard and hope to share progress soon. We are also very thankful to our community, as we have also benefited a lot from their vigilance (users are always very quick to flag models with suspicious performance/likely contamination).
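As a rough illustration of the likelihood-based idea mentioned in this answer, the Python sketch below compares the average per-token log-likelihood a model assigns to benchmark samples with the one it assigns to fresh reference text. It is not the Leaderboard team's method: the function names, the threshold, and the log-probability values are all made up for illustration, and in practice the log-probabilities would come from the model under test.

```python
from statistics import mean


def likelihood_gap(benchmark_logprobs, reference_logprobs):
    """Mean per-token log-likelihood on benchmark text minus that on reference text.

    Each argument is a list of samples; each sample is a list of per-token
    log-probabilities assigned by the model being checked.
    """
    bench = mean(mean(sample) for sample in benchmark_logprobs)
    ref = mean(mean(sample) for sample in reference_logprobs)
    return bench - ref


def looks_suspicious(benchmark_logprobs, reference_logprobs, threshold=1.0):
    """Flag a model whose benchmark text is much more likely than reference text.

    The threshold is a hypothetical value; a real check would need calibration
    and would still carry the strong limitations noted in the interview.
    """
    return likelihood_gap(benchmark_logprobs, reference_logprobs) > threshold


# Illustrative, made-up log-probabilities (not measurements from any model).
benchmark = [[-0.1, -0.2], [-0.3, -0.2]]
reference = [[-2.0, -1.0], [-1.5, -1.5]]
print(looks_suspicious(benchmark, reference))
```

A large gap only suggests that the benchmark text is unusually familiar to the model. It does not prove contamination, since benchmark text can simply be easier to predict than the reference text.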
InfoQ: The MuSR benchmark seems to favor models with context window sizes of 10k tokens or higher. Do you anticipate a significant shift in LLM development towards this type of task?
Lozovskaia: There has been a recent trend towards extending the context length that LLMs can parse accurately, and improvements in this domain are going to be more and more important for a lot of business applications (extracting contents from several pages of documents, summarizing, accurately answering long discussions with users, etc.).
We therefore have seen, and expect to see, more and more models with this long-context capability. However, general LLM development will likely balance this with other priorities like efficiency, task versatility, and performance on shorter-context tasks. One of the advantages of open-source models is that they allow everybody to get high performance on the specific use cases they need.
For those interested in exploring the world of large language models and their applications further, InfoQ offers "Large Language Models for Code," presented by Loubna Ben Allal at QCon London. Additionally, our AI, ML, and Data Engineering Trends Report for 2024 provides a comprehensive overview of the latest developments in the field.


source

Hugging Face Upgrades Open LLM Leaderboard v2 for Enhanced AI Model Comparison – infoq.com
