Jun 13, 2025

Yupp AI VIBE Score and Leaderboard (Beta)

A consumer-centric approach to robust and trustworthy AI evaluation

Jimmy Lin, Chief Scientist and Professor, University of Waterloo
Gilad Mishne, Co-founder and AI Lead
Pankaj Gupta, Co-founder and CEO

The three of us met at Twitter circa 2010, where we built large-scale consumer products using machine learning (ML).¹ There we launched and scaled the first iterations of products like Twitter search,² trends,³ and recommendations⁴ to hundreds of millions of users around the world. Twitter aimed to be the global town square, and our experience taught us the power of a diverse multitude of voices capturing the pulse of the planet.

Before Twitter, Jimmy had been working since the late 1990s on conversational agents⁵ and on rigorous academic information retrieval evaluations such as the Text REtrieval Conference (TREC), organized by the U.S. National Institute of Standards and Technology (NIST).⁶ For ML and product evaluation at Twitter, we utilized crowdsourced raters⁷ along with controlled experimentation via A/B tests in both pre-launch and production environments.⁸

Since Twitter, we have launched more consumer products to similar scales at Google and Coinbase. A common lesson? The best approach to accurately evaluate and improve a product is to look at how it’s being used by a large number of regular consumers across different regions of the world, as their natural behavior provides a reliable indicator of the product’s quality and appeal.

Today, Yupp hopes to bring this lesson to the task of evaluating modern AIs.

More specifically, we’re tackling the challenge of robust and trustworthy AI evaluation by bringing together insights from both industry and academia. We believe that the best way to achieve this is by building a compelling consumer AI product, which we announced today in our companion blog post launching yupp.ai. In this blog post, we are excited to announce the Yupp AI Leaderboard (Beta).

The Challenge of Robust and Trustworthy Evaluation

“To measure is to know.” AI researchers and developers want to know how good their models or systems are, because only then can they be improved. Thus, the problem of evaluation is foundational to the very progress of the field. While traditionally this has been called “model eval”, today it is much more: beyond large language models (LLMs), there are powerful AI systems and agents that engage in multiple input and output modalities while taking actions in the world. 

There are two broad approaches to evaluation, dating back to the earliest days of machine learning: benchmark datasets and assessments by human raters.

Benchmark Datasets, such as AIME, MMLU-Pro, or HLE, are standardized collections of tasks and evaluation criteria used to systematically compare models. The biggest challenge with benchmarks is their static nature and inability to capture real-world use cases: When was the last time you posed a graduate-level question about chemistry to an AI? Furthermore, as AIs evolve to gain new capabilities, static benchmarks quickly become stale and performance saturates. Add to that the possibility of (inadvertent) data contamination and overfitting, and it is easy to see why benchmarks capture only a limited aspect of model capabilities.

Human Raters, on the other hand, are perceived as the “gold standard” of evaluation in machine learning because they capture what users actually want. Side-by-side comparisons of search results date back nearly two decades⁹ and crowdsourced human judgments have been used for ranking systems for about as long.¹⁰ Human evaluation has traditionally been slow and expensive, thus limiting its scale. 

We aim to tackle these shortcomings at Yupp, starting from the first principles of how to achieve robust and trustworthy evaluation:

Robust means 

  • representative, capturing diverse real-life use cases

  • realistic, reflecting what users actually care about

  • resistant to gaming, spam, noise, and other adversarial actions

Trustworthy means

  • fair and neutral, not biased in favor of any AI

  • transparent, in providing details on how rankings are computed

  • rigorous, in adhering to well-known scientific principles

These are guiding principles that we’ve taken to heart.

Introducing the Yupp AI VIBE Score and Leaderboard (Beta)

Powered by a consumer product that we launched today, we are sharing a first look at the beta version of our AI leaderboard, ranking AIs by what we simply call the Yupp VIBE (Vibe Intelligence BEnchmark) Score, aggregated from the preferences of users across the world interacting naturally with Yupp. The VIBE Score conveys the preferences of diverse, globally distributed users as they check the “vibe” of AIs for their own everyday use.

Our consumer-centric approach for scale

Based on our experience at Twitter and subsequent consumer Internet companies, we believe that robust and trustworthy evaluation requires scale, and that is best accomplished as a side effect of a great consumer experience. To that end, Yupp has a number of innovative product features worth highlighting.

User Privacy and Profiles
Nobody wants the world to see their private interactions with AIs. Some eval platforms force you to give up privacy, but Yupp doesn’t: chats are private by default, and users can explicitly choose to make their interactions public. Private prompts still contribute to our leaderboard, because preference signals can be aggregated in a privacy-preserving manner.

We believe that our focus on privacy allows users to engage in more natural interactions. Users are also encouraged to build a profile that includes their age group, level of education, occupation, etc. The more profile information users share with Yupp, the better we can select the best AIs to meet their needs while they use the consumer product.

Rich Preferences
Unlike some other eval platforms, we’re not just gathering binary preferences – users can, and do, tell us what they like or dislike about the AI responses they encounter. Are they interesting? Vague? Are there issues with style or hallucinations? Users are also encouraged to provide freeform feedback.
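
To make the shape of this feedback concrete, here is a minimal sketch of what a rich preference record might look like; the field names and tags are illustrative assumptions, not Yupp’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PreferenceRecord:
    """Illustrative shape of a rich preference signal.

    All field names and tag values are assumptions for this sketch,
    not Yupp's actual schema.
    """
    prompt_id: str
    model_a: str
    model_b: str
    winner: str  # "model_a", "model_b", or "tie"
    liked_tags: list[str] = field(default_factory=list)     # e.g. ["concise", "well-formatted"]
    disliked_tags: list[str] = field(default_factory=list)  # e.g. ["vague", "hallucination"]
    freeform_comment: Optional[str] = None

# Example of a single record combining a binary choice, tags, and freeform feedback.
record = PreferenceRecord(
    prompt_id="p-123",
    model_a="model-x",
    model_b="model-y",
    winner="model_b",
    disliked_tags=["vague"],
    freeform_comment="The first answer skipped half of my question.",
)
```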


Community-Aligned Incentive Mechanisms
Yupp provides free access to the latest AIs, but usage is metered via Yupp credits, which users earn by providing feedback, creating a virtuous cycle that drives further usage. We have also added the ability to convert some of those credits to money. This might feel odd, but here we are leveraging our previous experience in designing incentives for engaging consumer products. We believe that it is possible to provide small tokens of appreciation without distorting incentives. There is a large literature in behavioral economics to draw from here, and the design of incentives remains one of our active areas of research.

A leaderboard that delivers unique insights

The innovative product features of Yupp come together to create a leaderboard that delivers insights not possible with other platforms today.

Currently, the VIBE Score is similar to Elo-like scores (Bradley-Terry to be precise) used in chess tournaments and similar setups. But beyond a simple score, we already know much more about the models: For example, the most frequent complaint expressed by our users about one top model is its speed. We are experimenting with scoring algorithms that can incorporate these rich features and will share results from them with the community.
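
For readers unfamiliar with Bradley-Terry, here is a minimal sketch that fits model strengths from pairwise outcomes using the classic iterative update. It illustrates the general technique only (ignoring ties, bias corrections, and the richer signals described above) and is not Yupp’s actual scoring pipeline.

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=200):
    """Fit Bradley-Terry strengths from pairwise outcomes.

    comparisons: list of (winner, loser) model-name pairs.
    Returns a dict mapping model -> strength, normalized to sum to 1.
    """
    wins = defaultdict(float)          # total wins per model
    pair_counts = defaultdict(float)   # number of comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strengths = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # Classic minorization-maximization update for Bradley-Terry.
            denom = sum(
                pair_counts[frozenset((i, j))] / (strengths[i] + strengths[j])
                for j in models
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denom if denom > 0 else strengths[i]
        total = sum(updated.values())
        strengths = {m: s / total for m, s in updated.items()}
    return strengths

# Toy example: model-a wins most often, model-b and model-c trade wins.
scores = bradley_terry([
    ("model-a", "model-b"), ("model-a", "model-c"),
    ("model-b", "model-c"), ("model-a", "model-b"),
    ("model-c", "model-b"),
])
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```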

Coupled with rich preference data, user profile attributes provide the ability to segment users in a fine-grained manner. We have also built similarly sophisticated analytics on user prompts. This will be detailed in a forthcoming blog post – but as a preview, we are able to situate prompts along three dimensions: an intent category (such as fact seeking), topical tags (such as world history), and properties (such as being in French). Combining these analytical dimensions with rich feedback data, Yupp provides the ability to slice and dice evaluation data in ways not possible before. For example, our leaderboard currently shows that younger users prefer different models than the user population as a whole! Starting from this observation, AI developers can use our leaderboard to drill down into a sample of public prompts to understand why.
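
As a simplified sketch of this kind of slicing, the snippet below filters preference records by a profile attribute and recomputes per-segment win rates. The attribute names, tags, and records are invented for illustration and are not Yupp’s actual taxonomy or data.

```python
from collections import defaultdict

# Hypothetical records: (age_group, intent, topic_tags, winner, loser).
# All values are invented for this sketch.
records = [
    ("18-24", "fact seeking", ["world history"], "model-a", "model-b"),
    ("18-24", "creative writing", ["poetry"], "model-a", "model-b"),
    ("45-54", "fact seeking", ["world history"], "model-b", "model-a"),
    ("45-54", "coding", ["python"], "model-b", "model-a"),
]

def win_rates(recs):
    """Per-model win rate within one slice of preference records."""
    wins, totals = defaultdict(int), defaultdict(int)
    for _, _, _, winner, loser in recs:
        wins[winner] += 1
        totals[winner] += 1
        totals[loser] += 1
    return {m: round(wins[m] / totals[m], 2) for m in totals}

# Slice by age group: the same models can rank differently per segment.
for group in ("18-24", "45-54"):
    segment = [r for r in records if r[0] == group]
    print(group, win_rates(segment))
```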

Building Yupp on the pillars of rigorous research

Yupp introduces a number of unique product features, but how do we know they will help us to deliver robust and trustworthy evaluations? We have designed Yupp based on previous experiences and our commitment to a scientifically rigorous approach. To that end, we are currently exploring a number of topics:

Examining the impact of blinded results
Tools designed to collect preferences often hide identifying features in an attempt to reduce bias. We think that hiding model names might not be necessary or even effective in a consumer product at scale. For AI enthusiasts, models are stylistically distinct and can be easily uncovered, so blinding is ineffective. For most consumers, we feel that it doesn’t matter: the majority of users from our informal surveys don’t understand (or care about) the difference between, for example, Claude and Gemini. Nevertheless, we are running experiments to test this hypothesis through A/B testing. We’ll share the results of this experiment down the road!

Identifying and correcting biases
We are systematically investigating and correcting biases beyond non-blinded results, for example, response position (on the left or on the right), formatting, speed, and more. We’re utilizing mechanisms developed for similar setups to correct for such biases, drawing inspiration from a vast literature on search ranking and beyond.
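
As one deliberately simple illustration of the kind of mitigation and diagnostic involved, the sketch below randomizes which side each model appears on and monitors the aggregate left-slot win rate: with randomized sides that rate should hover near 50%, so a persistent deviation flags residual position bias. This is a sketch of the general idea, not Yupp’s actual debiasing machinery.

```python
import random
from collections import Counter

def serve_pair(model_a, model_b):
    """Randomize which model appears on the left so that, across many
    comparisons, position effects average out for every model."""
    return (model_a, model_b) if random.random() < 0.5 else (model_b, model_a)

def left_win_rate(slot_outcomes):
    """slot_outcomes: list of "left"/"right" strings recording which slot won.
    With randomized sides this should be close to 0.5; a persistent
    deviation indicates a position bias to model and correct for."""
    counts = Counter(slot_outcomes)
    total = counts["left"] + counts["right"]
    return counts["left"] / total if total else 0.0

# At serving time, flip a coin for side assignment.
left, right = serve_pair("model-x", "model-y")

# Toy simulation: users favor the left slot 55% of the time regardless of model.
outcomes = ["left" if random.random() < 0.55 else "right" for _ in range(10_000)]
print(f"observed left-win rate: {left_win_rate(outcomes):.3f}")
```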

Eliminating bad actors
Crowdsourced platforms naturally attract bad actors who engage in adversarial behavior. Leveraging our experience in tackling the problem of spam and bots at Twitter, we’ve developed sophisticated algorithms to discard low-quality data, ensuring the integrity of our rankings. We have also built out a dedicated Trust and Safety team and continue to invest significantly in this area. 

Comparison to professional testers
In addition, we are running experiments to explore how various product features interact. For example, do private and public prompts yield different leaderboard rankings? Is there any misalignment in incentive mechanisms that distorts user preferences? As part of these efforts, we have invested significant resources to commission raters on Yupp, managed by a well-known AI data provider. These raters have validated user profiles and go through multiple layers of quality testing, giving us a reference point for calibration.

We are just coming out of stealth today, but we’re already gathering a wealth of diverse, rich preference feedback data from users all over the world.

How is Yupp different?

We are guided by two approaches that we believe are unique:

First, as described above, we seek to build a compelling consumer product. From our experiences at Twitter, Google, and beyond, we believe that scale will lead to diverse and rich interactions that are representative of real-world AI use around the world. This provides the starting point for robust and trustworthy evaluation.

Second, we will enshrine principles of fairness and transparency in technical solutions. It’s not sufficient to proclaim principles in blog posts or create policies, as both rely on good faith. We seek to construct systems with provable properties of credible neutrality, fairness, and robustness. For this, we’ll be leveraging open access and permissionless technologies like blockchains, cryptographic primitives and protocols like zero-knowledge proofs and challenge/response mechanisms, and privacy-preserving technologies like confidential computing. These technologies form building blocks that we will adapt to achieve our goals. 

As a specific example, we have been thinking about how we can provide equitable access to all AI developers, from those at frontier labs pushing the state of the art to resource-limited graduate students who are also training and fine-tuning models. To this end, we will soon ship a novel feature: permissionless model evals. Using this feature, anyone (including students and AI hobbyists) can submit an AI to Yupp. We’ll orchestrate comparative evaluations on the user's behalf and then return feedback on how the AI stacked up against the others!

There are many more interesting questions we seek to explore. A few examples:

  • How do we ensure fairness and transparency in a scientifically rigorous manner?

  • How do we demonstrate adherence to stated principles in a provable manner?

  • What is the right way to share data while respecting the privacy of users?

  • How do we design incentive mechanisms to achieve large-scale unbiased evaluation data while being resistant to spam, gaming, and other adversarial actions?

We are just getting started, and we wish to engage the community in collaborations that will guide us to better evaluations. If you’re interested in working on these problems with us, drop us a note at research@yupp.ai and let’s talk! Stay tuned for more details in the coming months via our Twitter/X account.

Try Yupp today or check out our leaderboard!

Footnotes

¹ We’ve written about our experiences with large-scale machine learning at Twitter and shared lessons on scaling data mining.

² Earlybird was the name of the real-time Tweet search engine, to which we later introduced architectural innovations to better handle velocity.

³ We've explored simple smoothing techniques for topic tracking.

⁴ WTF (“Who to Follow”) is Twitter’s user recommendation product, responsible for creating billions of connections between users based on shared interests, common connections, and other related factors.

⁵ Jimmy spent a large part of his Ph.D. at MIT working on question answering systems and conversational interfaces.

⁶ Jimmy has been a fixture in academic IR evaluations dating back to 2001.

⁷ We were early users of Mechanical Turk and CrowdFlower (now Figure Eight). Fun fact: the CrowdFlower offices were down the street from ours.

⁸ In keeping with the bird tradition in the early days of Twitter, our A/B testing framework was called Duck-Duck-Goose (DDG). We’re quite proud of that name.

⁹ Well-documented instances of side-by-side comparisons in information retrieval date back nearly two decades (see this 2006 example and this interface from 2008); here’s a 2012 video of Google explaining side-by-side comparisons in search.

¹⁰ Crowdsourced evals were used to rank machine translation systems as early as 2006; the idea expanded in scale and scope with the Netflix Prize (roughly around the same time). The Alexa Prize (circa 2017) represents a more modern iteration of using crowdsourced evals to build leaderboards and rankings.