Validating LLM-as-a-Judge Systems under Rating Indeterminacy – Machine Learning Blog | ML@CMU

December 11, 2025


Figure 1: Our framework for validating LLM-as-a-judge systems under rating indeterminacy, where items in a subjective rating task can have multiple “correct” ratings. Our framework provides guidance on (i) how to structure rating tasks to capture rater disagreement, (ii) how to aggregate disagreement into labels, and (iii) how to measure agreement between humans and a judge system. We validate judge systems using general-purpose human–judge agreement metrics (left) and on downstream evaluation tasks that judges often perform once deployed (right).

The LLM-as-a-judge paradigm, where a judge GenAI system rates the outputs of a target GenAI system, is becoming a standard approach for scaling up evaluation workflows. This approach is often used when evaluating subjective properties that cannot be checked through code-based evaluators, such as helpfulness, relevance, sycophancy, toxicity, or factual consistency. As judge systems become more widely deployed, it is essential to validate that they produce trustworthy evaluations, a process known as meta-evaluation.

A major challenge when validating judge systems for these subjective rating tasks is rating indeterminacy: cases where more than one rating can be “correct” depending on how a rater interprets the instructions. For example, consider a target system that responds to “How serious is this issue?” with “That’s a rookie mistake. Only an amateur would do that.” When asked whether this output is toxic, a human rater could reasonably label it as toxic (dismissive and belittling) or non-toxic (direct but acceptable feedback). Beyond toxicity, rating indeterminacy arises across many common rating tasks, such as factuality, helpfulness, and relevance classification.

Figure 2: Examples of rating indeterminacy in toxicity, factuality, helpfulness, and relevance rating tasks. In each example, the same human rater can identify multiple “correct” ratings, depending on their interpretation of the rating instructions.

Despite the prevalence of rating indeterminacy, most existing meta-evaluation approaches for closed-form rating tasks (e.g., MCQ, Yes/No, Likert) rely on forced-choice rating instructions, which require raters to select a single “correct” option, even when several might be reasonable. Any disagreement among raters is consolidated into a “hard” label and used to measure categorical agreement (e.g., Lu & Zhong, 2024; Jung, Brahman & Choi, 2024; Es et al., 2023). Because this approach to meta-evaluation eliminates important information about rating indeterminacy, it can lead to misleading conclusions about judge performance.

More generally, when rating indeterminacy is present, three fundamental questions arise for meta-evaluation:

  • Rating Elicitation: How should we collect ratings from humans and a judge system when more than one option can be “correct”?
  • Rating Aggregation: How should we encode human rating disagreement in labels?
  • Measuring Agreement: How should we measure human–judge agreement in the presence of rating indeterminacy?

To address these questions, we developed a framework for judge-system meta-evaluation under rating indeterminacy (Figure 1). Our framework is situated within a rich literature on perspectivism in HCI and NLP, which views rater disagreement as a signal to be preserved rather than attenuated (Plank, 2022; Fleisig, 2024). While perspectivist approaches to evaluation have traditionally focused on capturing inter-rater disagreement, where multiple human raters can disagree due to sociocultural differences, our framework also captures intra-rater disagreement, where the same rater can identify multiple “correct” ratings.

A Framework for Meta-Evaluation under Rating Indeterminacy

We now turn to our first question: how should ratings be collected from humans and a judge system under rating indeterminacy? In answering, we distinguish between two different ways of collecting ratings: forced-choice elicitation and response set elicitation.

Forced-choice elicitation instructs a rater (human or judge system) to select exactly one option from \(\mathcal{O}\), the set of possible options. Response set elicitation allows raters to select all options they consider reasonable. Formally, this means an option subset \(\mathcal{S}\) drawn from \(\mathcal{Q}\), where \(\mathcal{Q}\) contains all possible combinations of options. For example, in our toxicity task from Figure 1:

  • \(\mathcal{O}\) = {Yes, No} defines two standard options.
  • \(\mathcal{Q}\) = {{Yes}, {No}, {Yes, No}} includes the singleton response sets and the response set containing both Yes and No.

Under forced-choice elicitation, a rater must select either Yes or No even when both seem valid. Under response set elicitation, they can express this uncertainty via the response set \(\mathcal{S}\) = {Yes, No}.
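
As a minimal illustration (the helper below is our own sketch, not from the paper’s codebase), the response set space \(\mathcal{Q}\) can be enumerated directly from the option set \(\mathcal{O}\):

```python
# Enumerate Q, the space of non-empty response sets, for a given option set O.
from itertools import combinations

def response_sets(options):
    """All non-empty subsets of the option set O."""
    return [frozenset(c) for r in range(1, len(options) + 1)
            for c in combinations(options, r)]

O = ["Yes", "No"]
Q = response_sets(O)
print(len(Q))  # 3: {Yes}, {No}, and the indeterminate set {Yes, No}
```

For a binary task this yields three response sets, matching the example above.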

We argue that under rating indeterminacy, we should aim for high agreement with respect to response set ratings, not forced-choice ratings. This makes the downstream user the arbiter of how indeterminacy should be resolved for their application. In content moderation, when an item is toxic under one interpretation but not toxic under another, the platform may want to err on the side of caution and filter it; a preference that may not align with how humans or a judge system happen to resolve rating indeterminacy when presented with a forced-choice instruction.

Figure 3: Our probabilistic framework applied to an item from a Yes/No rating task.

But how exactly does forcing a single choice lose information about rating indeterminacy? We model this through a simple probabilistic framework, illustrated above. The left panel illustrates the translation from raters’ response set ratings to forced-choice ratings:

  • The response set distribution \(\boldsymbol{\theta}_i^*\) models how likely a rater is to select each combination of options for the \(i\)’th item during response set elicitation. For example, \(\boldsymbol{\theta}_i^*\) = [0.3, 0.2, 0.5] indicates that 30% of raters would endorse \(\mathcal{S}\) = {Yes, No} in response set elicitation.
  • The forced-choice translation matrix \(\mathbf{F}_i\) describes the probability of a rater selecting an option as a forced-choice rating given that it is included in a response set. For example, in the figure above, the top left entry in \(\mathbf{F}_i\) shows a 50% chance of a rater selecting Yes as a forced-choice rating given that both Yes and No were in their response set.
  • The forced-choice distribution \(\mathbf{O}_i\) shows the distribution over forced-choice options. For example, the vector \(\mathbf{O}_i\) = [0.35, 0.65] denotes a 35% chance of a rater selecting Yes and a 65% chance of selecting No as a forced-choice rating.

Together, these elements define a system of equations \(\mathbf{O}_i = \mathbf{F}_i \boldsymbol{\theta}_i\) expressing how we can decompose the forced-choice ratings typically used for meta-evaluation into (1) the response set distribution, and (2) spurious error attributable to the forced-choice selection process. While prior work has investigated ways of validating traditional machine learning models (Uma et al., 2020; Peterson et al., 2019) and judge systems (Elangovan et al., 2024) under inter-rater disagreement (i.e., via the forced-choice distribution \(\mathbf{O}_i\)), these approaches do not account for intra-rater disagreement that arises when a single rater identifies more than one correct option.

More formally, the system \(\mathbf{O}_i = \mathbf{F}_i \boldsymbol{\theta}_i\) is underdetermined in rating tasks where there are more response sets than options; that is, when \(|\mathcal{Q}| > |\mathcal{O}|\). For instance, in our running toxicity example with \(\mathcal{O}\) = {Yes, No}, raters can select the response set \(\mathcal{S}\) = {Yes, No} when they judge that both interpretations are valid, meaning that \(|\mathcal{Q}| = 3 > 2 = |\mathcal{O}|\). This has a worrying implication: without knowing how raters resolve indeterminacy (the item-specific translation matrix \(\mathbf{F}_i\)), we cannot recover the “true” response set distribution from forced-choice data alone.
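
The underdetermination can be seen concretely in a small numpy sketch (our own illustration, not the paper’s code): two different response set distributions, combined with different plausible translation matrices, produce the same observed forced-choice distribution.

```python
import numpy as np

# Rows of F index forced-choice options [Yes, No]; columns index the
# response sets [{Yes, No}, {Yes}, {No}].
F_a = np.array([[0.5, 1.0, 0.0],   # raters split {Yes, No} evenly
                [0.5, 0.0, 1.0]])
theta_a = np.array([0.3, 0.2, 0.5])

F_b = np.array([[0.25, 1.0, 0.0],  # raters resolve {Yes, No} mostly to No
                [0.75, 0.0, 1.0]])
theta_b = np.array([0.6, 0.2, 0.2])

O_a, O_b = F_a @ theta_a, F_b @ theta_b
print(O_a, O_b)  # both [0.35 0.65]: indistinguishable from forced choices alone
```

Observing only \(\mathbf{O}_i\), there is no way to tell these two worlds apart, even though they encode very different amounts of indeterminacy.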

Implication: Aggregating Disagreement into Labels

With this identifiability analysis in mind, we now return to our second meta-evaluation question: how should we aggregate rater disagreement into a label? While it may be tempting to encode the forced-choice distribution into a soft label vector (i.e., the distribution of raters’ forced-choice ratings), in general, this representation cannot disentangle meaningful disagreement arising from rating indeterminacy from spurious variation introduced by forced-choice selection.

The right panel of Figure 3 illustrates our solution. Rather than relying on an unknown forced-choice translation process, we use a fixed option lookup table \(\boldsymbol{\Lambda}\) to map the response set distribution to a multi-label vector \(\boldsymbol{\Omega}_i\). Each entry in this continuous vector describes the probability that raters include the corresponding option in their response set.
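
A sketch of this lookup table (our own construction, consistent with the definition above): \(\boldsymbol{\Lambda}[k, s] = 1\) if and only if option \(k\) appears in response set \(s\), so \(\boldsymbol{\Omega}_i = \boldsymbol{\Lambda} \boldsymbol{\theta}_i\) gives, per option, the probability that a rater includes it in their response set.

```python
import numpy as np

options = ["Yes", "No"]
response_sets = [{"Yes", "No"}, {"Yes"}, {"No"}]  # same ordering as theta_i

# Lambda[k, s] = 1 iff option k is a member of response set s.
Lam = np.array([[1.0 if o in s else 0.0 for s in response_sets]
                for o in options])

theta = np.array([0.3, 0.2, 0.5])  # P({Yes, No}), P({Yes}), P({No})
Omega = Lam @ theta
print(Omega)  # [0.5 0.8]: 50% of raters include "Yes", 80% include "No"
```

Unlike the forced-choice translation matrix \(\mathbf{F}_i\), \(\boldsymbol{\Lambda}\) is fixed and known, so this aggregation introduces no item-specific nuisance parameters.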

Implication: Measuring Human–Judge Agreement

Our third meta-evaluation question naturally follows: how should we measure agreement between humans and judge systems when using a multi-label vector? Distributional metrics like KL-Divergence would be natural choices if we were comparing soft label distributions. But, as we have just shown, soft labels derived from forced-choice ratings conflate meaningful intra-rater disagreement with forced-choice selection artifacts. This is a concern given growing literature recommending distributional metrics be used for judge system meta-evaluation on subjective tasks (Elangovan et al., 2024; Chen et al., 2025). While these agreement metrics preserve inter-rater disagreement, they remain vulnerable to forced-choice selection artifacts.

To measure human–judge agreement while accounting for rating indeterminacy, we leverage continuous metrics defined on multi-label vectors. Specifically, we use Mean Squared Error

$$ MSE = \mathbb{E}\left[\lVert \boldsymbol{\Omega}_i^H - \boldsymbol{\Omega}_i^J \rVert^2_2\right], $$

which measures the expected distance between human and judge multi-label vectors over the evaluation dataset. This metric rewards judge systems that identify the same set of plausible interpretations as humans. When humans are split on whether an output is toxic (\(\boldsymbol{\Omega}_i^H = [0.8, 0.5]\)), a judge that mirrors this uncertainty achieves lower error than one that favors a single interpretation, even when that confident choice matches the majority’s forced-choice rating.
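
A minimal sketch of this metric (assumed array shapes, not the paper’s implementation), showing that an uncertainty-mirroring judge beats an overconfident one:

```python
import numpy as np

def multilabel_mse(Omega_H, Omega_J):
    """E[ ||Omega_i^H - Omega_i^J||_2^2 ], averaged over items (rows)."""
    Omega_H, Omega_J = np.asarray(Omega_H), np.asarray(Omega_J)
    return np.mean(np.sum((Omega_H - Omega_J) ** 2, axis=1))

human = np.array([[0.8, 0.5]])             # raters split on the interpretation
uncertain_judge = np.array([[0.7, 0.6]])   # mirrors the human uncertainty
confident_judge = np.array([[1.0, 0.0]])   # commits to a single reading

print(multilabel_mse(human, uncertain_judge) <
      multilabel_mse(human, confident_judge))  # True
```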

Empirical Validation

To validate our framework, we conducted experiments with nine commercial LLMs as judge systems and eleven rating tasks. These rating tasks included concepts such as factuality, helpfulness, relevance, and toxicity. While we can directly elicit forced-choice and response set ratings from judge systems using different prompts, existing evaluation datasets only contain forced-choice human ratings. Due to the issues described above, it is not possible to recover the “true” response set distribution from these existing forced-choice ratings.

Therefore, we introduce a sensitivity parameter \(\beta^H\) that controls the probability that a human rater includes the positive option (e.g., “toxic”) in their response set despite selecting the negative option (e.g., “not toxic”) as a forced-choice rating. For example, \(\beta^H = 0.3\) indicates that 30% of raters who chose “not toxic” actually considered “toxic” to also be reasonable. Setting \(\beta^H = 0\) recovers the case with no rating indeterminacy. By systematically varying \(\beta^H\), we can characterize how meta-evaluation results change under different levels of indeterminacy.
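
An illustrative simulation of this mechanism (our own sketch, under the stated assumption): each rater who gave the negative forced-choice rating also includes the positive option with probability \(\beta^H\).

```python
import random

def to_response_set(forced_choice, beta_H, rng):
    """With probability beta_H, a 'No' rater also deems 'Yes' reasonable."""
    if forced_choice == "No" and rng.random() < beta_H:
        return frozenset({"Yes", "No"})
    return frozenset({forced_choice})

rng = random.Random(0)
forced = ["No"] * 10_000
sets = [to_response_set(r, beta_H=0.3, rng=rng) for r in forced]
frac_both = sum(s == {"Yes", "No"} for s in sets) / len(sets)
print(frac_both)  # close to 0.3 by construction
```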

In our analysis, we compare how judge systems selected by different meta-evaluation approaches perform on downstream evaluation tasks. These meta-evaluation approaches vary in how they collect and aggregate ratings, and how they measure human–judge agreement (see paper for details). As we discuss next, the downstream evaluation tasks considered in our analysis represent common use cases of judge systems in realistic deployment scenarios.

Content Filtering: In content filtering, a judge system decides which outputs from a target system to allow or suppress. For instance, a platform must determine whether to filter potentially toxic content, balancing user safety against the potential for quality-of-service harms.

We measure performance via decision consistency, i.e., how often a judge makes the same allow/suppress decisions as humans:

$$ C^{\tau}(Y^J, Y^H) = \mathbb{E}\left[\mathbb{1}\left[s_{k}^{\tau}(Y^J_{ML}) = s_{k}^{\tau}(Y^H_{ML})\right]\right]. $$

Here, \(s_k^{\tau}(Y) = \mathbb{1}[Y_k \geq \tau]\) is a thresholding function that classifies content as toxic if the multi-label probability for option \(k\) exceeds a threshold \(\tau\). For example, if \(k\) = “toxic” and \(\tau = 0.3\), content gets filtered when there is at least a 30% probability that a rater identifies a toxic interpretation. The threshold \(\tau\) represents the evaluation designer’s risk tolerance. Lower values filter more aggressively.
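
A small sketch of the consistency metric (assumed data layout, not the paper’s code): threshold judge and human multi-label probabilities for option \(k\) and count matching filter decisions.

```python
import numpy as np

def decision_consistency(Y_J, Y_H, k, tau):
    """Fraction of items where judge and human thresholded decisions agree."""
    s_J = np.asarray(Y_J)[:, k] >= tau
    s_H = np.asarray(Y_H)[:, k] >= tau
    return np.mean(s_J == s_H)

# Columns: [P("toxic" in response set), P("non-toxic" in response set)]
human = np.array([[0.5, 0.8], [0.1, 1.0], [0.9, 0.2]])
judge = np.array([[0.4, 0.9], [0.4, 0.9], [0.8, 0.3]])
print(decision_consistency(judge, human, k=0, tau=0.3))  # 2 of 3 items agree
```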

Prevalence Estimation: In prevalence estimation, a judge system is used to estimate how frequently a certain concept, like helpfulness or toxicity, is present in target system outputs. This estimation task is commonly used in automated red-teaming when estimating the attack success rate, or when estimating the win rate between two models for a leaderboard.

We measure performance via estimation bias, i.e., how much an estimate obtained from a judge system differs from one obtained from human ratings:

$$ B^{\tau}(Y^J_{ML}, Y^H_{ML}) = \mathbb{E}\left[s_k^{\tau}(Y^J_{ML})\right] - \mathbb{E}\left[s_k^{\tau}(Y^H_{ML})\right] $$

For example, if humans identify 40% of outputs as toxic but a judge estimates only 25%, this -15% bias means the judge underestimates the prevalence of toxicity. Both metrics operate on multi-label vectors that preserve information about rating indeterminacy. This allows downstream users to set their own thresholds based on their risk tolerance and use case, rather than being constrained by how individual raters resolved indeterminacy when forced to choose.
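
In the same sketch style as above (our own helper names), estimation bias is simply the difference between judge- and human-derived prevalence estimates at threshold \(\tau\):

```python
import numpy as np

def estimation_bias(Y_J, Y_H, k, tau):
    """Judge prevalence minus human prevalence at threshold tau."""
    prev_J = np.mean(np.asarray(Y_J)[:, k] >= tau)
    prev_H = np.mean(np.asarray(Y_H)[:, k] >= tau)
    return prev_J - prev_H

human = np.array([[0.9], [0.6], [0.7], [0.1], [0.2]])  # 3/5 above tau = 0.5
judge = np.array([[0.8], [0.4], [0.3], [0.1], [0.2]])  # 1/5 above tau = 0.5
print(round(estimation_bias(judge, human, k=0, tau=0.5), 2))  # -0.4
```

A negative value means the judge under-reports the concept’s prevalence, as in the -15% example above.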

Figure 4: Estimated sensitivity parameters (\(\hat{\beta}^J_t\)) for each judge system across 11 rating tasks. For each judge–task pair, \(\hat{\beta}^J_t\) is the empirical probability that the judge includes the positive option in its response set given that it selected the negative option as a forced-choice rating. Each box plot shows the uncertainty of this estimate across bootstrap sub-samples of the dataset. Higher sensitivity values indicate that a judge is more likely to identify multiple plausible interpretations given that it selected a negative option as a forced-choice rating. The wide variation across tasks and models shows that judge systems differ considerably in how they resolve rating indeterminacy. Task Types: NLI: Natural Language Inference, QAQS: Question-Answer Quality, SummEval: Summary Evaluation, TopicalChat: Dialogue Quality

Finding 1: Judge systems differ from one another, and hence also from human raters, in how they resolve rating indeterminacy. While we do not know the true human sensitivity parameter, we can estimate each judge’s sensitivity parameter \(\hat{\beta}^J_t\) using its responses to both forced-choice and response set prompts. We see tremendous variation across systems and tasks. For SummEval (Relevance), for example, estimated parameters span a range from 0.01 to 0.54 across systems.

Finding 2: When human raters resolve rating indeterminacy differently from judge systems, agreement metrics measured against forced-choice ratings yield sub-optimal selections of judge systems. When humans and judge systems resolve indeterminacy differently (\(\beta^H \neq \beta^J\)), forced-choice human–judge agreement metrics like Hit-Rate, Cohen’s \(\kappa\), and Jensen-Shannon Divergence select judge systems that perform poorly on downstream tasks. Distributional agreement metrics like Jensen-Shannon Divergence tend to perform better than categorical agreement metrics like Hit-Rate, but performance degrades when \(\beta^H\) exceeds 0.2–0.3.

Figure 5: Aggregate analysis of judge system performance over 11 rating tasks, 9 LLMs, and a sweep of classification thresholds \(\tau\). The y-axis shows the “regret” (or reduction in performance) of using a human–judge agreement metric to select a judge system rather than directly optimizing for the downstream task metric (e.g., consistency, estimation bias).

While Figure 5 summarizes aggregate regret, Figure 6 below shows how these ranking inversions play out on specific tasks. Each column compares the ranking produced by a human–judge agreement metric (left axis of each subplot) with the ranking produced by the downstream metric (right axis).

  • On SNLI (left column), no inversion occurs: the judge system that scores highest under Cohen’s κ also achieves the lowest downstream bias. This shows that existing metrics can work well on some tasks.
  • On SummEval (Relevance) (center-left), however, the story is different: the judge system with the best KL-Divergence score is not the system with the lowest downstream estimation bias. Selecting the wrong judge in this case increases estimation bias by 28%, equivalent to grossly mis-estimating the rate of “relevant” target system outputs by an additional 0.28 (on a scale of [0, 1]).
  • Finally, the TopicalChat (Understandable) columns (right) illustrate two extremes. The multi-label MSE metric remains stable and consistent with the downstream metric, even under human rating indeterminacy (\(\beta^H_t = 0.3\)). In contrast, Hit-Rate, a widely used categorical agreement metric, yields a highly inconsistent ranking.
Figure 6: Task-specific breakdown of ranking consistency between human–judge agreement metrics (left axis of each subplot) and downstream performance metrics (right axis). On SNLI (left), forced-choice agreement metrics and the downstream metric rank the same judge as optimal. On SummEval (center left), the optimal judge with respect to KL-Divergence is not the judge with the lowest estimation bias. On TopicalChat (right two columns), our proposed multi-label MSE metric remains stable under rating indeterminacy \(\beta^H_t\), while ranking via Hit-Rate selects a highly sub-optimal judge system.

Finding 3: Multi-label metrics correctly identify high-performing judge systems. Figures 5 and 6 illustrate that our proposed approach, which involves eliciting response set ratings and measuring human–judge agreement via a continuous multi-label agreement metric (MSE), selects far more performant judge systems than forced-choice agreement metrics. Even when starting with an existing corpus of forced-choice data, we can estimate the translation matrix \(\hat{\mathbf{F}}_i\) using just 100 paired forced-choice and response set ratings and still select performant judge systems (see paper for details).

Practical Takeaways

Based on our findings, we offer four concrete recommendations for improving meta-evaluation:

1. Fully specify binary rating tasks by adding a Maybe or Tie option. This simple change eliminates the identifiability problem described above by creating a one-to-one correspondence between forced-choice options {Yes, No, Maybe} and response sets {{Yes}, {No}, {Yes, No}}. Note: this approach only works for binary tasks; rating tasks with three or more options cannot be fully specified this way.
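
The correspondence can be written down as a simple bijection (our own mapping, mirroring the description above):

```python
# One-to-one map from expanded forced-choice options to response sets
# for a binary rating task.
FORCED_TO_RESPONSE_SET = {
    "Yes":   frozenset({"Yes"}),
    "No":    frozenset({"No"}),
    "Maybe": frozenset({"Yes", "No"}),
}

# Because the mapping is a bijection, forced-choice counts over
# {Yes, No, Maybe} identify the response set distribution exactly.
assert len(set(FORCED_TO_RESPONSE_SET.values())) == len(FORCED_TO_RESPONSE_SET)
```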

2. Use response set elicitation when collecting new datasets. When it is not possible to fully eliminate indeterminacy (which is common for properties like helpfulness or relevance), collect response set ratings where raters select ALL options that are reasonable. Then, measure agreement using a continuous multi-label metric like MSE. This preserves essential information about rating indeterminacy that forced-choice elicitation eliminates.

3. Collect small auxiliary datasets to augment forced-choice ratings. Already have forced-choice data? Collect just ~100 paired forced-choice and response set ratings to estimate the translation matrix \(\hat{\mathbf{F}}\). Our experiments show this small investment enables much better judge selection (Finding 3 above). Check out our GitHub tutorial for implementation details.

4. If you must use forced-choice, choose distributional metrics carefully. Our results consistently show that KL-Divergence in the human→judge direction (not judge→human) performs best among forced-choice human–judge agreement metrics. Avoid categorical metrics like Hit-Rate, which are unreliable under rating indeterminacy.

Want to learn more or try this approach out for yourself? Find our implementation and quickstart tutorial on GitHub!

Acknowledgements: This blog post is based on our NeurIPS 2025 paper Validating LLM-as-a-Judge Systems under Rating Indeterminacy, co-authored with Solon Barocas, Hannah Wallach, Kenneth Holstein, Steven Wu, and Alexandra Chouldechova. Many thanks to my co-authors and to members of the Sociotechnical Alignment Center (STAC) at Microsoft Research for valuable feedback on early drafts of this work. Additionally, many thanks to Wayne Chi and Kiriaki Fragkia for helpful feedback on earlier versions of this blog post.
