Agreement and Alignment in Binary Rating Tasks: Strategic Convergence as an Equilibrium Outcome

Abstract

Agreement between raters does not, by itself, show how that agreement is produced. In binary judgement tasks, raters may classify cases similarly either because they rely on similar decision thresholds or because those thresholds operate under favourable marginal conditions. The present study examined whether convergence in raters’ decision thresholds must be imposed exogenously or can emerge endogenously through strategic adjustment. Within an equal-variance signal-detection framework, raters were modelled as choosing thresholds that balance expected classification accuracy against an incentive to align with the rest of the panel. Equilibrium thresholds were obtained by Nash best-response updating, and threshold variance, the Strategic Convergence Index (SCI), and mean pairwise Cohen’s $κ$ were examined as functions of alignment pressure and prevalence. Increasing alignment pressure produced a sharp collapse in threshold dispersion and drove SCI towards 1 under the chosen normalisation, whereas $κ$ increased only within a comparatively narrow range. When alignment pressure was held constant, SCI remained effectively unchanged across prevalence conditions, while $κ$ varied substantially. These findings indicate that convergence in decision thresholds can arise as an equilibrium outcome of strategic interaction yet remains analytically distinct from observable agreement. They also clarify the status of Cohen’s $κ$ in this setting: it tracks agreement at the level of realised classifications and therefore remains sensitive to marginal conditions, whereas SCI captures convergence in the latent decision structure from which those classifications arise. SCI is thus not proposed as a replacement for agreement coefficients but as a model-based quantity that makes explicit a form of alignment that $κ$ only partially reflects.

Keywords

inter-rater reliability; signal-detection theory; Nash equilibrium; Monte Carlo simulation

Introduction

Whether raters arrive at similar classifications is not the same question as how such similarity is generated. Recent signal-detection work has clarified that observable agreement and convergence in raters’ decision thresholds are analytically distinct properties of a binary rating system (Gianeselli, 2026). The question left open is whether such threshold convergence must be imposed as part of the model or can emerge endogenously through strategic adjustment among raters. The present study addresses that question. This issue matters because inter-rater reliability in binary judgement tasks is still commonly evaluated through chance-corrected agreement coefficients, most notably Cohen’s $κ$ (Cohen, 1960), which summarises concordance in observed classifications without specifying the latent decision structure from which those classifications arise. A long-standing difficulty is that $κ$ is sensitive not only to disagreement but also to the distributional conditions under which ratings are expressed. $κ$ may be attenuated under extreme prevalence, and it may behave counter-intuitively when high observed agreement coexists with asymmetric marginals or uneven positive-rate distributions across raters (Byrt et al., 1993; Cicchetti & Feinstein, 1990; Feinstein & Cicchetti, 1990). These well-known $κ$ paradoxes make observed agreement an unreliable guide to the underlying degree of alignment in raters’ decision criteria. Accordingly, changes in $κ$ cannot by themselves establish whether raters have become more closely aligned in the thresholds they apply or whether the same decision policies are simply being expressed under different marginal conditions.

Signal-detection theory (SDT) offers a framework for making this distinction explicit because binary judgements can be modelled as decisions generated by comparing latent evidence with rater-specific thresholds (DeCarlo, 1998; Green & Swets, 1966). In that setting, heterogeneity in observed agreement can be examined in relation to heterogeneity in the decision criteria applied by raters. This perspective is reinforced by work showing that decision criteria are not fixed observer properties but may be learned and adjusted as a function of task structure, reward contingencies, and response goals (Maddox, 2002). Signal-detection research has also challenged the assumption that criterion placement is perfectly stable, pointing instead to the possible contribution of criterion noise to observed responding (Benjamin et al., 2009). In the present context, however, the focus is not on trial-level variability in criterion placement but on the degree of convergence among raters in their operative decision thresholds at the level of decision policy. The Strategic Convergence Index (SCI) was proposed as a model-based summary of convergence in those thresholds (Gianeselli, 2026). Specifically, SCI is not introduced here as a new agreement coefficient but as a model-based quantity that helps clarify what changes in Cohen’s $κ$ do and do not imply about the underlying alignment of raters’ decision policies.

The further question, and the one taken up here, is whether convergence in decision thresholds can be generated within the model itself. The present study addresses this issue by modelling threshold convergence as a possible equilibrium outcome of strategic adjustment in a binary diagnostic task. Raters are assumed to balance expected classification accuracy against an incentive to align their decision thresholds, so that each threshold depends on the thresholds selected by others. This formulation is motivated by two considerations. First, binary decisions are often made in social settings in which the judgements of others can influence individual responding, even though the locus of that influence may vary across tasks and models (Germar et al., 2014). Second, signal-detection approaches have previously been extended to the analysis of collective decision-making, showing that group-level judgement can be modelled formally without being reduced to isolated individual performance (Sorkin et al., 2001). Under this formulation, convergence is evaluated using a Nash best-response criterion (Nash, 1951), allowing the simulation to examine whether a shared threshold structure can arise endogenously.

The simulation examined the behaviour of threshold variance, the SCI, and mean pairwise Cohen’s $κ$ as functions of alignment pressure and prevalence. It was expected that stronger alignment pressure would reduce the variance of equilibrium thresholds and increase the SCI, that mean pairwise $κ$ would also increase, though less strongly, and that, under fixed alignment pressure, the SCI would remain approximately stable across prevalence conditions whereas $κ$ would vary.

Method

Design

The Monte Carlo simulation examined whether convergence in rater decision thresholds can arise through strategic adjustment in a binary diagnostic task. Multiple raters classified cases as diseased (1) or non-diseased (0). The SCI was used as a model-based summary of convergence in rater decision thresholds and, more specifically, of alignment at the level of decision policy (Gianeselli, 2026). In this formulation, threshold variation was allowed to emerge endogenously from strategic adjustment rather than being imposed exogenously as a design feature. The simulation therefore tested whether the distinction between observable agreement and latent decision-policy alignment remained visible under strategically generated convergence.

Four hypotheses were tested. Hypothesis 1 (H1) stated that increasing alignment pressure ($λ$) would reduce the variance of equilibrium thresholds. Hypothesis 2 (H2) stated that increasing $λ$ would increase SCI. Hypothesis 3 (H3) stated that mean pairwise Cohen’s $κ$ would increase as $λ$ increased, but to a lesser extent than SCI. Hypothesis 4 (H4) stated that, under fixed alignment pressure, SCI would remain approximately stable across prevalence conditions, whereas mean pairwise Cohen’s $κ$ would vary with prevalence.

Generative Model

Each simulated case had a binary latent state, $Y_{n} \in \{0, 1\}$, where $Y_{n} = 1$ denoted the presence of the target condition and $Y_{n} = 0$ its absence. Prevalence was defined as

$$p = \Pr(Y = 1). \tag{1}$$

In the baseline condition, $p = .50$. In the prevalence analysis, $p$ was varied from .05 to .95 across 15 equally spaced values. This manipulation was included to test whether the dissociation between agreement and alignment persisted across changes in the marginal distribution of cases. The evidential structure followed an equal-variance signal-detection model. For rater $i$, latent evidence was assumed to arise from one of two Gaussian distributions,

$$X \mid Y = 0 \sim N(μ_{0} + b_{i}, σ^{2}), \qquad X \mid Y = 1 \sim N(μ_{1} + b_{i}, σ^{2}). \tag{2}$$

The baseline parameters were $μ_{0} = 0$, $μ_{1} = 1.5$, and $σ = 1$. The term $b_{i}$ represented a rater-specific bias parameter, drawn independently as

$$b_{i} \sim N(0, σ_{b}^{2}). \tag{3}$$

In the main analyses, $σ_{b} = 0.5$. In the robustness analysis, $σ_{b}$ took the values 0.2, 0.5, and 1.0. The parameter $b_{i}$ entered the expected-accuracy component of the utility function and thereby influenced each rater’s optimal threshold. Conceptually, $b_{i}$ serves to introduce rater-specific heterogeneity into the optimisation problem, so that any subsequent reduction in threshold dispersion can be attributed to strategic alignment rather than to identical underlying decision problems. Equilibrium thresholds were obtained through iterative best-response updating under alignment pressure $λ$. Binary ratings were then generated from rater-specific evidence draws sampled from the common class-conditional distributions defined by $μ_{0}$, $μ_{1}$, and $σ$. Threshold placement was therefore endogenous to strategic interaction, whereas observed agreement was evaluated under a shared distributional structure rather than a common realised evidence draw.

Strategic Decision Model

Each rater selected a decision threshold $t_{i}$ on the bounded interval $[a, b] = [-4, 5]$. Given prevalence $p$, evidence parameters $μ_{0}$, $μ_{1}$, and $σ$, and bias $b_{i}$, expected classification accuracy for rater $i$ was

$$A_{i}(t_{i}) = p \left[1 - Φ\left(\frac{t_{i} - (μ_{1} + b_{i})}{σ}\right)\right] + (1 - p)\, Φ\left(\frac{t_{i} - (μ_{0} + b_{i})}{σ}\right). \tag{4}$$

Here, $Φ(\cdot)$ denotes the standard normal cumulative distribution function. The first term represents the prevalence-weighted hit rate, and the second term represents the prevalence-weighted correct-rejection rate. This specification treats criterion placement as an object of optimisation under task constraints, in line with work showing that decision criteria may be adjusted as a function of performance contingencies rather than treated as fixed observer properties (Maddox, 2002).
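As an illustration of Equation 4, the following R sketch evaluates expected accuracy and compares a numerical maximiser with the closed-form accuracy-maximising threshold, $(μ_{0} + μ_{1})/2 + b_{i} + \frac{σ^{2}}{μ_{1} - μ_{0}} \ln\frac{1 - p}{p}$. That closed form is a standard equal-variance SDT result rather than something stated in the text, and it applies only when the optimum lies inside $[a, b]$. The function names and the example values $p = .30$ and $b_{i} = 0.2$ are hypothetical.

```r
# Expected classification accuracy (Equation 4).
# Defaults follow the baseline parameters reported in the text.
expected_accuracy <- function(t, p = 0.5, b = 0, mu0 = 0, mu1 = 1.5, sigma = 1) {
  hit_rate <- 1 - pnorm((t - (mu1 + b)) / sigma)
  correct_rejection_rate <- pnorm((t - (mu0 + b)) / sigma)
  p * hit_rate + (1 - p) * correct_rejection_rate
}

# Closed-form accuracy-maximising threshold (standard SDT result, assumed here;
# valid only when the optimum lies inside the admissible interval).
optimal_threshold <- function(p = 0.5, b = 0, mu0 = 0, mu1 = 1.5, sigma = 1) {
  (mu0 + mu1) / 2 + b + sigma^2 / (mu1 - mu0) * log((1 - p) / p)
}

# Numerical check on [-4, 5] for a hypothetical rater (p = .30, bias = 0.2).
numeric_opt <- optimize(expected_accuracy, interval = c(-4, 5), maximum = TRUE,
                        p = 0.3, b = 0.2)$maximum
closed_form <- optimal_threshold(p = 0.3, b = 0.2)
```

At $p = .50$ the closed form reduces to $0.75 + b_{i}$, so with no alignment pressure the threshold variance should be close to $σ_{b}^{2}$.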

Strategic adjustment was introduced through a quadratic penalty on deviation from the mean threshold of the remaining raters. Let

$$\bar{t}_{-i} = \frac{1}{K - 1} \sum_{j \neq i} t_{j} \tag{5}$$

denote the mean threshold of all raters other than $i$. Rater $i$’s utility was then defined as

$$U_{i}(t_{i}; \bar{t}_{-i}) = A_{i}(t_{i}) - λ (t_{i} - \bar{t}_{-i})^{2}, \qquad λ \geq 0. \tag{6}$$

This utility function represents the trade-off between individual diagnostic accuracy and an incentive to align the decision criterion with that of the panel. When $λ = 0$ , each rater maximised expected accuracy alone. As $λ$ increased, deviation from the current group mean became increasingly costly. The simulation used a Nash best-response condition (Nash, 1951) as an operational equilibrium criterion. In the present setting, an equilibrium threshold profile $t^{*} = (t_{1}^{*}, \dots, t_{K}^{*})$ satisfied

$$t_{i}^{*} \in \arg\max_{t_{i} \in [a, b]} U_{i}(t_{i}; \bar{t}_{-i}^{*}), \qquad i = 1, \dots, K. \tag{7}$$
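For an interior solution, Equation 7 implies a first-order condition that the text does not write out. The following is a sketch obtained by differentiating Equation 6 with respect to $t_{i}$, where $φ$ denotes the standard normal density (a symbol introduced here for illustration):

$$\frac{\partial U_{i}}{\partial t_{i}} = \frac{1}{σ}\left[(1 - p)\, φ\left(\frac{t_{i} - (μ_{0} + b_{i})}{σ}\right) - p\, φ\left(\frac{t_{i} - (μ_{1} + b_{i})}{σ}\right)\right] - 2 λ (t_{i} - \bar{t}_{-i}) = 0.$$

The bracketed term is the marginal accuracy gain, whose magnitude cannot exceed $\max(p, 1 - p)/\sqrt{2π}$, whereas the penalty term grows linearly in the deviation from $\bar{t}_{-i}$. Under this reading, an interior best response satisfies $|t_{i} - \bar{t}_{-i}| \leq \max(p, 1 - p) / (2 λ σ \sqrt{2π})$, which suggests why threshold dispersion shrinks quickly as $λ$ increases.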

Equilibrium thresholds were approximated by asynchronous best-response updating. At the start of each replication, each $t_{i}$ was drawn independently from a uniform distribution on $[-1, 2]$. Raters were then updated sequentially. For rater $i$, the current best response was obtained by one-dimensional numerical optimisation:

$$\mathrm{BR}_{i}(t_{-i}) = \arg\max_{t_{i} \in [a, b]} U_{i}(t_{i}; \bar{t}_{-i}). \tag{8}$$
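A single best response of the kind described in Equation 8 might be computed as in the sketch below. The use of `optimize()` is an assumption, since the text specifies only one-dimensional numerical optimisation, and the function names, the bias of 0.4, and the panel mean of 0.75 are hypothetical values chosen for illustration.

```r
# Expected accuracy (Equation 4) with the baseline SDT parameters as defaults.
expected_accuracy <- function(t, p, b, mu0 = 0, mu1 = 1.5, sigma = 1) {
  p * (1 - pnorm((t - (mu1 + b)) / sigma)) +
    (1 - p) * pnorm((t - (mu0 + b)) / sigma)
}

# Utility (Equation 6): accuracy minus a quadratic alignment penalty.
utility <- function(t, t_bar_others, lambda, p, b) {
  expected_accuracy(t, p, b) - lambda * (t - t_bar_others)^2
}

# Best response of one rater (Equation 8) on the admissible interval [-4, 5].
# optimize() is an assumed choice of one-dimensional optimiser.
best_response <- function(t_bar_others, lambda, p, b, bounds = c(-4, 5)) {
  optimize(utility, interval = bounds, maximum = TRUE,
           t_bar_others = t_bar_others, lambda = lambda, p = p, b = b)$maximum
}

# Hypothetical rater with bias 0.4 facing a panel mean of 0.75.
br_no_pressure <- best_response(t_bar_others = 0.75, lambda = 0, p = 0.5, b = 0.4)
br_pressure    <- best_response(t_bar_others = 0.75, lambda = 5, p = 0.5, b = 0.4)
```

With $λ = 0$ the best response sits at the rater's own accuracy-maximising threshold (about 1.15 here), whereas with $λ = 5$ it is pulled to within a small distance of the panel mean.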

The threshold vector was updated in place, so that each rater responded to the most recently available thresholds within the same iteration cycle. Iteration stopped when

$$\max_{i} \left| t_{i}^{(m + 1)} - t_{i}^{(m)} \right| < 10^{-6} \tag{9}$$

or when 1,000 iterations had been reached. The final threshold vector was retained as the simulated equilibrium profile. For each replication, the code stored convergence status, number of iterations, and final maximum update size. After the equilibrium threshold profile had been obtained, a binary rating matrix was generated. Each replication contained $N = 800$ cases. For each case,

$$Y_{n} \sim \mathrm{Bernoulli}(p). \tag{10}$$

Conditional on $Y_{n}$, a new evidence value was generated independently for each rater:

$$X_{in}^{\mathrm{obs}} \mid Y_{n} \sim N(μ_{Y_{n}}, σ^{2}), \tag{11}$$

where $μ_{Y_{n}} = μ_{0}$ if $Y_{n} = 0$ and $μ_{Y_{n}} = μ_{1}$ if $Y_{n} = 1$. The observed binary decision was

$$D_{in} = I(X_{in}^{\mathrm{obs}} > t_{i}^{*}). \tag{12}$$

The bias parameter affected the utility stage only and did not enter the realised evidence draws used to generate observed ratings. Thus, between-rater heterogeneity influenced the strategic formation of thresholds, whereas observed agreement was evaluated under a shared class-conditional distributional structure rather than a common realised evidence draw.
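The procedure in Equations 8 to 12 can be sketched as a single replication in R. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, `optimize()` is an assumed optimiser, and its tolerance is tightened to `1e-8` so that optimiser noise stays below the $10^{-6}$ stopping rule. The seed matches the reported value, but the random stream will not reproduce the original results.

```r
set.seed(12345)  # seed reported in the text; draws will not match the original code

expected_accuracy <- function(t, p, b, mu0 = 0, mu1 = 1.5, sigma = 1) {
  p * (1 - pnorm((t - (mu1 + b)) / sigma)) +
    (1 - p) * pnorm((t - (mu0 + b)) / sigma)
}

# Asynchronous best-response updating (Equations 8 and 9).
solve_equilibrium <- function(bias, lambda, p, bounds = c(-4, 5),
                              tol = 1e-6, max_iter = 1000) {
  K <- length(bias)
  t <- runif(K, -1, 2)                     # starting thresholds on [-1, 2]
  for (iter in seq_len(max_iter)) {
    t_old <- t
    for (i in seq_len(K)) {
      t_bar <- mean(t[-i])                 # uses thresholds already updated this cycle
      t[i] <- optimize(function(x) expected_accuracy(x, p, bias[i]) -
                         lambda * (x - t_bar)^2,
                       interval = bounds, maximum = TRUE, tol = 1e-8)$maximum
    }
    max_update <- max(abs(t - t_old))
    if (max_update < tol) break
  }
  list(thresholds = t, converged = max_update < tol,
       iterations = iter, max_update = max_update)
}

# Ratings matrix (Equations 10 to 12): N cases by K raters.
# Evidence is drawn independently per rater and does not include the bias term.
generate_ratings <- function(thresholds, p, N = 800, mu0 = 0, mu1 = 1.5, sigma = 1) {
  K <- length(thresholds)
  y <- rbinom(N, 1, p)
  x <- matrix(rnorm(N * K, mean = ifelse(y == 1, mu1, mu0), sd = sigma), nrow = N)
  ratings <- sweep(x, 2, thresholds, ">") * 1L
  list(truth = y, ratings = ratings)
}

bias <- rnorm(12, 0, 0.5)                  # K = 12 raters, sigma_b = 0.5
eq   <- solve_equilibrium(bias, lambda = 1.5, p = 0.5)
sim  <- generate_ratings(eq$thresholds, p = 0.5)
```

The returned list mirrors the quantities the text says were stored per replication: convergence status, iteration count, and final maximum update size.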

Outcome Measures

Dispersion of equilibrium thresholds was quantified by the sample variance

$$\mathrm{Var}(t^{*}) = \frac{1}{K - 1} \sum_{i = 1}^{K} (t_{i}^{*} - \bar{t}^{*})^{2}, \tag{13}$$

where ${\bar{t}}^{*}$ is the mean equilibrium threshold. This was the primary process-level measure of heterogeneity in decision criteria. SCI was computed as

$$\mathrm{SCI} = 1 - \frac{\mathrm{Var}(t^{*})}{\mathrm{Var}_{\max}}, \tag{14}$$

where

$$\mathrm{Var}_{\max} = \frac{(b - a)^{2}}{12}. \tag{15}$$

With $a = -4$ and $b = 5$, $\mathrm{Var}_{\max} = 6.75$ is the variance of a uniform distribution over the admissible threshold interval and serves as a fixed reference dispersion for normalisation. SCI therefore indexes threshold convergence relative to this benchmark dispersion, rather than relative to the absolute mathematical maximum variance attainable by a bounded variable. After computation, SCI was bounded to the unit interval to avoid minor numerical artefacts. Values near 1 indicate highly concentrated threshold profiles, whereas lower values indicate greater threshold dispersion relative to the reference interval. SCI was treated as a model-based index of threshold convergence, not as a replacement for agreement coefficients (Gianeselli, 2026).
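The SCI computation can be sketched in a few lines of R. The function name `sci` and its argument names are hypothetical; the worked check uses the threshold variance reported for $λ = 0$ in Table 1.

```r
# Strategic Convergence Index (Equations 13 to 15).
sci <- function(thresholds, lower = -4, upper = 5) {
  var_max <- (upper - lower)^2 / 12        # variance of Uniform(lower, upper)
  value <- 1 - var(thresholds) / var_max   # var() uses the K - 1 denominator
  min(max(value, 0), 1)                    # bound to the unit interval
}

# Worked check against the reported lambda = 0 row: Var(t*) = 0.233.
sci_from_variance <- 1 - 0.233 / ((5 - (-4))^2 / 12)
round(sci_from_variance, 3)  # 0.965
```

Because the reference variance is 6.75, even the largest threshold variance reported in the robustness analysis (about 1.03) leaves SCI near 0.85, so the index occupies a compressed upper range under this normalisation.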

For each replication, Cohen’s $κ$ was computed for every pair of raters from the $2 \times 2$ contingency table formed by their binary decisions across the $N$ cases:

$$κ = \frac{P_{o} - P_{e}}{1 - P_{e}}, \tag{16}$$

where $P_{o}$ is the observed agreement and $P_{e}$ is the agreement expected from the raters’ marginal response proportions. If $1 - P_{e}$ was numerically smaller than $10^{-10}$, the corresponding coefficient was set to missing. The primary agreement outcome was the mean pairwise $κ$ across all rater pairs.
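A minimal R sketch of the pairwise computation follows. The function names are hypothetical, and dropping missing coefficients from the mean (`na.rm = TRUE`) is an assumption, since the text does not say how missing values entered the average.

```r
# Cohen's kappa for two binary rating vectors (Equation 16).
cohen_kappa <- function(r1, r2, eps = 1e-10) {
  p_o <- mean(r1 == r2)                     # observed agreement
  p1 <- mean(r1)
  p2 <- mean(r2)
  p_e <- p1 * p2 + (1 - p1) * (1 - p2)      # agreement expected from the marginals
  if (1 - p_e < eps) return(NA_real_)       # degenerate marginals: set to missing
  (p_o - p_e) / (1 - p_e)
}

# Mean over all rater pairs; `ratings` is an N x K matrix of 0/1 decisions.
# na.rm = TRUE is an assumed treatment of missing coefficients.
mean_pairwise_kappa <- function(ratings) {
  pairs <- combn(ncol(ratings), 2)
  kappas <- apply(pairs, 2, function(idx) {
    cohen_kappa(ratings[, idx[1]], ratings[, idx[2]])
  })
  mean(kappas, na.rm = TRUE)
}

# Small hand-checkable example: P_o = 0.75, P_e = 0.5, so kappa = 0.5.
example_kappa <- cohen_kappa(c(1, 1, 0, 0), c(1, 0, 0, 0))
```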

Simulation Conditions

All analyses were conducted in R (version 4.5.2; see Appendix). The default panel size was $K = 12$. Baseline SDT parameters were $μ_{0} = 0$, $μ_{1} = 1.5$, and $σ = 1$. Thresholds were constrained to $[-4, 5]$. Each replication used $N = 800$ simulated cases. The random seed was fixed at 12345.

The first analysis manipulated alignment pressure across 16 equally spaced values from 0 to 5. At each value of $λ$, 40 replications were generated under $p = .50$, $K = 12$, and $σ_{b} = 0.5$. For each design point, the simulation summarised mean threshold variance, its empirical 2.5th and 97.5th quantiles, mean SCI, mean pairwise $κ$, convergence rate, and median iteration count.

The second analysis fixed alignment pressure at $λ = 1.5$ and varied prevalence from .05 to .95 across 15 equally spaced values. Each prevalence value was replicated 40 times. The retained outcomes were mean SCI and mean pairwise Cohen’s $κ$. The robustness analysis crossed

$$λ \in \{0, 1.5, 3.0\}, \qquad σ_{b} \in \{0.2, 0.5, 1.0\}, \qquad K \in \{6, 12, 20\}. \tag{17}$$

This produced 27 design cells. Each cell was replicated 30 times. For each cell, the simulation retained mean threshold variance, mean SCI, mean pairwise $κ$ , convergence rate, and median iteration count.
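The three design grids described above could be set up as follows. This is only a sketch of the design layout, with hypothetical object names.

```r
# Design grids described in the text.
lambda_grid     <- seq(0, 5, length.out = 16)       # 16 values, step of 1/3
prevalence_grid <- seq(0.05, 0.95, length.out = 15) # 15 values, includes .50
robustness_grid <- expand.grid(lambda  = c(0, 1.5, 3.0),
                               sigma_b = c(0.2, 0.5, 1.0),
                               K       = c(6, 12, 20))

# Replications implied by the reported design: 16 * 40 + 15 * 40 + 27 * 30.
total_replications <- length(lambda_grid) * 40 +
  length(prevalence_grid) * 40 +
  nrow(robustness_grid) * 30
```

The step of 1/3 in the alignment-pressure grid accounts for the rounded values 0.33, 0.67, and so on that appear in Table 1.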

Rationale for Parameter Choices

The parameterisation was chosen to keep the strategic extension interpretable within an equal-variance SDT setting. The values $μ_{0} = 0$, $μ_{1} = 1.5$, and $σ = 1$ define a moderate discrimination problem with substantial overlap between diseased and non-diseased evidence distributions. The threshold interval $[-4, 5]$ was wide enough to avoid implausible boundary truncation under the simulated conditions and to provide a stable benchmark dispersion for SCI normalisation. The choice of $N = 800$ cases per replication reflected a compromise between Monte Carlo stability and computational tractability. The default panel size of $K = 12$ was selected as a plausible multi-rater configuration and was retained as the baseline panel size throughout the main analyses. The prevalence manipulation was included because classical agreement coefficients are sensitive to marginal distributions, whereas SCI was intended to index convergence in decision policy. The analysis therefore tested whether those two quantities remained dissociable when threshold convergence was generated strategically rather than fixed by design.

Results

Alignment Pressure, Equilibrium Thresholds, and Convergence

Increasing alignment pressure was associated with a marked reduction in the variance of equilibrium thresholds. Mean $\mathrm{Var}(t^{*})$ declined from 0.233 at $λ = 0$ to 0.009 at $λ = 0.33$, and then decreased further across the remainder of the $λ$ range, reaching $6.20 \times 10^{-5}$ at $λ = 5$ (Table 1). The association between $λ$ and mean threshold variance was perfectly monotonic (Spearman’s $ρ = -1.00$). Convergence was obtained in almost all replications (overall convergence rate = 0.998), with a median of 142.5 iterations.

Table 1.

Main Simulation Results Across Alignment-Pressure Conditions.

$λ$	M $Var (t^{*})$	M SCI	M pairwise $κ$	Convergence rate	Median iterations
0.00	0.233	0.965	0.250	1.000	2.0
0.33	0.009	0.999	0.292	1.000	26.0
0.67	0.003	1.000	0.296	1.000	46.5
1.00	0.001	1.000	0.294	1.000	64.5
1.33	0.001	1.000	0.299	1.000	83.5
1.67	0.001	1.000	0.296	1.000	103.5
2.00	0.000	1.000	0.296	1.000	115.5
2.33	0.000	1.000	0.294	1.000	131.5
2.67	0.000	1.000	0.297	1.000	153.0
3.00	0.000	1.000	0.299	1.000	173.0
3.33	0.000	1.000	0.292	1.000	186.0
3.67	0.000	1.000	0.297	1.000	208.0
4.00	0.000	1.000	0.297	1.000	219.5
4.33	0.000	1.000	0.296	1.000	225.0
4.67	0.000	1.000	0.297	0.975	256.5
5.00	0.000	1.000	0.297	1.000	265.0

Note. Values are Monte Carlo means across 40 replications per design point. Convergence rate denotes the proportion of replications meeting the convergence criterion within 1,000 iterations. SCI = Strategic Convergence Index.

As shown in Figure 1, most of the reduction in threshold variance occurred at low values of $λ$, after which the curve flattened near zero. SCI increased monotonically with alignment pressure. Mean SCI was 0.965 at $λ = 0$, 0.999 at $λ = 0.33$, and approached 1.000 at $λ = 5$ (Table 1). The association between $λ$ and mean SCI was also perfectly monotonic (Spearman’s $ρ = 1.00$).

Figure 1.

Variance collapse under alignment pressure.

Mean pairwise Cohen’s $κ$ also increased with alignment pressure, but over a narrower range. Mean $κ$ rose from 0.250 at $λ = 0$ to values between 0.292 and 0.299 across the positive $λ$ conditions, with little further change thereafter (Table 1). H1, H2, and H3 were therefore supported.

Prevalence Effects on Agreement and Alignment

When alignment pressure was held constant at $λ = 1.5$, SCI showed minimal variation across prevalence conditions, whereas mean pairwise Cohen’s $κ$ varied substantially (Figure 2). The standard deviation of prevalence-wise mean SCI was $3.83 \times 10^{-5}$, whereas the corresponding standard deviation for mean $κ$ was 0.093.

Figure 2.

Prevalence effects on agreement and alignment.

Figure 2 shows that mean $κ$ followed an inverted-U pattern across the prevalence range, with lower values at the distributional extremes and higher values near balanced class proportions, whereas SCI remained effectively constant throughout. H4 was therefore supported.
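The level and prevalence dependence of $κ$ can be examined analytically under a simplifying assumption that is not part of the reported analysis: two raters share one threshold and receive independent evidence draws given the true state, as in Equations 10 to 12. The sketch below fixes that threshold at 0.75, the accuracy-maximising value for an unbiased rater at $p = .50$; in the simulation the equilibrium threshold also shifts with prevalence, so the values are indicative only. The function name is hypothetical.

```r
# Expected Cohen's kappa for two raters who share one threshold and receive
# independent evidence draws given the true state (illustrative assumption).
expected_kappa <- function(threshold, p, mu0 = 0, mu1 = 1.5, sigma = 1) {
  hit <- 1 - pnorm((threshold - mu1) / sigma)   # P(D = 1 | Y = 1)
  fa  <- 1 - pnorm((threshold - mu0) / sigma)   # P(D = 1 | Y = 0)
  p_o <- p * (hit^2 + (1 - hit)^2) + (1 - p) * (fa^2 + (1 - fa)^2)
  q   <- p * hit + (1 - p) * fa                 # marginal positive rate
  p_e <- q^2 + (1 - q)^2
  (p_o - p_e) / (1 - p_e)
}

kappa_balanced <- expected_kappa(threshold = 0.75, p = 0.5)
kappa_by_prevalence <- sapply(c(0.05, 0.25, 0.50, 0.75, 0.95),
                              function(p) expected_kappa(0.75, p))
```

Under these assumptions `kappa_balanced` is approximately 0.30, close to the plateau in Table 1. This suggests that once thresholds have converged, $κ$ is capped by discriminability and by the independence of the evidence draws, not by any residual threshold dispersion. The same function yields lower values towards the prevalence extremes, in line with the inverted-U pattern in Figure 2.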

Robustness Checks

The same general pattern was observed across variation in bias dispersion and panel size (Table 2). At $λ = 0$, larger values of $σ_{b}$ were associated with larger threshold variance, lower SCI, and lower mean pairwise $κ$. For example, with $K = 12$, increasing $σ_{b}$ from 0.2 to 1.0 increased mean threshold variance from 0.038 to 0.990, while decreasing mean SCI from 0.994 to 0.853 and mean pairwise $κ$ from 0.289 to 0.168.

Table 2.

Robustness Checks Across Bias Dispersion, Panel Size, and Alignment Pressure.

$λ$	$σ_{b}$	K	M $Var (t^{*})$	M SCI	M pairwise $κ$	Convergence rate	Median iterations
0.0	0.2	6	0.039	0.994	0.287	1.000	2.0
1.5	0.2	6	0.000	1.000	0.295	1.000	78.0
3.0	0.2	6	0.000	1.000	0.299	1.000	147.0
0.0	0.2	12	0.038	0.994	0.289	1.000	2.0
1.5	0.2	12	0.000	1.000	0.298	1.000	77.0
3.0	0.2	12	0.000	1.000	0.300	1.000	138.0
0.0	0.2	20	0.039	0.994	0.286	1.000	2.0
1.5	0.2	20	0.000	1.000	0.297	1.000	76.5
3.0	0.2	20	0.000	1.000	0.298	1.000	143.0
0.0	0.5	6	0.299	0.956	0.242	1.000	2.0
1.5	0.5	6	0.001	1.000	0.294	1.000	90.5
3.0	0.5	6	0.000	1.000	0.300	1.000	166.0
0.0	0.5	12	0.241	0.964	0.250	1.000	2.0
1.5	0.5	12	0.001	1.000	0.296	0.967	95.0
3.0	0.5	12	0.000	1.000	0.295	1.000	169.5
0.0	0.5	20	0.240	0.964	0.250	1.000	2.0
1.5	0.5	20	0.001	1.000	0.298	1.000	90.0
3.0	0.5	20	0.000	1.000	0.298	1.000	160.5
0.0	1.0	6	0.866	0.872	0.178	1.000	2.0
1.5	1.0	6	0.001	1.000	0.275	1.000	142.0
3.0	1.0	6	0.000	1.000	0.280	1.000	229.0
0.0	1.0	12	0.990	0.853	0.168	1.000	2.0
1.5	1.0	12	0.001	1.000	0.288	1.000	137.5
3.0	1.0	12	0.000	1.000	0.281	1.000	262.0
0.0	1.0	20	1.029	0.848	0.172	1.000	2.0
1.5	1.0	20	0.001	1.000	0.290	1.000	142.5
3.0	1.0	20	0.000	1.000	0.292	1.000	294.5

Note. Values are Monte Carlo means across 30 replications per design cell. SCI = Strategic Convergence Index.

At positive values of $λ$, threshold variance was reduced sharply across all robustness conditions, and SCI approached 1.000 throughout. Mean pairwise $κ$ also increased relative to the corresponding $λ = 0$ conditions. Convergence rates remained high across robustness cells, reaching 1.000 in all but one condition, where the rate was 0.967.

Discussion

Convergence in rater decision thresholds can emerge endogenously as an equilibrium outcome of strategic adjustment. As alignment pressure increased, equilibrium thresholds became more concentrated, SCI moved towards the upper end of its normalised scale, and mean pairwise Cohen’s $κ$ increased over a narrower range. This pattern suggests that observable agreement and latent decision-policy alignment remain distinct even when convergence arises within the model rather than being fixed in advance. In this sense, these analyses extend the earlier signal-detection framework (Table 3) by showing that convergence in thresholds may arise through strategic interaction among raters.

Table 3.

Relation Between the SCI Baseline Framework and the Strategic Convergence as an Equilibrium Outcome Model.

Component	Baseline SCI framework	Strategic convergence as an equilibrium outcome model
Core aim	To distinguish observable agreement from convergence in latent decision thresholds within an SDT-based account of inter-rater reliability.	To examine whether convergence in latent decision thresholds can arise endogenously through strategic interaction among raters.
Conceptual focus	Reliability is decomposed into outcome-level agreement and process-level threshold alignment.	The same decomposition is retained, but threshold alignment is modelled as an equilibrium outcome of interaction.
Status of threshold convergence	Threshold dispersion is specified exogenously as part of the simulation design.	Threshold dispersion is not fixed in advance but emerges endogenously under alignment pressure.
Role of SCI	SCI is introduced as a model-based summary of convergence in rater decision thresholds.	SCI retains the same interpretive role but here tracks convergence in equilibrium thresholds generated by strategic adjustment.
Role of Cohen’s $κ$	$κ$ is treated as an outcome-level statistic of realised agreement, sensitive to prevalence and perceptual discriminability.	$κ$ retains the same status but is evaluated under threshold profiles formed through strategic interaction.
Signal-detection structure	An equal-variance SDT framework links latent evidence, decision thresholds, and binary classifications.	The equal-variance SDT structure is retained and embedded in a strategic decision model.
Source of between-rater heterogeneity	Heterogeneity is represented through variation in decision thresholds.	Heterogeneity is represented through rater-specific bias terms in the utility structure and through differential equilibrium responses.
Mechanism generating thresholds	Thresholds are imposed or sampled under controlled dispersion conditions.	Thresholds are selected as best responses that trade off expected classification accuracy against alignment with the panel.
Main theoretical claim	Observable agreement and latent threshold alignment are analytically distinct and should not be conflated.	That distinction is preserved even when threshold alignment is generated within the model rather than imposed by design.
Treatment of prevalence	Prevalence is manipulated to show that $κ$ may vary while SCI remains stable.	Prevalence is manipulated to test whether the same dissociation persists under endogenously generated convergence.
Inferential scope	The framework is primarily generative, with supplementary analyses illustrating recovery of system-level properties under latent-truth uncertainty.	The framework is generative rather than inferential, focusing on equilibrium formation rather than empirical parameter recovery.
Practical interpretation	Supports a two-level reading of inter-rater reliability based on observable agreement and latent threshold alignment.	Extends that reading to settings in which raters may become aligned through repeated interaction, calibration, or shared decision environments.

Note. SCI = Strategic Convergence Index; SDT = signal-detection theory.

The behaviour of mean pairwise Cohen’s $κ$ is especially informative. Agreement increased with alignment pressure, but much less sharply than SCI, and remained sensitive to prevalence even when SCI was effectively unchanged. This implies that appreciable changes in equilibrium decision structure may be only partly visible in observable agreement. From this perspective, the so-called $κ$ paradoxes can be stated more precisely. One is the familiar case in which high observed agreement is accompanied by a comparatively attenuated $κ$ under skewed marginals; another is the marked dependence of $κ$ on prevalence even when the underlying rating process has not changed to a similar extent (Byrt et al., 1993; Cicchetti & Feinstein, 1990; Feinstein & Cicchetti, 1990). In the simulations, $κ$ varied across prevalence conditions even when alignment pressure was held constant and SCI remained effectively unchanged. By contrast, stronger alignment pressure sharply concentrated equilibrium thresholds while producing comparatively modest changes in $κ$ . Within this framework, $κ$ therefore appears to depend both on latent alignment and on the marginal conditions under which classifications are expressed. Much of the apparent inconsistency falls away once $κ$ is interpreted as an outcome-level summary of realised agreement, not as a direct measure of convergence in decision policy.

This distinction also helps situate the argument within the broader debate on alternative agreement coefficients. In applied research, the prevalence sensitivity of $κ$ has often motivated the use of statistics such as Gwet’s AC1, which can produce more stable values under skewed marginals (Wongpakaran et al., 2013). However, greater numerical stability does not by itself address the issue examined here. As Vach and Gerke (2023) noted, AC1 is not a simple substitute for Cohen’s $κ$ , because the two coefficients rely on different comparators and therefore change under different conditions. A further point follows from the present model: neither $κ$ nor its common alternatives directly represent convergence in latent decision thresholds. What is at issue, then, is not only which coefficient is more stable, but also which aspect of the rating process is being quantified.

SCI is informative in a different way. In the present model, it tracks convergence in the equilibrium distribution of decision thresholds and therefore operates at the level of decision policy, not observable agreement. Its near invariance across prevalence conditions, once alignment pressure was held constant, reflects that role: SCI captures the concentration of thresholds generated by the strategic process, not the marginal conditions under which those thresholds are expressed. For that reason, SCI is not offered as a replacement for classical agreement coefficients, nor as another candidate chance-corrected index. Its value is explanatory: it provides a model-based summary of threshold convergence and makes explicit a process-level property that remains only partly visible in outcome-level agreement statistics (Gianeselli, 2026).

Methodologically, these findings suggest that inter-rater reliability becomes more informative when observable agreement is distinguished from the latent structure of decision policies that produces it. Under the present framework, changes in agreement do not necessarily map directly onto changes in threshold convergence, and substantial convergence in decision thresholds may remain only partially visible in observed classifications. This distinction matters for the interpretation of calibration, training, and panel consistency. A rating system may appear only modestly more consistent in its outcomes while having become substantially more aligned in its underlying decision criteria. More broadly, the results support a two-level view of reliability, in which agreement coefficients summarise realised concordance, whereas model-based indices such as SCI characterise the alignment structure underlying that concordance.

Several limitations should be noted. The analysis was conducted within a stylised equal-variance signal-detection model and relied on a deliberately simple utility function in which raters traded off expected accuracy against quadratic alignment costs. These assumptions made the equilibrium structure tractable, but they do not exhaust the forms of strategic adjustment that may arise in applied settings. In addition, the Nash best-response criterion was used as an operational equilibrium device, not as the basis for a general claim about existence or uniqueness (Nash, 1951). The study is therefore generative rather than inferential: it shows how threshold convergence can emerge within a strategic framework, but it does not establish how such convergence should be estimated from empirical data under latent-truth uncertainty. That limitation is substantial, because when true class labels are unobserved, the problem becomes one of latent-structure inference rather than forward specification alone. In such settings, rater-specific parameters cannot be recovered directly from observed classifications without further modelling assumptions, and approaches such as Dawid and Skene (1979) become relevant because they treat latent truth and observer error as jointly unobserved. The present model stops short of providing that estimation framework. Future work should therefore extend the model to alternative utility structures, heterogeneous sensitivities, and estimation strategies capable of linking observed rating data to latent threshold dynamics. A particularly relevant direction would be to examine generalised Bayesian approaches to learning under misspecification, especially those that combine likelihood-based fit with decision-relevant objectives under latent uncertainty (e.g., Massari & Newton, 2026).

In sum, the present findings show that threshold convergence can emerge endogenously through strategic adjustment while remaining distinct from observable agreement. Inter-rater reliability may therefore be understood more clearly when these two levels are kept separate. Agreement coefficients describe realised classificatory concordance, whereas latent threshold alignment concerns the decision structure from which that concordance arises.

Conclusion

This study shows that linking Cohen’s $κ$, the SCI, and a Nash best-response framework helps distinguish observable agreement from convergence in latent decision thresholds, while also modelling the strategic process through which such convergence may emerge. For psychometric research, the main implication is that alignment in raters’ decision criteria need not be treated as fixed in advance but may arise endogenously through interaction. In this setting, SCI is useful not as a replacement for classical agreement coefficients but as a model-based complement to Cohen’s $κ$: the latter remains informative about realised agreement under marginal conditions, whereas SCI summarises the concentration of the thresholds from which those classifications arise. Read together, the two measures offer a more informative account of inter-rater reliability, particularly in calibration, training, and panel-monitoring settings in which raters are repeatedly exposed to shared decision environments. More broadly, the framework also bears on questions about how collective judgement can display surface-level agreement while remaining underdetermined at the level of the decision structures that produce it, although the focus here remains on inter-rater reliability rather than judgement aggregation in the broader sense (List, 2012).

Appendix

Appendix. Minimal R Simulation Code.

# ============================================================
# Minimal R Simulation Code of Strategic Convergence
# as an Equilibrium Outcome
# ============================================================
rm(list = ls())
set.seed(12345)
library(dplyr)
library(purrr)
library(tibble)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# 1) Parameters
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
K_default <- 12
mu0 <- 0
mu1 <- 1.5
sigma <- 1
a <- -4
b <- 5
N_cases <- 800
lambda_grid <- seq(0, 5, length.out = 16)
prevalence_grid <- seq(0.05, 0.95, length.out = 15)
nrep_main <- 40
nrep_prev <- 40
nrep_robust <- 30
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# 2) SCI normalization
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Uniform-benchmark variance over the admissible threshold interval
var_max <- (b - a)^2 / 12
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# 3) Core functions
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
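# Expected accuracy of rater i at threshold t under prevalence p;
# bias[i] shifts both class means as perceived by rater i
accuracy_i <- function(t, i, bias, p = 0.5) {
  hit <- 1 - pnorm((t - (mu1 + bias[i])) / sigma)
  fa <- 1 - pnorm((t - (mu0 + bias[i])) / sigma)
  p * hit + (1 - p) * (1 - fa)
}
# Utility: expected accuracy minus a quadratic cost for deviating
# from the mean threshold of the other raters
utility_i <- function(t_i, t_bar_others, lambda, i, bias, p = 0.5) {
  accuracy_i(t_i, i, bias, p) - lambda * (t_i - t_bar_others)^2
}
# Best response: threshold in [a, b] that maximises utility
# (optimize() minimises, hence the sign flip)
best_response_i <- function(t_bar_others, lambda, i, bias, p = 0.5) {
  optimize(
    f = function(t) -utility_i(t, t_bar_others, lambda, i, bias, p),
    interval = c(a, b)
  )$minimum
}
# Sequential best-response updating until no threshold moves by more than tol
simulate_equilibrium <- function(lambda, bias, p = 0.5,
                                 max_iter = 1000, tol = 1e-6) {
  K <- length(bias)
  t <- runif(K, -1, 2)  # random starting thresholds
  converged <- FALSE
  for (iter in seq_len(max_iter)) {
    t_old <- t
    for (i in seq_len(K)) {
      t_bar_others <- mean(t[-i])
      t[i] <- best_response_i(t_bar_others, lambda, i, bias, p)
    }
    if (max(abs(t - t_old)) < tol) {
      converged <- TRUE
      break
    }
  }
  list(
    t = t,
    converged = converged,
    iter = iter,
    max_delta = max(abs(t - t_old))
  )
}
# SCI: one minus threshold variance relative to the uniform benchmark,
# clipped to [0, 1]
compute_SCI <- function(t) {
  sci <- 1 - var(t) / var_max
  pmax(0, pmin(1, sci))
}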
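# Simulate binary ratings: N cases, one column per rater.
# Evidence is drawn from the unbiased class distributions; rater bias
# enters only through the equilibrium thresholds t.
generate_ratings <- function(t, N = N_cases, p = 0.5) {
  K <- length(t)
  true <- rbinom(N, 1, p)
  ratings <- matrix(0L, nrow = N, ncol = K)
  for (i in seq_len(K)) {
    # rnorm() takes a lowercase `mean` argument; `Mean` raises an unused-argument error
    evidence <- rnorm(N, mean = ifelse(true == 1, mu1, mu0), sd = sigma)
    ratings[, i] <- as.integer(evidence > t[i])
  }
  ratings
}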
# Mean Cohen's kappa over all rater pairs
pairwise_kappa <- function(ratings) {
  K <- ncol(ratings)
  idx <- combn(K, 2)
  kappas <- numeric(ncol(idx))
  for (j in seq_len(ncol(idx))) {
    a_r <- ratings[, idx[1, j]]
    b_r <- ratings[, idx[2, j]]
    po <- mean(a_r == b_r)                    # observed agreement
    p_a <- mean(a_r)
    p_b <- mean(b_r)
    pe <- p_a * p_b + (1 - p_a) * (1 - p_b)   # chance agreement
    # kappa is undefined when chance agreement equals 1
    kappas[j] <- if (abs(1 - pe) < 1e-10) NA_real_ else (po - pe) / (1 - pe)
  }
  mean(pmax(-1, pmin(1, kappas)), na.rm = TRUE)
}
# One Monte Carlo replicate: draw biases, solve for equilibrium thresholds,
# simulate ratings, and summarise
run_one <- function(lambda, p = 0.5, sd_bias = 0.5, K = K_default) {
  bias <- rnorm(K, 0, sd_bias)
  eq <- simulate_equilibrium(lambda, bias, p = p)
  ratings <- generate_ratings(eq$t, p = p)
  tibble(
    lambda = lambda,
    prevalence = p,
    sd_bias = sd_bias,
    K = K,
    converged = eq$converged,
    iterations = eq$iter,
    var_t = var(eq$t),
    sci = compute_SCI(eq$t),
    kappa = pairwise_kappa(ratings)
  )
}
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# 4) Main results table
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cat("\nRunning main results…\n")
pb1 <- txtProgressBar(min = 0, max = length(lambda_grid), style = 3)
main_res <- map_dfr(seq_along(lambda_grid), function(i) {
  lam <- lambda_grid[i]
  out <- map_dfr(seq_len(nrep_main), ~ run_one(lambda = lam))
  setTxtProgressBar(pb1, i)
  out
})
close(pb1)
table_main <- main_res %>%
  group_by(lambda) %>%
  summarise(
    `Mean Var(t)` = mean(var_t),
    `Mean SCI` = mean(sci),
    `Mean pairwise k` = mean(kappa, na.rm = TRUE),
    `Conv. rate` = mean(converged),
    `Median iterations` = median(iterations),
    .groups = "drop"
  )
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# 5) Prevalence analysis and sanity checks
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cat("\nRunning prevalence analysis…\n")
pb2 <- txtProgressBar(min = 0, max = length(prevalence_grid), style = 3)
lambda_fixed <- 1.5
prev_res <- map_dfr(seq_along(prevalence_grid), function(i) {
  pv <- prevalence_grid[i]
  out <- map_dfr(seq_len(nrep_prev), ~ run_one(lambda = lambda_fixed, p = pv))
  setTxtProgressBar(pb2, i)
  out
})
close(pb2)
sanity_checks <- tibble(
  Check = c(
    "Overall convergence rate",
    "Median iterations",
    "Spearman: lambda vs mean Var(t)",
    "Spearman: lambda vs mean SCI",
    "SD of prevalence-wise mean SCI",
    "SD of prevalence-wise mean kappa"
  ),
  Value = c(
    mean(main_res$converged),
    median(main_res$iterations),
    cor(table_main$lambda, table_main$`Mean Var(t)`, method = "spearman"),
    cor(table_main$lambda, table_main$`Mean SCI`, method = "spearman"),
    prev_res %>%
      group_by(prevalence) %>%
      summarise(m = mean(sci), .groups = "drop") %>%
      summarise(sd_val = sd(m)) %>%
      pull(sd_val),
    prev_res %>%
      group_by(prevalence) %>%
      summarise(m = mean(kappa, na.rm = TRUE), .groups = "drop") %>%
      summarise(sd_val = sd(m)) %>%
      pull(sd_val)
  )
)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# 6) Robustness table
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cat("\nRunning robustness checks…\n")
robust_grid <- expand.grid(
  lambda = c(0, 1.5, 3.0),
  sd_bias = c(0.2, 0.5, 1.0),
  K = c(6, 12, 20)
) %>%
  as_tibble()
pb3 <- txtProgressBar(min = 0, max = nrow(robust_grid), style = 3)
robust_res <- map_dfr(seq_len(nrow(robust_grid)), function(i) {
  rr <- robust_grid[i, ]
  out <- map_dfr(seq_len(nrep_robust), ~ run_one(
    lambda = rr$lambda,
    sd_bias = rr$sd_bias,
    K = rr$K
  ))
  setTxtProgressBar(pb3, i)
  out
})
close(pb3)
table_robustness <- robust_res %>%
  group_by(lambda, sd_bias, K) %>%
  summarise(
    `Mean Var(t)` = mean(var_t),
    `Mean SCI` = mean(sci),
    `Mean pairwise k` = mean(kappa, na.rm = TRUE),
    `Conv. rate` = mean(converged),
    `Median iterations` = median(iterations),
    .groups = "drop"
  ) %>%
  arrange(sd_bias, K, lambda)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# 7) Printed appendix outputs
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cat("\nTable A1. Main simulation results\n")
print(table_main, n = Inf)
cat("\nTable A2. Robustness checks\n")
print(table_robustness, n = Inf)
cat("\nTable A3. Sanity checks\n")
print(sanity_checks, n = Inf)

Note. The appendix reports the minimal R simulation code. The code retains the parameterization, equilibrium routine, and Monte Carlo structure used in the main analyses, while omitting figure-generation and export routines for brevity. It reproduces the main results table, the robustness checks, and the sanity checks reported in the text. Progress bars are included only to monitor execution of the Monte Carlo loops. The complete R code, including all supplementary analyses, figures, and data export routines, is available from the corresponding author upon reasonable request.
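As a small illustration of the SCI normalisation in section 2 of the code above, the following self-contained R sketch evaluates the index for two hypothetical four-rater panels. The interval bounds are those of the appendix; the helper name sci_from_thresholds and the two threshold vectors are assumptions introduced only for this example.

```r
# Worked example of the SCI normalisation (illustrative sketch).
a <- -4; b <- 5                 # admissible threshold interval from the appendix
var_max <- (b - a)^2 / 12       # variance of a uniform distribution on [a, b]

sci_from_thresholds <- function(t) {
  sci <- 1 - var(t) / var_max
  min(1, max(0, sci))           # clip to [0, 1], mirroring compute_SCI
}

# Hypothetical panels of four raters
t_dispersed <- c(-2, 0, 1.5, 3.5)          # thresholds spread across the interval
t_aligned   <- c(0.70, 0.74, 0.76, 0.80)   # thresholds tightly clustered

sci_dispersed <- sci_from_thresholds(t_dispersed)
sci_aligned   <- sci_from_thresholds(t_aligned)
print(c(var_max = var_max, dispersed = sci_dispersed, aligned = sci_aligned))
```

Because the benchmark is the variance of a uniform distribution over the whole admissible interval, thresholds must be spread over a large part of that interval before SCI falls far below 1, which is worth keeping in mind when reading SCI values under this normalisation.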

ORCID iD

Irene Gianeselli

Ethical Considerations

Ethical approval was not required for this study, as all analyses were based on simulated data and did not involve human participants.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The complete R code used for the simulations is available from the author upon reasonable request.

Informed Consent

Informed consent was not required for this study, as no data from human participants were collected.

References

Benjamin, A. S., Diaz, & Wee (2009). Signal detection with criterion noise: Applications to recognition memory. Psychological Review, 116(1), 84–115. https://doi.org/10.1037/a0014351

Byrt, Bishop, & Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46(5), 423–429. https://doi.org/10.1016/0895-4356(93)90018-v

Cicchetti, & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551–558. https://doi.org/10.1016/0895-4356(90)90159-m

Cohen (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46. https://doi.org/10.1177/001316446002000104

Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 20–28. https://doi.org/10.2307/2346806

DeCarlo, L. T. (1998). Signal detection theory and generalized linear models. Psychological Methods, 3(2), 186–205. https://doi.org/10.1037/1082-989X.3.2.186

Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543–549. https://doi.org/10.1016/0895-4356(90)90158-l

Germar, Schlemmer, Krug, Voss, & Mojzisch (2014). Social influence and perceptual decision making: A diffusion model analysis. Personality and Social Psychology Bulletin, 40(2), 217–231. https://doi.org/10.1177/0146167213508985

Gianeselli (2026). From agreement to epistemic alignment: A signal detection–theoretic model of inter-rater reliability. Educational and Psychological Measurement. Advance online publication. https://doi.org/10.1177/00131644261417643

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. Wiley.

List (2012). The theory of judgment aggregation: An introductory review. Synthese, 187, 179–207. https://doi.org/10.1007/s11229-011-0025-3

Maddox, W. T. (2002). Toward a unified theory of decision criterion learning in perceptual categorization. Journal of the Experimental Analysis of Behavior, 78(3), 567–595. https://doi.org/10.1901/jeab.2002.78-567

Massari, & Newton (2026). Rational beliefs when the truth is not an option. International Journal of Game Theory, 55(1), 1–26. https://doi.org/10.1007/s00182-025-00976-w

Nash (1951). Non-cooperative games. Annals of Mathematics, 54(2), 286–295. https://doi.org/10.2307/1969529

Sorkin, R. D., Hays, C. J., & West (2001). Signal-detection analysis of group decision making. Psychological Review, 108(1), 183–203. https://doi.org/10.1037/0033-295X.108.1.183

Vach, & Gerke (2023). Gwet’s AC1 is not a substitute for Cohen’s kappa—A comparison of basic properties. MethodsX, 10, 102212. https://doi.org/10.1016/j.mex.2023.102212

Wongpakaran, Wedding, & Gwet, K. L. (2013). A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples. BMC Medical Research Methodology, 13, 61. https://doi.org/10.1186/1471-2288-13-61