Justifying the Score or Informing the Stakeholder? Transparency Challenges in Large-Scale Language Testing

Abstract

Language testing is both an evaluative practice and a commercial enterprise shaped by market forces. Within this context, test developers have a responsibility to ensure transparency with test users, particularly when scores inform high-stakes decisions. This Viewpoint contrasts transparency in communicating the truth about language tests with the persuasiveness of argument-based validity, noting that persuasiveness, although central to such arguments, is not equivalent to transparency. Two measures are proposed to strengthen transparency. First, test developers should publish test-form-specific validity reports detailing content, psychometric properties, and interpretation guidelines for each form. Second, they should clearly explain a test’s limitations to the public, especially when scores are used in high-stakes settings, such as immigration or university admission, without adequate validation. The latter measure draws on regulatory norms in the pharmaceutical industry, where transparency can protect consumers from potential misuse. Specific steps are outlined to support these measures. Overall, these proposals aim to shift the emphasis from justification and persuasion toward transparency and align language testing practices more closely with open science principles.

Keywords

large-scale testing, open science, test form, transparency, validity

Introduction

In this Viewpoint, I aim to draw attention to a critical aspect of transparency and open science (OS) that extends beyond Winke’s (2024) Viewpoint and several responses published in Language Testing’s Special Issue on Open Science (Kremmel & Isbell, 2024). While these contributions discussed test-taking individuals from different perspectives, there is room for expanding this framework by emphasizing transparency in argument-based validity frameworks designed to support the use and interpretation of test scores, as well as the evaluation of their consequences. In particular, this expansion should foreground the role of the test-taker, a key stakeholder in language testing (Jin, 2022). I put forward two specific proposals to promote greater transparency and to address certain limitations in current argument-based validity and/or Assessment Use Argument (AUA) frameworks (e.g., Bachman & Palmer, 2010; Chapelle et al., 2008; Kane, 2013) as applied to language test development and validation. The first proposal outlines specific requirements for data analysis and reporting to ensure transparency at the level of each test form, while the second offers a practical approach to promoting transparency and clarity in score reporting. Both proposals apply primarily to large-scale language tests whose results inform consequential decisions, such as those related to admission, immigration, or employment. However, some of the arguments presented may also be relevant to other types of tests, particularly where issues of transparency and validity evidence are concerned.

First, I urge developers of language tests, particularly those used for high-stakes decisions, to publish whatever validity evidence is already available for each specific test form, with that evidence clearly tied to the cohort, administration conditions, and intended uses associated with the form in question. As Bachman and Palmer (2010) have stressed, “an AUA is specific to a particular assessment situation and the process of assessment justification is local and relevant to that situation” (pp. 95–96). Although they do not explicitly discuss parallel test forms, their emphasis on situational specificity suggests that each operational administration of a test form, defined by its cohort, administration conditions, and score uses, constitutes a distinct assessment situation that requires context-bound evidence. I argue that this requirement is motivated by empirical evidence showing substantial variability in test content across different test forms, as demonstrated through corpus-based research, and the corresponding variability that can be expected in test functionality and associated statistics (Green et al., 2010, p. 207; Nishizawa, 2024; Zhao & Aryadoust, 2024). Making validity evidence for each test form available would allow stakeholders to examine the basis for score interpretation in specific contexts, which is consistent with the spirit of openness in research. However, to my knowledge, no existing validation approach fully achieves these aims. This is unsurprising, as contemporary validation approaches are designed to provide practical frameworks for organizing and evaluating evidence in assessment situations, often with the goal of convincing or persuading test users (e.g., Bachman & Palmer, 2010).

Given that individual language tasks or test forms may function differently across specific contexts, evidence gathered for one form, test-taker population, or administrative setting cannot automatically be applied to others. In practice, however, published validity arguments often follow two problematic patterns: they are either derived from separate test forms administered to different cohorts that yield distinct validity inferences—such as evidence supporting the “scoring inference” from one test form combined with evidence for another inference from a different form, all aggregated into a single validity argument (e.g., Chapelle et al., 2008; Wang et al., 2012)—or they involve generalizing findings from a single test form to others constructed under similar specifications (see Aryadoust, 2013). Because these test instruments are administered to distinct cohorts in varying contexts, aggregating validity evidence across test forms may mask or limit the comparability and generalizability of score inferences. Accordingly, test validation efforts should pay greater attention to potential variations in content, psychometric quality, and score uses across individual test forms, rather than relying solely on aggregated evidence.¹ I will offer specific suggestions on the type of form-specific validity evidence that test developers should present to improve transparency and clarity.

This brings me to the second proposal, which concerns the need for greater clarity and transparency in score reporting. I argue that a test validation framework guided by the OS principle of transparency can provide safeguards against the misuse or misinterpretation of test scores. In this context, transparency means making the basis for score interpretations publicly available, ensuring that this information is accessible to non-specialist stakeholders, and openly acknowledging the limitations and potential misuses of the test. By contrast, the validity frameworks widely used in language assessment do not necessarily serve the interests of test users and may align more closely with the priorities of test developers. For example, the central premise of the AUA framework is to “justify” the uses of scores and “convince” the test user, in much the same way as a lawyer builds “a legal case to convince a judge or a jury.” As Bachman and Palmer (2010, p. 95) state,

[t]he lawyer’s purpose in presenting her case is to convince the judge or the jury that her client is innocent. Similarly, the process of assessment justification consists of building a “case” that the intended uses of the assessment are justified.

This point is noteworthy because it underscores that the AUA framework centers on persuading test users that test use is defensible, rather than on foregrounding the conditions, uncertainties, and limitations within which that defensibility applies.

As a result, test developers are advised to build an AUA “that stakeholders find convincing” (Bachman & Palmer, 2010, p. 95). This approach fundamentally positions the test developer as someone building a case to justify and defend the uses of test scores before an imagined challenger, with the ultimate aim of ensuring that the developer’s position prevails (see Aryadoust, 2023). I argue that this approach cannot necessarily safeguard transparency and epistemological integrity,² as it prioritizes persuasive reasoning over the open disclosure of test limitations, assumptions, and uncertainties. Such disclosure is what Winke (2024) and special issue respondents to her Viewpoint identify as critical (e.g., Isbell & Kremmel, 2024; Koizumi et al., 2024). I therefore propose that, to promote greater transparency, test developers should clearly communicate the limitations of their tests and how scores should be interpreted, both in the reports sent to institutions that receive scores and on their official websites, where this information is publicly accessible. This would represent a significant shift from justification toward transparent disclosure.

In sum, the two proposals aim to confront the ongoing challenge of accumulating validity evidence. In addition, they seek to improve transparency around the intended uses and potential misuses of test scores.

Improving Transparency Practices in Language Testing

In response to Winke (2024), authors expressed uncertainty about whether developers of large-scale proficiency tests can do more to enhance transparency (e.g., Papageorgiou, 2024). Clark and Bruce (2024) argued that trust in such contexts should rest on robust, peer-reviewed evidence rather than reputation and that claims such as “equivalence” or “alignment to existing frameworks” should be open to scrutiny (p. 873). At the same time, they cautioned that transparency must be balanced against commercial, security, and practical constraints to avoid unintended consequences, such as the misuse of openly shared data. Similarly, LaFlair (2024) emphasized that OS is a multifaceted movement with goals that extend beyond transparency and accessibility to include infrastructure development and alternative impact measures. He underscored that effective implementation “needs community” and “has costs” (p. 867), noting that collaboration between industry and academia, as seen in large-scale open corpora projects, can promote trust but requires sustained investment in data anonymization, access protocols, and outreach to multiple audiences.

While I recognize that developers of some existing large-scale language tests have made notable efforts to maintain a certain level of transparency through the dissemination of their research, I will now elaborate on the two points introduced earlier, namely test-form-specific reports and arguments, and clarity and transparency in score reporting, which are often absent from validity documents published by such test developers. I then propose strategies for addressing these points as potential avenues for improving transparency in a constructive and practicable way.

Test-Form-Specific Reports and Arguments

In the context of OS, a key area for improving transparency is how validity evidence, when applicable, is collected and shared for specific test forms. Here, test forms refer to versions of a test developed from the same specifications but differing in their items, content, or tasks. Differences between test forms can have real-world consequences. Consider, for example, a large-scale assessment in which one form includes culturally neutral passages, while another contains culturally specific references unfamiliar to certain test-taker groups. If those groups score markedly lower on the second form, not because of weaker language ability but because of its cultural content, this would illustrate how, even when forms follow identical specifications, variations in content can disadvantage some test-takers and create inequities that may go unnoticed unless form-specific evidence is collected and reported. This concern is supported by corpus-based research in language assessment, which has documented significant variation in the content of different test forms developed based on the same set of test specifications (Green et al., 2010; Tao & Aryadoust, 2024; Zhao & Aryadoust, 2024).

A common method for gathering content validity evidence involves analyzing the linguistic features of multiple test forms to identify patterns, (in)consistencies, or gaps. While such analyses often draw on corpus-based methods, they can also include expert judgment studies, specification-item congruence reviews, and other approaches to evaluating content comparability across parallel forms. In corpus-based work, this typically involves constructing a corpus from multiple test forms and comparing it with a reference corpus representing a target language use (TLU) domain or an alternative test (e.g., Green et al., 2010; Nishizawa, 2024; Zhao & Aryadoust, 2024), thereby providing an overview of linguistic content and structural variation across forms. A range of lexico-grammatical and discourse features has been employed in these comparisons, drawing on established approaches such as Biber’s multi-dimensional analysis of co-occurring grammatical patterns (Biber, 1992; Biber & Larsson, 2025), text complexity measures based on linguistic feature co-occurrence (Crossley et al., 2014), and semantic feature analyses that quantify meaning-related properties of texts (e.g., Rayson, 2008).

However, it should be noted that individual forms may not always reflect the broader corpus, align closely with a reference corpus, or share key characteristics with other forms in the same test series. While the extent to which such monitoring is carried out varies across contexts, including large-scale test developers, governments and ministries of education, and other assessment bodies, published information on how content comparability is tracked and reported remains limited. The variation in content across test forms, as documented in prior research (e.g., Green et al., 2010, p. 207; Liu et al., 2022; Nishizawa, 2024; Zhao & Aryadoust, 2024), underscores that even when test forms share certain similarities, they often differ significantly from one another and from their reference corpus. These differences raise the question of whether content validity evidence derived from one or a few forms can legitimately be generalized to all forms produced under the same specifications.
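To illustrate how the divergence of a single form from a reference corpus might be made visible, the following is a minimal sketch in Python. It is not the method used in any of the studies cited above: the three surface features, the function names `feature_profile` and `divergence_from_reference`, and the use of z-scores are all assumptions chosen for brevity, and a real analysis would rely on far richer lexico-grammatical or semantic feature sets.

```python
import re
from statistics import mean, stdev


def feature_profile(text):
    """Return a few crude surface features of a text (illustrative only).

    Assumes the text contains at least one sentence and one word.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "mean_sentence_length": len(words) / len(sentences),
        "mean_word_length": mean(len(w) for w in words),
        "type_token_ratio": len(set(words)) / len(words),
    }


def divergence_from_reference(form_text, reference_texts):
    """Express each feature of one test form as a z-score relative to the
    distribution of that feature across the reference texts.

    Requires at least two reference texts. A feature that does not vary
    in the reference set is reported as 0.0.
    """
    form = feature_profile(form_text)
    reference = [feature_profile(t) for t in reference_texts]
    report = {}
    for name, value in form.items():
        ref_values = [profile[name] for profile in reference]
        sd = stdev(ref_values)
        report[name] = (value - mean(ref_values)) / sd if sd > 0 else 0.0
    return report
```

Running `divergence_from_reference` separately for each form, rather than once for a pooled corpus of forms, would show which individual forms sit far from the reference distribution on a given feature, which is the kind of form-level information that aggregated analyses can obscure.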

Some level of variability in the content of different test forms is natural (e.g., Tao & Aryadoust, 2024), even with advances in automated test generation (Aryadoust et al., 2024). For example, as Tao and Aryadoust (2024) have shown, co-occurring linguistic features, which refer to clusters of features that tend to appear together across texts or load on a single factor or component, can vary substantially across test forms, even when other content controls are in place. Given the range of possible variation, I do not mean to imply that all test forms must be identical in their linguistic and content features to be considered comparable. However, it is essential to assess and transparently report the extent to which the tasks and language in each specific test form reflect the communicative demands, discourse features, and functional requirements of the TLU contexts the test is intended to represent. To address this concern, I propose that test developers publish detailed analyses for each individual test form following every administration to every cohort of test-takers. To streamline this process, a standardized framework should be established and adopted, incorporating at least the following key elements (not necessarily in the order presented):

(a) Detailed psychometric properties of each test form, enabling evaluation of the quality and functionality of individual test items.

(b) An analysis of linguistic features and content in comparison with the TLU domain, in order to assess the relevance and authenticity of each test form. For example, mapping the co-occurring linguistic features of the test against the TLU domain can reveal where the test aligns with or diverges from that domain in terms of linguistic and semantic aspects related to test task characteristics (see Biber & Larsson, 2025, and Tao & Aryadoust, 2024, for a discussion of relevant methods).

(c) Clear guidelines for score interpretation and use, summarizing how the scores from a specific test form should and should not be interpreted and applied by stakeholders who rely on the test developers’ products (see e.g., Isaacs et al., 2017). Such guidance can help prevent misinterpretation or misuse that might lead to negative social consequences, such as inequitable access to education, employment, or immigration opportunities, and can support fairer outcomes for individuals and groups affected by testing (see the next section for further details).
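As one possible illustration of element (a), the sketch below computes a small set of classical test theory statistics for a single form administered to a single cohort. The choice of indices (item facility, corrected item-total discrimination, and Cronbach's alpha), the assumption of dichotomously scored items, and the function name `form_item_statistics` are mine and are not prescribed by the frameworks discussed here; operational reports would likely include additional or different analyses, such as IRT-based estimates.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else float("nan")


def form_item_statistics(responses):
    """Summarize one test form from one administration.

    responses: list of test-taker rows, each a list of 0/1 item scores.
    Assumes at least two items and non-zero variance in total scores.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    items = []
    for j in range(n_items):
        column = [row[j] for row in responses]
        # Total score excluding item j, so the item is not correlated with itself.
        rest = [t - c for t, c in zip(totals, column)]
        items.append({
            "item": j + 1,
            "facility": mean(column),              # proportion correct
            "discrimination": pearson(column, rest),
        })
    item_variance_sum = sum(
        pvariance([row[j] for row in responses]) for j in range(n_items)
    )
    alpha = (n_items / (n_items - 1)) * (1 - item_variance_sum / pvariance(totals))
    return {"items": items, "cronbach_alpha": alpha}
```

Because the function takes the response matrix of one form and one cohort as its only input, its output is tied to that administration by construction; publishing such a summary per form, alongside elements (b) and (c), would be one way to keep the evidence context-bound.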

Clarity and Transparency in Score Reporting

My second proposal is that test developers should balance the promotion of their tests’ strengths with a clear and equally prominent account of the limitations of their tests and scores. The goal should be to ensure that test users are fully informed about both the capabilities and constraints of the tests. This aligns with the International Language Testing Association’s (ILTA) (2024) revised Guidelines for Practice, which require test developers to explain “the proper interpretation of test results and any limitation on their accuracy.” An analogy can be drawn with the pharmaceutical industry, which operates under strict accountability and disclosure standards to protect public health.

Applying this concept to language testing, test developers should transparently outline the limitations of individual test forms, disclose the extent of these limitations, and explain how specific features of a test might influence the interpretation and use of test scores. For large-scale assessments used for consequential decision-making, such as those involving admission, immigration, or employment, this can help ensure fair and informed use of test scores.³ Including clear statements such as the following would be essential on the official websites of test developers and in the score reports shared with score users:

The scores from test form X are designed to assess [state intended constructs] and have been validated for [list validated purposes, citing specific validation studies, if available]. While these purposes are supported by empirical evidence, the scores should not be used for decisions or interpretations outside this validated scope. We strongly recommend that any uses beyond the validated purposes be supported by additional, context-specific piloting and validation evidence.

This statement should be periodically updated to caution test users against potential misuses of test scores, especially when high-stakes decisions are involved. In such cases, test developers should issue disclaimers that communicate, in an accessible way, any empirical evidence highlighting known instances of test misuse. In my view, these efforts should be coupled with a proactive research program undertaken, supported, or commissioned by test developers to identify potential misuses and misinterpretations of test scores. Another example, focused on improving transparency in score reporting to universities, is illustrated below:

The test scores from test form X assess language proficiency within the specific contexts covered by this test and should be considered as one indicator of academic language readiness. While designed to assess academic language skills, these scores may not capture the full range of language competencies that a student will draw upon in the specific academic environment at University X.

These transparency statements, disclaimers, and ongoing validation efforts should be featured on test developers’ websites and in reports shared with institutions that use the scores. Such efforts can also be extended to local and researcher-made assessments that have the potential to be used beyond their original contexts. For example, Isaacs et al.’s (2017) Second Language English Comprehensibility Global and Analytic Scales, developed as a formative tool, explicitly state that “the scale should not be used for decision-making that is likely to have consequences on test-takers’ lives. It is solely intended for descriptive purposes to enhance teaching and learning” (p. 2). Measures like these can prevent score users from making incorrect assumptions or decisions based on misunderstandings of what the scores and the test actually represent and of the specific contexts for which they have been validated. Finally, my intention is not to suggest that test developers can control how every test user applies test information, but rather that they can make limitations and intended uses clearly and prominently available through their own official reporting channels.

Conclusion

In sum, Winke (2024) and the follow-up letters to the editor in the 2024 special issue of Language Testing (e.g., Kremmel & Isbell, 2024) point to important principles of OS that can be further extended to improve transparency and accountability in language testing across all contexts. The two proposals outlined above aim to strengthen transparency and shift the emphasis from persuading test users to adopt test products toward clearly articulating the scope and limits of their use. While existing validity frameworks are widely regarded as useful tools for collecting data and building evidence in validation research, it is important to recognize that evidence drawn from particular test forms and contexts cannot be automatically applied beyond them. Without close examination and transparent disclosure of form-specific validity evidence, test developers risk reaching conclusions that fall short of fully meeting the principles of clarity and transparency central to OS, particularly when those conclusions inform high-stakes decisions about test-takers in contexts that deviate from intended test uses. Ultimately, the two proposals advanced in this Viewpoint invite careful reasoning and discussion. Regardless of how future conversations may evolve, it is crucial to move toward an approach that offers greater transparency at the test form level.

Footnotes

Acknowledgements

I would like to thank Xun Yan and Talia Isaacs, co-editors of Language Testing, as well as Andrea Révész for their insightful comments on earlier versions of this paper. I take full responsibility for any remaining limitations. Some sentences in this paper were revised with the assistance of large language models to improve clarity.

ORCID iD

Vahid Aryadoust

Author Contributions

Vahid Aryadoust: Conceptualization; Investigation; Methodology; Validation; Writing—original draft; Writing—review & editing.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Notes

References

Aryadoust (2013). Building a validity argument for an English proficiency test of listening. Cambridge Scholars Publishing.

Aryadoust (2023). The vexing problem of validity and the future of second language assessment. Language Testing, 40(1), 8–14. https://doi.org/10.1177/02655322221125204

Aryadoust, Zakaria, & Jia (2024). Investigating the affordances of OpenAI’s large language model in developing listening assessments. Computers and Education: Artificial Intelligence, 6, 100204. https://doi.org/10.1016/j.caeai.2024.100204

Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford University Press.

Biber (1992). On the complexity of discourse complexity: A multidimensional analysis. Discourse Processes, 15(2), 133–163. https://doi.org/10.1080/01638539209544806

Biber & Larsson (2025). Accounting for the entire system of complexity features: Evidence for general oral versus literate grammatical complexity dimensions. Corpus Linguistics and Linguistic Theory. Advance online publication. https://doi.org/10.1515/cllt-2025-0017

Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign Language™. Routledge.

Clark & Bruce (2024). Open science should be welcomed by test providers but grounded in pragmatic caution: A response to Winke. Language Testing, 41(4), 872–876. https://doi.org/10.1177/02655322231223105

Crossley, S. A., Allen, L. K., & McNamara, D. S. (2014). A multi-dimensional analysis of essay writing: What linguistic features tell us about situational parameters and the effects of language functions on judgments of quality. In T. B. Sardinha & M. V. Pinto (Eds.), Multi-dimensional analysis, 25 years on: A tribute to Douglas Biber (pp. 197–237). John Benjamins. https://doi.org/10.1075/scl.60.07cro

Green, Ünaldi, & Weir (2010). Empiricism versus connoisseurship: Establishing the appropriacy of texts in tests of academic reading. Language Testing, 27(2), 191–211. https://doi.org/10.1177/0265532209349471

International Language Testing Association. (2024). ILTA guidelines for practice in English. https://www.iltaonline.com/page/ILTAGuidelinesforPractice

Isaacs, Trofimovich, & Foote, J. A. (2017). Second language English comprehensibility global and analytic scales (Version 1.0). IRIS Database. http://www.iris-database.org

Isbell, D. R., & Kremmel (2024). Open science and the next generation of language testing research. Language Testing, 41(4), 898–908. https://doi.org/10.1177/02655322241264082

Jin (2022). Test-taker insights for language assessment policies and practices. Language Testing, 40(1), 193–203. https://doi.org/10.1177/02655322221117136

Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000

Koizumi, Maie, Yanagisawa, & In’nami (2024). Considerations to promote and accelerate open science: A response to Winke. Language Testing, 41(4), 892–897. https://doi.org/10.1177/02655322241239379

Kremmel & Isbell, D. R. (2024). Open science practices in language assessment: Introducing the special issue. Language Testing, 41(4), 697–702. https://doi.org/10.1177/02655322241264092

LaFlair, G. T. (2024). An industry perspective on Open Science: A response to Winke. Language Testing, 41(4), 865–871. https://doi.org/10.1177/02655322241261716

Liu, Aryadoust, & Foo (2022). Examining the factor structure and its replicability across multiple listening test forms: Validity evidence for the Michigan English Test. Language Testing, 39(1), 142–171. https://doi.org/10.1177/02655322211018139

Nishizawa (2024). Authenticity of academic lecture passages in high-stakes tests: A temporal fluency perspective. Language Testing, 41(4), 792–816. https://doi.org/10.1177/02655322241262453

Papageorgiou (2024). Can language test providers do more to support Open Science? A response to Winke. Language Testing, 41(4), 860–864. https://doi.org/10.1177/02655322241232361

Rayson (2008). From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4), 519–549. https://doi.org/10.1075/ijcl.13.4.06ray

Tao & Aryadoust (2024). A multidimensional analysis of a high-stakes English listening test: A corpus-based approach. Education Sciences, 14(2), 137. https://doi.org/10.3390/educsci14020137

Wang, Choi, Schmidgall, & Bachman, L. F. (2012). Review of Pearson Test of English Academic: Building an assessment use argument. Language Testing, 29(4), 603–619. https://doi.org/10.1177/0265532212448619

Winke (2024). Sharing, collaborating, and building trust: How Open Science advances language testing. Language Testing, 41(4), 845–859. https://doi.org/10.1177/02655322231211159

Zhao & Aryadoust (2024). An automatized semantic analysis of two large-scale listening tests: A corpus-based study. Language Testing, 42(3), 312–343. https://doi.org/10.1177/02655322241288598