Redefining Teacher Measurement Literacy in a Transforming Assessment Landscape: A Validated Three-Dimensional Framework

Abstract

While assessment literacy has long guided teachers’ assessment practice, the rise of digital assessment presents new challenges in educational measurement, necessitating a renewed focus on Teacher Measurement Literacy (TML). This study reconceptualizes TML and develops a validated framework to support teachers in this evolving context. Using the Delphi method, a three-dimensional, 14-element TML framework was refined through two rounds of expert consultation, demonstrating increased consensus. A corresponding instrument was then administered to 306 pre-service teachers. Confirmatory factor analysis supported the proposed three-dimensional structure. Furthermore, likelihood ratio tests comparing nested Rasch models indicated that a three-dimensional model provided a significantly better fit to the data than a unidimensional alternative. Cross-validation with a sample of 297 in-service teachers further supported the robustness of the multidimensional framework. These findings validate the proposed three-dimensional conception of TML and offer empirical grounding to strengthen teachers’ capacity in the context of evolving measurement practices.

Keywords

teacher measurement literacy, conceptual framework, Delphi method, empirical validation

Introduction

Although scholars proposed teacher measurement literacy at an early stage (Ebel, 1961; Gotch, 2012; Lambert, 1991) to strengthen teachers’ professionalism in assessment practice, existing research has tended to treat measurement-related elements as integral components of assessment literacy (DeLuca et al., 2010; Pastore & Andrade, 2019; Popham, 2018). Assessment literacy is now conceived of as a complex interplay of different components interrelated with social, cultural, policy, professional, and experiential factors (Brookhart, 2024; Pastore & Andrade, 2019). Many definitions encompass not only micro-level measurement knowledge and skills but also macro-level educational value judgments and decision-making (Pastore & Andrade, 2019; Popham, 2018). Such an expansive conceptualization not only blurs the distinction between educational measurement and assessment (Kubiszyn & Borich, 2024) but also overlooks teachers’ knowledge and skills in measurement, thereby diminishing the value of measurement in education.

Promoting teachers’ measurement literacy is essential, as the accuracy and validity of measurement underpin effective assessment and educational decision-making (Newton & Shaw, 2014). Measurement enables teachers to engage with objective evidence, gain deeper insights into student learning, and make informed, equitable decisions (Looney, 2009; Villegas & Irvine, 2010). Teachers with strong measurement literacy can critically evaluate validity assumptions, identify bias, and interpret data contextually (Wylie & Heritage, 2024). Although many studies emphasize these competencies within assessment literacy, we prefer the term measurement literacy, as these knowledge and skills are specifically required for micro-level measurement practices.

The development of digital assessment highlights the need to reaffirm the value of teachers’ measurement literacy. Digital assessment generates richer assessment data, alongside a growing diversity of data analysis methods and models, including process data from interactive assessments of 21st-century skills (Foster & Piacentini, 2023), and real-time learning analytics feedback from digital learning platforms (von Davier et al., 2022). These data differ substantially from traditional test scores in terms of granularity, timing, and interpretive demands, requiring teachers not only to understand how data are generated, structured, and constrained by measurement models, but also to apply advanced analytical approaches, such as machine learning and data mining, to explore response processes and learning trajectories (Estaji et al., 2024). Such competencies have often been discussed under the umbrella of teacher data literacy (Gummer & Mandinach, 2015). However, data literacy alone does not fully address the design, implementation, administration, reporting, and ethical implications embedded in digital assessment practices.

In general, measurement knowledge and skills are an essential component of teachers’ professional development (Rudner & Schafer, 2002; Schafer, 1991). As digital technologies reshape educational measurement, studies increasingly emphasize the need to revise teacher assessment literacy frameworks to place greater emphasis on measurement-related knowledge and skills (Coombe & Davidson, 2022; Estaji et al., 2024). We therefore argue that, alongside the concept of assessment literacy, it is necessary to foreground the concept of teachers’ measurement literacy in order to better support teachers in developing measurement-related knowledge and skills, enabling them to make fuller use of digital assessment to improve teaching and student learning.

Literature Review

Teacher Assessment Literacy

Assessment literacy has long been recognized as a core competence for teachers (Brookhart, 2011; Stiggins, 1991). Early definitions emphasized measurement theory (DeLuca et al., 2010; Mertler, 2003) and the practical knowledge, skills, and behaviors needed for effective assessment (Stiggins, 1991; Taylor, 2009). Stiggins (1991), for instance, stated that advanced assessment literates “must also master test development, administration and scoring, and classical and modern psychometric theory, as well as having the experience to create large-scale assessments of demonstrated validity, reliability, and economy.” At this stage, assessment literacy was largely associated with standardized testing and classroom assessment (Pastore & Andrade, 2019). Seeking to integrate macro- and micro-level perspectives, Brookhart (2011) further redefined assessment literacy through the lens of professional teaching standards, identifying 11 statements that delineate the assessment knowledge and skills of teachers. Since then, assessment literacy has evolved into a multifaceted construct embedded in teacher professional standards and preparation programs (DeLuca et al., 2016; Looney et al., 2018).

More recent scholarship situates teacher assessment literacy (TAL) within socio-cultural contexts. Willis et al. (2013) describe TAL as a “dynamic, context-dependent social practice,” in which teachers and learners negotiate classroom and cultural knowledge through assessment to achieve learning goals. This view informs later work (Estaji et al., 2024; Ye, 2022), which integrates socio-cultural factors with teacher education, producing frameworks that connect assessment to teacher preparation. Other scholars, however, maintain a measurement-oriented stance. Popham (2018) defines TAL through six theoretical concepts—including validity, reliability, and fairness—together with practical assessment procedures. Such divergence highlights the plurality of perspectives in the field (Pastore & Andrade, 2019).

Broadly, TAL can be divided into two orientations: one emphasizing measurement knowledge and skills as its core (DeLuca et al., 2010; Mertler, 2003; Popham, 2011; Stiggins, 1991), and another foregrounding the socio-cultural dimensions of assessment, which has become more prominent in recent research.

However, the expansion of online education and digital technologies has generated new demands for teachers’ assessment competencies (Timmis et al., 2016). Teachers are often expected to make assessment-informed decisions based on learning analytics dashboards that integrate both traditional and novel data in dynamic, customizable visualizations, as well as automated scoring outputs and process indicators. Scholars have called for revisions to existing TAL frameworks to better equip teachers to meet the demands of digitally mediated assessment practices and to support their everyday engagement with complex, data-rich measurement environments (Coombe & Davidson, 2022; Estaji et al., 2024). Increasing attention is now given to digital assessment literacy (Butler-Henderson & Crawford, 2020; Ibna Seraj et al., 2022; Sudakova et al., 2022; Xu & Brown, 2016) and data literacy (Gummer & Mandinach, 2015). Nevertheless, the foundational role of educational measurement remains indispensable.

Teacher Measurement Literacy

Compared with assessment literacy, teacher measurement literacy (TML) has received far less attention in educational research. Existing studies are relatively few and often outdated, with limited scholarly contributions over the past decade (Alkharusi et al., 2011; Daniel & King, 1998; Ebel, 1961; Gotch, 2012). Early work associated TML almost exclusively with standardized testing. Ebel (1961) argued that competent teachers must understand both the purposes and limitations of examinations and be able to design test items. Similarly, Gotch (2012) defined TML as the ability to use and interpret standardized tests. Other scholars emphasized teachers’ responsibility to develop, administer, analyze, and apply test data (Ebel, 1961; Lambert, 1991). Beyond these foundations, references to TML in teacher education literature have been sporadic and inconsistent (Woolfolk, 2016).

Some empirical studies have attempted to frame and measure TML. Alkharusi et al. (2011) examined TML across three dimensions—skills, knowledge, and attitudes—highlighting teachers’ ability to design, score, and apply paper-based tests. Gotch (2012) developed the Teacher Educational Measurement Literacy Scale, comprising two subscales: one assessing measurement knowledge, the other self-perceived efficacy. Findings indicate that in-service teachers often possess limited measurement knowledge and rely heavily on experience and intuition when judging student performance (Daniel & King, 1998). However, they tend to outperform pre-service teachers in practical skills and demonstrate more positive attitudes toward measurement (Alkharusi et al., 2011). Recent work has examined educational measurement coursework and related content areas aimed at building teachers’ capacity for measurement (Randall et al., 2021; Russell et al., 2019).

Advances in digital and artificial intelligence technologies have transformed educational measurement while placing greater demands on TML. Computer-based testing, now common in large-scale assessments such as PISA and TIMSS, has extended measurement beyond traditional knowledge and skills (Maity & Deroy, 2024). These tools enable the assessment of complex abilities, such as complex problem solving, collaborative problem solving, and learning in the digital world (Foster & Piacentini, 2023), creating challenges distinct from conventional testing.

Yet, most existing conceptualizations of teacher measurement literacy remain anchored in paper-based testing and traditional score interpretation. They offer limited guidance for helping teachers design digital assessment tasks, handle rich and complex assessment data, understand inference processes based on advanced models, and use digital tools to apply assessment results to improve instruction. This misalignment between current measurement practices and teachers’ professional preparation highlights the need to reconceptualize TML to better reflect the demands of digital and AI-driven assessment environments.

The Present Study

The literature shows extensive research on teacher assessment literacy, whereas work on teacher measurement literacy remains limited and underdeveloped. Although acknowledged since the 1960s (Ebel, 1961), studies of TML are fragmented and lack a cohesive framework. Most conceptualizations are either subsumed under broader assessment literacy or narrowly focused on traditional tasks such as designing and administering standardized tests (Brookhart, 2024; DeLuca et al., 2010; Mertler, 2003; Stiggins, 1991). Meanwhile, technological developments have transformed assessment practices and posed new challenges for teachers in designing new tools, analyzing and interpreting assessment data, and applying findings effectively. Against this backdrop, we argue that, alongside assessment literacy, greater attention should be paid to measurement knowledge and skills at the micro level. We therefore propose a clearer conceptualization of TML to support teachers in adapting to the ongoing transformation of assessment practices. This study aims to:

(1) Reconceptualize TML, identify its essential elements, and construct a framework;

(2) Validate the proposed TML framework using survey-based empirical data.

Study 1: Development of TML Framework

Participants

We adopted strict criteria for expert selection, requiring substantial expertise in educational measurement or teacher development, active academic influence, and diverse national and professional backgrounds. Based on these criteria, 20 experts were identified through literature and institutional review and invited via email; 16 participated, with 13 in the first round and 12 in the second, including nine Chinese university scholars who took part in both rounds. The panel comprised university scholars and teacher education professionals: 12 from China, one from Australia, two from the United States, and one from Norway. Fourteen experts held doctoral degrees and two held master’s degrees. The panel included six males and 10 females, aged 29–62 years (M = 40.4). Among the 16 experts, 13 specialized in educational measurement and three in teacher professional development.

Constructing the Initial Conceptual Framework

Current research reveals no unified understanding of TML. To address this gap, this study first examined the meanings of key terms such as assessment, educational measurement, and literacy. Based on a comprehensive literature review, we proposed an initial definition: TML refers to the knowledge, skills, and attitudes demonstrated by teachers in selecting, developing, and managing measurement tools; processing and analyzing test data; assessing student performance; and applying results to improve teaching and support student development.

Building on this definition, we integrated insights from traditional measurement literacy frameworks and professional standards (Alkharusi et al., 2011; Gotch, 2012; Randall et al., 2021), together with research on teachers’ assessment literacy in digital environments (Estaji et al., 2024), to identify the key elements of measurement literacy. These include developing, selecting, administering, and managing assessment tools; scoring; processing and analyzing data; and applying results to enhance instructional practice. In terms of framework structure, prior studies commonly emphasized knowledge, skills, and dispositions, which aligns with the KSAVE model (Binkley et al., 2012) that highlights knowledge, skills, attitudes, values, and ethics. Guided by this model, the identified elements of TML were systematically organized into three dimensions: knowledge, skills, and attitudes. This process resulted in an initial conceptual framework comprising three primary dimensions and eleven sub-dimensions.

Expert Consultation

Design and Procedure of Expert Consultation

The expert consultation questionnaire included two components. Experts first reviewed and commented on the initial definition of Teacher Measurement Literacy. They then rated the appropriateness of each framework dimension using a five-point Likert scale (1 = very inappropriate; 5 = very appropriate). Space was provided for open-ended comments and suggestions.

Two rounds of email-based consultation were conducted. In Round 1, experts evaluated the three primary dimensions (knowledge, skills, and attitudes) and their secondary dimensions and provided revision feedback. In Round 2, they re-rated the revised framework. After each round, expert ratings were analyzed to determine whether dimensions met predefined acceptance criteria; dimensions failing to meet these criteria were revised or removed. Appropriateness was evaluated using the mean score, score rate, and full score rate. Inter-expert agreement was assessed using the standard deviation and coefficient of variation (CV), with lower values indicating greater consensus. Consistent with prior research, a mean score of 3.75 and a CV threshold of 0.20 were adopted as decision criteria (Clough et al., 2016). In addition, 57 qualitative comments were coded and integrated to further refine the framework.
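The consensus indicators described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' procedure: the ratings are hypothetical (they are not the panel's actual responses), the SD is taken to be the sample standard deviation, the score rate is taken to be the mean expressed as a percentage of the scale maximum, and the two thresholds are treated as inclusive. The helper names `delphi_indicators` and `meets_criteria` are invented for this example.

```python
from statistics import mean, stdev


def delphi_indicators(ratings, scale_max=5):
    """Summarise one element's expert ratings (hypothetical helper)."""
    m = mean(ratings)
    sd = stdev(ratings)  # sample SD (n - 1); assumed, not stated in the text
    return {
        "mean": m,
        "sd": sd,
        "cv": sd / m,  # coefficient of variation: lower means more consensus
        "score_rate": 100 * m / scale_max,  # mean as a % of the maximum score
        "full_score_rate": 100 * sum(r == scale_max for r in ratings) / len(ratings),
    }


def meets_criteria(indicators, min_mean=3.75, max_cv=0.20):
    """Apply the two decision criteria named in the text (inclusive bounds assumed)."""
    return indicators["mean"] >= min_mean and indicators["cv"] <= max_cv


# Hypothetical ratings from eight experts for a single element.
example = delphi_indicators([5, 5, 5, 5, 5, 4, 4, 3])
print({name: round(value, 3) for name, value in example.items()})
print("meets criteria:", meets_criteria(example))
```

An element whose ratings fail either threshold would be a candidate for revision or removal in the next round.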

Results of the First-Round Consultation

The first-round consultation results (Table 1) show that the overall conceptual framework was considered highly appropriate. Except for “Measurement Statistical Models” (A2), all dimensions achieved mean scores of at least 3.75. Standard deviations ranged from 0.707 to 1.069, and most CV values were below or near 0.2. According to the established benchmarks, most dimensions met acceptable standards; however, expert opinions revealed notable divergence, suggesting that some dimensions and descriptions required further refinement. Experts also evaluated the overall framework and definition, and all expressed agreement, raising no objections to the three first-level dimensions.

Table 1.

First Round Results of Expert Consultation

Dimension	Key elements	Mean	SD	Full credit (%)	Percentage score (%)	CV
Knowledge	Fundamental conceptual (A1)	4.500	0.756	62.50	90.00	0.168
	Measurement statistical models (A2)	3.250	1.035	12.50	65.00	0.318
	Data processing (A3)	4.125	0.835	37.50	82.50	0.202
	Test development (A4)	4.000	0.926	37.50	80.00	0.231
	Applying test (A5)	4.125	0.991	50.00	82.50	0.240
Skills	Test selection and development (B1)	3.750	0.707	12.50	75.00	0.189
	Test administration (B2)	4.000	0.926	37.50	80.00	0.231
	Data processing and analysis (B3)	4.000	1.069	37.50	80.00	0.267
Attitudes	Measurement perceptions (C1)	4.375	0.744	50.00	87.50	0.170
	Measurement wellness (C2)	4.500	0.756	62.50	90.00	0.168
	Measurement responsibility (C3)	4.125	0.835	37.50	82.50	0.202

Based on the statistical results and the 57 expert comments, the framework was revised and refined. A key adjustment was distinguishing classroom assessment from large-scale assessment to clarify the measurement literacy teachers need in different contexts. Within the knowledge dimension, sub-dimension A2 was revised to “Measurement Theory and Statistical Models,” and the descriptions of A1–A5 were refined for greater precision. In the skills dimension, the original sub-dimensions B1–B3 were further disaggregated, leading to three new sub-dimensions: Scoring, Data Processing and Analysis, and Data Interpretation and Application, each representing a distinct aspect of teachers’ competencies. For the attitudes dimension, experts suggested that future research should explore valid and reliable approaches for its assessment.

Results of the Second-Round Consultation

According to the results of the second round of expert consultation (Table 2), the mean scores of all dimensions in the revised framework exceeded 4 on a 5-point scale, with percentage scores above 85%. Standard deviations ranged from 0 to 0.951, all below 1, and CV values ranged from 0 to 0.222, all below 0.3, indicating low variability and strong expert consensus. These results suggest that the revised conceptual framework was generally well accepted.

Table 2.

Results of Second Round Expert Consultation

Dimension	Key elements	Mean	SD	Full credit (%)	Percentage score (%)	CV
Knowledge	Fundamental conceptual (A1)	4.857	0.378	87.50	97.14	0.078
	Measurement theory and statistical models (A2)	4.286	0.951	50.00	85.72	0.222
	Data processing (A3)	4.429	0.535	37.50	88.58	0.121
	Test development (A4)	4.571	0.787	75.00	91.42	0.172
	Applying test (A5)	4.571	0.787	75.00	91.42	0.172
Skills	Test selection (B1)	4.714	0.756	87.50	94.28	0.160
	Test development (B2)	5.000	0.000	100.00	100.00	0.000
	Test administration (B3)	4.429	0.787	50.00	88.58	0.178
	Scoring (B4)	4.857	0.378	87.50	97.14	0.078
	Data processing and analysis (B5)	4.286	0.951	50.00	85.72	0.222
	Data interpretation and application (B6)	4.857	0.378	87.50	97.14	0.078
Attitudes	Measurement perceptions (C1)	5.000	0.000	100.00	100.00	0.000
	Measurement wellness (C2)	4.571	0.787	62.50	91.42	0.172
	Measurement responsibility (C3)	4.571	0.787	62.50	91.42	0.172

Nevertheless, several experts suggested further refinement of certain sub-dimension definitions. In response, the sub-dimension labeled “Measurement theory and statistical models” (A2) was renamed “Measurement theories,” and the descriptions of five sub-dimensions (A1, A5, B4, C1, C3) were revised to enhance clarity, precision, and conceptual consistency.

A comparative analysis of the two rounds of consultation revealed improvements across all indicators. The range of mean values rose from 3.25–4.50 to 4.29–5.00, the range of standard deviations shifted from 0.707–1.069 to 0–0.951, and the range of CV values declined from 0.168–0.318 to 0–0.222. Collectively, these results indicate reduced dispersion, enhanced consensus, and stronger alignment with the reference standards established earlier (see Design and Procedure of Expert Consultation).

Final Conceptual Framework

Given the results of the expert consultation, a third round was deemed unnecessary: the second-round analysis showed clear improvement over the first round, the relevant indicators largely met the established standards, and the revisions suggested by experts required only minor adjustments. Therefore, the version of the TML conceptual framework revised after the second round of consultation was adopted as the final framework. A visual representation of this framework is shown in Figure 1.

Figure 1.

Conceptual framework of TML

The conceptual framework of TML developed in this study consists of three primary dimensions (knowledge, skills, and attitudes) and fourteen sub-dimensions. The framework is structured according to increasing levels of abstraction, progressing from knowledge to skills to attitudes. These dimensions are interdependent. Measurement knowledge underpins the development of skills and attitudes. Skills represent the practical application of knowledge and serve as a channel for shaping attitudes. In turn, attitudes influence the application of skills and the deepening of knowledge. The framework also includes various forms of educational measurement and covers the full measurement process. It outlines the essential knowledge, skills, and attitudes required of teachers, from selecting measurement types and developing tools to analyzing data and applying results to improve instruction and support student learning. Detailed descriptions of each dimension are presented in Appendix Table 1.

Study 2: Empirical Validation of TML Framework

The TML framework was developed using the Delphi method to synthesize expert consensus. To strengthen empirical support, a TML survey instrument was developed and administered in two studies involving pre-service and in-service teachers in China. Empirical validation aimed to examine the theoretical structure of TML through analyses of the instrument’s psychometric properties, including reliability, validity, and nested model comparisons, providing objective evidence for the proposed framework.

Survey Instrument Design and Refinement

Survey Instrument Design

Based on the proposed conceptual framework of TML, a survey instrument was developed to assess teachers’ TML. Its development drew on several established tools, including Jarr’s (2012) Self-Efficacy in Assessment Survey, Part II of Kershaw’s (1993) Professional Classroom Assessment Scale, Mertler’s (2004) Classroom Assessment Literacy Inventory and its revision, Daniel and King’s (1998) Educational Measurement Literacy Scale, and the Teacher Assessment Literacy Scale by Plake et al. (1993). Items from these scales were adapted to align with the framework’s dimensions.

The instrument was designed with several considerations. First, items were intended to elicit objective responses and included two types: factual knowledge questions and situational judgment items. Prior studies suggest Likert-scale items may be affected by subjective bias (Duckworth & Yeager, 2015); therefore, formats emphasizing objectivity were adopted. Knowledge items assessed measurement knowledge (e.g., “Which of the following concepts best reflects the dispersion of a data set?”). Situational items presented problem scenarios requiring respondents to select the most appropriate response, targeting measurement-related skills and attitudes (e.g., choosing an instructional strategy based on exam score distributions). Second, all items had a single correct answer and were scored dichotomously (0/1), allowing for analysis using both raw scores and item response theory (IRT) models. Third, item development involved extensive team discussions to ensure scenarios authentically reflected teaching and measurement practices. Following this process, the initial version comprised 52 items.

Pre-Testing

A pilot study was conducted with 50 pre-service teachers to examine response patterns and gather feedback. Results indicated that the large number of items and complex wording hindered accurate understanding, producing responses misaligned with participants’ actual situations. Accordingly, 10 items were removed, and language was simplified. For instance, “unsafe” was replaced with “dangerous,” and “incorrect” with “wrong.” Key terms such as “correct” and “wrong” were bolded and highlighted in red to enhance clarity. After revision, the final instrument included 42 items, with each of the 14 second-level dimensions represented by three items.

Instrument Refinement Based on Survey Data

A survey was conducted among pre-service teachers to further refine the instrument. Participants were undergraduate pre-service teachers, with diversity ensured across regions, universities, majors, grade levels, and teacher types. With institutional ethical approval, data were collected via an anonymous online questionnaire distributed through WeChat. Participation was voluntary and unrelated to academic evaluation. A total of 350 responses were collected, of which 306 valid questionnaires remained after data screening; the 44 excluded cases were primarily senior male students. Demographic characteristics are reported in Table 3.

Table 3.

Demographic Characteristics of Pre-Service Teachers

Variable	Categories	N	Percentage (%)
Gender	Male	175	42.8
	Female	131	57.2
Grade	Sophomore	75	24.5
	Junior	157	51.3
	Senior	74	24.2
Location	Eastern region	132	43.1
	Central region	99	32.4
	Western region	61	19.9
	Northeast region	14	4.6
Major	Pedagogy	38	12.4
	Primary education	85	27.8
	Subject teaching	112	36.6
	Educational technology	67	21.9
	Other	4	1.3
Type of program^a	Government-funded normal university students	161	52.6
	National outstanding normal students	82	26.8
	Tuition-paying teacher education students	63	20.6

^aGovernment-funded normal university students receive full tuition coverage in exchange for teaching service commitments. National Outstanding Normal Students (NONS) are selected through competitive evaluation at national or provincial levels and typically receive enhanced training and career support. Tuition-paying teacher education students, in contrast to their government-funded counterparts, pay their own tuition.

Reliability analysis and confirmatory factor analysis (CFA) were conducted to examine internal consistency and structural validity. As all items were dichotomously scored (0/1), McDonald’s ω was employed due to its suitability for binary data (Eisinga et al., 2013). Analyses were performed using SPSS 26 and Mplus 8.0. Iterative testing and item screening led to the removal of poorly performing items. The initial TML instrument included 42 items. However, preliminary analyses indicated suboptimal psychometric properties. The overall McDonald’s ω was 0.715, reflecting marginally acceptable reliability (Eisinga et al., 2013). Subscale ω coefficients for knowledge, skills, and attitudes were 0.374, 0.565, and 0.677, respectively, indicating inadequate internal consistency. CFA results showed acceptable χ²/df (2.67) and RMSEA (0.059), but poor incremental fit (CFI = 0.689; TLI = 0.585). Factor loadings ranged from 0.011 to 0.815, suggesting weak structural validity.

Accordingly, items with low factor loadings were removed. In total, 17 of the 42 items were deleted across the knowledge, skills, and attitudes dimensions, resulting in a revised 25-item instrument. Given that subsequent dimensional analyses focused on the three first-level dimensions (measurement knowledge, skills, and attitudes), we ensured that each second-level dimension was represented by at least one item, thereby preserving observational information across all second-level dimensions. Meanwhile, each first-level dimension retained more than three items: 8 for knowledge, 10 for skills, and 7 for attitudes.

Empirical Validation Based on the Pre-Service Teacher Survey

Using data from 306 pre-service teachers on the 25-item instrument, validation involved reliability analysis and confirmatory factor analysis (CFA), along with model comparisons between a unidimensional Rasch model and a within-item multidimensional random coefficients multinomial logit (MRCML) model (Adams et al., 1997), to test whether the proposed three-dimensional structure fit the data better than a unidimensional alternative. The Rasch analyses were conducted using ConQuest 5.
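The nested-model comparison described here can be illustrated with a likelihood ratio test on model deviances. The following is a minimal sketch, not the ConQuest procedure itself: only the unidimensional deviance (8463.186) is taken from the reported results, while the multidimensional deviance and the number of additional parameters are hypothetical placeholders. The helper `chi2_sf` is written for this example from the standard recurrence for the upper incomplete gamma function, so that no third-party library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2:  # odd df: start from Q(1/2, h) = erfc(sqrt(h))
        a, q = 0.5, math.erfc(math.sqrt(h))
    else:       # even df: start from Q(1, h) = exp(-h)
        a, q = 1.0, math.exp(-h)
    while a < df / 2.0:
        # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def likelihood_ratio_test(dev_restricted, dev_general, extra_params):
    """Return the deviance difference and its chi-square p-value."""
    g2 = dev_restricted - dev_general
    return g2, chi2_sf(g2, extra_params)


dev_uni = 8463.186   # unidimensional deviance reported in the text
dev_multi = 8400.0   # HYPOTHETICAL three-dimensional deviance (placeholder)
extra = 5            # ASSUMED: 3 variances + 3 covariances versus 1 variance

g2, p = likelihood_ratio_test(dev_uni, dev_multi, extra)
print(f"G2 = {g2:.3f}, df = {extra}, p = {p:.3g}")
```

A small p-value would favor the more general (multidimensional) model; with real data, the deviance and parameter counts should be read from the software output for each fitted model.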

Reliability Test and CFA Analysis Results

McDonald’s ω Coefficient

The overall McDonald’s ω coefficient of the TML survey instrument was 0.903, indicating strong internal consistency and excellent overall reliability across the 25 items. The knowledge dimension yielded a McDonald’s ω of 0.729, suggesting relatively low but acceptable reliability (Eisinga et al., 2013). In contrast, the skills and attitudes dimensions both demonstrated ω coefficients above 0.8, reflecting high internal consistency and good reliability.
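For readers unfamiliar with McDonald’s ω, the coefficient can be sketched for the simplest case: a single factor with standardized loadings and uncorrelated residuals, where ω is the squared sum of loadings divided by that quantity plus the summed residual variances. The loadings below are hypothetical and the function name is invented for this illustration; the coefficients reported above came from the authors’ software and are not reproduced by this sketch.

```python
def mcdonald_omega(loadings):
    """Omega for a one-factor model with standardized loadings.

    Assumes uncorrelated residuals, so each residual variance is 1 - loading**2.
    """
    common = sum(loadings) ** 2                     # variance due to the factor
    unique = sum(1.0 - l * l for l in loadings)     # summed residual variances
    return common / (common + unique)


# HYPOTHETICAL standardized loadings for a five-item subscale.
hypothetical_loadings = [0.4, 0.5, 0.6, 0.7, 0.8]
print(round(mcdonald_omega(hypothetical_loadings), 3))
```

Because ω depends on both the number of items and the size of their loadings, a short subscale with several weak loadings will tend to yield a lower coefficient than a longer subscale with strong loadings.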

CFA Results

The CFA model fit indices were as follows: χ²/df was 1.55, well below the acceptable threshold of 3. The RMSEA was 0.043, meeting the criterion of less than 0.08. Both the CFI (0.976) and TLI (0.973) exceeded the recommended cutoff of 0.90, while the SRMR also satisfied the criterion (≤0.08). Collectively, these results indicate an excellent model fit, confirming that the three-dimensional conceptual model aligns well with the data. The factor loadings of the final CFA are reported in Table 4. Specifically, loadings for the knowledge dimension ranged from 0.331 to 0.925, for the skills dimension from 0.388 to 0.927, and for the attitudes dimension from 0.474 to 0.967. According to Comrey and Lee (1992), factor loadings above 0.60 are considered very good, and those above 0.328 acceptable, while values below 0.328 may suggest item removal. As all items demonstrated statistically significant loadings above 0.328, the TML scale developed in this study exhibits strong structural validity.

Table 4.

Factor Loadings of the 25-Item TMLS

Dimension	Item	Factor loadings	SE
Knowledge	1	0.335***	0.080
	2	0.359***	0.079
	3	0.648***	0.080
	4	0.559***	0.074
	5	0.925***	0.072
	6	0.705***	0.080
	7	0.498***	0.076
	8	0.331***	0.083
Skills	9	0.388***	0.073
	10	0.539***	0.065
	11	0.448***	0.069
	12	0.714***	0.053
	13	0.925***	0.027
	14	0.927***	0.028
	15	0.728***	0.052
	16	0.895***	0.033
	17	0.678***	0.056
	18	0.442***	0.071
Attitudes	19	0.474***	0.070
	20	0.657***	0.057
	21	0.642***	0.058
	22	0.906***	0.031
	23	0.967***	0.024
	24	0.841***	0.035
	25	0.817***	0.042

Note. *p < 0.1, **p < 0.05, ***p < 0.01.
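The loading ranges discussed for Table 4 can be recomputed directly from the table. The sketch below transcribes the 25 loadings and applies the two cutoffs cited in the text (0.328 and 0.60); the variable and function names are invented for this illustration, and the comparison operators (strictly above 0.60, strictly below 0.328) are an assumption about how the cutoffs are applied.

```python
# Standardized loadings transcribed from Table 4.
loadings = {
    "Knowledge": [0.335, 0.359, 0.648, 0.559, 0.925, 0.705, 0.498, 0.331],
    "Skills": [0.388, 0.539, 0.448, 0.714, 0.925, 0.927, 0.728, 0.895, 0.678, 0.442],
    "Attitudes": [0.474, 0.657, 0.642, 0.906, 0.967, 0.841, 0.817],
}

ACCEPTABLE = 0.328  # cutoff cited in the text
VERY_GOOD = 0.60    # cutoff cited in the text


def summarise(values):
    """Range of loadings plus counts relative to the two cutoffs."""
    return {
        "n_items": len(values),
        "min": min(values),
        "max": max(values),
        "n_very_good": sum(v > VERY_GOOD for v in values),
        "n_below_acceptable": sum(v < ACCEPTABLE for v in values),
    }


summary = {dimension: summarise(values) for dimension, values in loadings.items()}
for dimension, stats in summary.items():
    print(dimension, stats)
```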

Drawing upon results from both reliability analysis and CFA, the TML Survey Instrument demonstrates robust internal consistency and validity. These findings provide strong empirical support for the scientific soundness of the TML conceptual framework and substantiate its applicability in future research contexts.

Rasch Model Fitting Results

Unidimensional Rasch Model Results

The data fitting results of the unidimensional Rasch model for the 25-item survey are shown in Table 5. The model’s final deviance was 8463.186, and the fit was statistically significant (χ² = 1171.67, df = 24, p < 0.001). The overall item reliability reached 0.982, and the person separation reliability was 0.911, indicating a high level of measurement precision.

Table 5.

Item Fit Information

Item	Discrimination(a)	Difficulty(b)	SE	Weighted fit MNSQ
1	0.29	1.02	0.080	1.22
2	0.30	0.66	0.079	1.23
3	0.46	−0.19	0.080	1.10
4	0.42	0.56	0.074	1.15
5	0.51	−1.95	0.072	0.84
6	0.46	1.08	0.080	1.01
7	0.37	1.46	0.076	1.10
8	0.29	0.82	0.083	1.28
9	0.37	0.09	0.073	1.20
10	0.44	0.09	0.065	1.12
11	0.39	0.62	0.069	1.14
12	0.57	−0.42	0.053	0.93
13	0.76	−0.61	0.027	0.69
14	0.75	−0.52	0.028	0.72
15	0.57	−0.04	0.052	0.94
16	0.73	−0.59	0.033	0.71
17	0.55	0.00	0.056	0.97
18	0.40	0.20	0.071	1.17
19	0.43	−0.07	0.070	1.11
20	0.51	0.34	0.057	1.02
21	0.51	−0.20	0.058	0.99
22	0.74	−0.40	0.031	0.74
23	0.81	−0.46	0.024	0.64
24	0.65	−0.92	0.035	0.83
25	0.65	−0.56	0.042	0.85

According to Griffin et al. (2014), the item separation index can be used to assess construct validity, while the person separation index reflects criterion-related validity. Higher separation indices indicate better measurement quality. In terms of item fit, the weighted fit mean square (MNSQ) values ranged from 0.64 to 1.28, falling within the acceptable range of 0.5 to 1.5 (Bond & Fox, 2013). Item difficulty was estimated on the model’s logit scale, which spans from −4 to 4. The item difficulties ranged from −1.943 to 1.461, with an average difficulty of 0.0004, suggesting that the overall test was relatively easy.
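As a minimal sketch of the item-fit screening described above, the snippet below checks the weighted fit MNSQ values transcribed from Table 5 against the 0.5 to 1.5 range. The helper name `misfitting_items` is a hypothetical illustration, not part of ConQuest or of the authors' workflow.

```python
# Weighted fit MNSQ values transcribed from Table 5 (items 1-25, in order).
WEIGHTED_MNSQ = [
    1.22, 1.23, 1.10, 1.15, 0.84, 1.01, 1.10, 1.28,              # items 1-8
    1.20, 1.12, 1.14, 0.93, 0.69, 0.72, 0.94, 0.71, 0.97, 1.17,  # items 9-18
    1.11, 1.02, 0.99, 0.74, 0.64, 0.83, 0.85,                    # items 19-25
]


def misfitting_items(mnsq, lower=0.5, upper=1.5):
    """Return (item number, MNSQ) pairs that fall outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(mnsq, start=1) if not lower <= v <= upper]


flagged = misfitting_items(WEIGHTED_MNSQ)
print(f"min = {min(WEIGHTED_MNSQ)}, max = {max(WEIGHTED_MNSQ)}, flagged = {flagged}")
```

The bounds are passed as parameters so that a stricter range can be tried without changing the function.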

Additionally, the ConQuest software provided item discrimination indices based on Classical Test Theory (CTT). The discrimination values ranged from 0.29 to 0.81. Most items exceeded the commonly accepted threshold of 0.3. Items 1 and 8 had discrimination values of 0.29, which are still close to the acceptable cutoff.

Three-Dimensional Modeling Results

The 25 items were analyzed using a three-dimensional Within-Item MRCML model, with item-dimension associations aligned with the factor loading structure shown in Table 4. The model yielded a final deviance of 8375.353 and demonstrated a statistically significant fit (χ² = 1011.7, df = 22, p < 0.001). The overall item reliability was 0.980, while the person separation reliabilities for the knowledge, skills, and attitudes dimensions were 0.797, 0.874, and 0.879, respectively, indicating high internal consistency across dimensions. The correlations between dimensions were also relatively strong: 0.730 between knowledge and skills, 0.722 between knowledge and attitudes, and 0.885 between skills and attitudes, suggesting particularly close alignment between skills and attitudes. In terms of item fit, the weighted fit mean square (MNSQ) statistics ranged from 0.65 to 1.51. Except for item 19, all items fell within the acceptable range of 0.5 to 1.5, indicating generally good model-data fit.

Model Comparison Results

The likelihood ratio test (LRT) is a widely employed method for comparing nested models (Adams et al., 1997). In this study, the unidimensional Rasch model is nested within the three-dimensional Within-Item MRCML model, so an LRT was conducted by calculating the difference in deviance between the two models. As reported earlier, the unidimensional model yielded a deviance of 8463.186 with 26 estimated parameters, whereas the multidimensional model produced a lower deviance of 8375.353 with 31 estimated parameters. The resulting test statistic was χ²(5) = 87.833, p < .001, indicating that the multidimensional model offers a significantly better fit to the data than the unidimensional model. This finding offers strong empirical support for the three-dimensional conceptual framework of TML.
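The arithmetic behind this comparison can be sketched as follows. The snippet takes the deviances and parameter counts reported above, forms the deviance difference, and refers it to a chi-square distribution whose degrees of freedom equal the difference in estimated parameters. The function names are hypothetical, and the chi-square tail probability is implemented with the standard library only (a closed form valid for integer degrees of freedom) as a convenience assumption; it is not the procedure used by the authors' software.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: finite sum of Poisson-type terms.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: complementary error function plus a finite correction sum.
    total, term = 0.0, math.sqrt(x)
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


def likelihood_ratio_test(dev_restricted, k_restricted, dev_general, k_general):
    """Return (statistic, df, p-value) for two nested models compared by deviance."""
    stat = dev_restricted - dev_general
    df = k_general - k_restricted
    return stat, df, chi2_sf(stat, df)


# Values reported in the text for the pre-service teacher sample.
stat, df, p = likelihood_ratio_test(8463.186, 26, 8375.353, 31)
print(f"chi2({df}) = {stat:.3f}, p < .001: {p < 0.001}")
```

The same function applies to any pair of nested models for which deviances and parameter counts are available.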

Cross-Validation Based on the In-Service Teacher Survey

To further verify the psychometric properties of the 25-item instrument developed from the pre-service teacher sample, a cross-validation study was conducted with 300 in-service teachers in China. Data were collected via an anonymous online questionnaire distributed through WeChat. Participation was voluntary and unrelated to academic evaluation. After data screening, 297 valid responses were retained, and demographic characteristics are reported in Table 6. As in the initial validation, the cross-validation analysis included reliability analysis, confirmatory factor analysis, and model comparisons, providing further evidence for the robustness of the measurement properties and the validity of the proposed TML theoretical structure.

Table 6.

In-Service Teachers’ Characteristics

Variable	Categories	N	Percentage (%)
Gender	Male	141	47.5
Gender	Female	156	52.5
Degree level	Associate degree or below	41	13.8
	Bachelor’s degree	209	70.4
	Master’s degree	47	15.8
Years of teaching experience	Less than 3 years	55	18.5
	4–10 years	115	38.7
	11–20 years	99	33.3
	More than 20 years	28	9.4
Professional title	Senior	97	32.7
	First-level	84	28.3
	Second-level	81	27.3
	Unranked	35	11.8
Graduated from a normal (teacher-training) university	Yes	260	87.5
Graduated from a normal (teacher-training) university	No	37	12.5
School location	Provincial capital or urban area	54	18.2
	County seat	111	37.4
	Township	73	24.6
	Rural village	59	19.9
Teaching subject type	Language and humanities	159	53.5
	Mathematics and sciences	131	44.1
	General/technical subjects	7	2.4
	Physical education and arts	0	0

McDonald’s ω Coefficient

The overall McDonald’s ω for the TML instrument was 0.901, indicating excellent internal consistency across the 25 items. At the dimensional level, the knowledge dimension showed acceptable reliability (ω = 0.746; Eisinga et al., 2013), while the skills and attitudes dimensions demonstrated good to high reliability, with ω values of 0.852 and 0.814, respectively.

CFA Results

The CFA results indicated a good model fit: χ²/df = 2.07 (<3), RMSEA = 0.060 (<0.08), CFI = 0.911, TLI = 0.900. These indices confirm that the three-dimensional TML model fits the data well. All 25 items had factor loadings above 0.5: 0.506–0.819 for knowledge, 0.653–0.964 for skills, and 0.599–0.962 for attitudes, demonstrating that the TML scale has strong structural validity.

Model Comparison Results

For the unidimensional model, the final deviance was 4376.663, with a statistically significant fit (χ² = 2537.02, df = 24, p < 0.001). Overall item reliability was 0.989, and person separation reliability was 0.761, indicating high measurement precision. Item fit was acceptable, with weighted mean-square (MNSQ) values ranging from 0.77 to 1.20, within the recommended range of 0.5–1.5 (Bond & Fox, 2013).

For the three-dimensional model, the final deviance was 4334.721, with a statistically significant fit (χ² = 1313.79, df = 22, p < 0.001). Overall item reliability was 0.977, and person separation reliabilities for knowledge, skills, and attitudes were 0.714, 0.738, and 0.687, respectively. Item fit was generally acceptable, with weighted mean-square (MNSQ) values ranging from 0.69 to 1.28, except for item X16, indicating an overall good model-data fit.

Based on the fit results of the unidimensional and multidimensional models, a likelihood ratio test (LRT) was conducted by comparing their deviance values. The resulting χ² statistic was 41.942 (χ² (5) = 41.942, p < .001), indicating that the multidimensional model fits the data significantly better than the unidimensional model. This provides strong empirical support for the three-dimensional TML framework.

Discussion

Redefining Measurement Literacy for Assessment Transformation

Digital technologies and AI have profoundly transformed educational assessment, requiring teachers to develop more advanced measurement knowledge and skills than those emphasized in paper-based, standardized testing. First, teachers need competencies in digital assessment design. In scenario-based, interactive assessments targeting 21st-century skills, teachers must not only understand traditional test design principles but also know how to design technology-enhanced tasks, interaction formats, and data types. When using generative AI for item development, they need to understand the underlying principles, operate item-generation tools, and develop effective prompt design skills (Chan et al., 2025). Second, the complexity of assessment data requires stronger data processing and analytical skills. Digital assessments generate complex data, such as multimodal process data capturing students’ response behaviors. Extracting reliable measurement evidence from such data is essential for ensuring valid and reliable inferences. Similarly, learning platforms generate process data for embedded assessment; although automated outputs are available, teachers need advanced and targeted analytical techniques to interpret learning processes and learner characteristics in a deeper and more personalized manner (Delgado et al., 2020). When using automated scoring tools for open-ended tasks (e.g., essays), teachers must also understand the basic principles of complex technologies such as LLMs and be proficient in their operation. Third, teachers need enhanced skills in applying measurement results to inform personalized instruction. In digital assessment and learning environments, multiple data sources (including assessment outcomes, response process data, learning process data, and student background information) can be integrated to generate personalized instructional strategies. This requires teachers to possess more advanced modeling capabilities and diverse skills in designing instructional instruments aligned with assessment and learning platforms. Fourth, intelligent assessment demands strong ethical awareness and dispositions. Digital assessment systems collect large volumes of data and embed various intelligent processing techniques. Teachers must prevent data breaches to protect student privacy, ensure equitable use of technology to maintain the objectivity and fairness of assessment processes and outcomes, and identify potential algorithmic biases and errors in AI-based assessment systems (Celik et al., 2022).

Overall, digital assessment requires teachers to develop expanded competencies in designing, implementing, administering, reporting, and addressing ethical implications. These competencies have not been systematically addressed in existing assessment literacy frameworks. Incorporating them is therefore essential for redefining the TML framework and advancing both measurement practice and teacher development.

The TML Framework: A Comprehensive and Practice-Oriented Model

The proposed TML framework includes three primary dimensions—Knowledge, Skills, and Attitudes—encompassing 14 key elements that reflect both enduring principles of measurement and emerging challenges in digital assessment environments.

The knowledge dimension extends beyond basic assessment concepts to include contemporary measurement theories (e.g., IRT, CDM) and computational psychometrics (von Davier et al., 2022). Teachers need to understand not only how to use tests but also how measurement models function, what assumptions they rest on, and how to critically interpret results. Data processing knowledge, including familiarity with software tools such as SPSS or data dashboards, is also essential as data interpretation becomes increasingly central to instructional design (Aburizaizah, 2021). This reflects the fact that today’s measurement-literate teacher must navigate across analog and digital paradigms, moving fluently between theory, algorithmic logic, and classroom pragmatics.

Regarding the skills dimension, the TML framework emphasizes practical skills in selecting, constructing, administering, and analyzing assessments. This extends to interpreting data and applying results to inform teaching and learning (Aburizaizah, 2021; DeLuca et al., 2013). For example, skills in test development and data interpretation empower teachers to move from passive test users to active, responsive professionals who use assessment to adapt instruction and support learning diversity. This skill orientation supports the increasing global emphasis on formative, authentic, and performance-based assessment (Inman & Roberts, 2021). Moreover, the capacity to interpret and apply data aligns with the rise of data-use cultures in schools, where teachers are expected to collaborate around evidence, adjust instruction, and communicate results to stakeholders (Inman & Roberts, 2021).

Beyond knowledge and skills, the framework includes a values-driven component. With the proliferation of algorithmic assessment tools and automated scoring systems, the ethical use of measurement has never been more vital (Ackerman et al., 2024). Teachers must develop a responsible and ethical stance toward measurement, which includes understanding issues such as privacy, fairness of data use, and algorithmic bias (Bulut et al., 2024). Attitudes like openness to learning (measurement wellness) and a commitment to student-centered interpretation (measurement responsibility) are essential in countering the risks of data misuse or algorithmic bias. By anchoring TML in ethical and reflective orientations, the framework addresses not only what teachers must know and do, but also how they must think and feel about assessment.

Implications for Teacher Development and Assessment Practice

The reconceptualized TML framework has significant implications for teacher education. First, it necessitates a reengineering of pre-service curricula. In the context of digital assessment, teachers require more robust and theoretically grounded training in psychometrics, data analytics, and ethical judgment (Ackerman et al., 2024; Randall et al., 2021; Russell et al., 2019). Second, the framework underscores the need for longitudinal and developmental approaches to professional learning. TML is not static; it evolves with new tools, new data sources, and new societal expectations. Hence, professional development must be iterative, reflective, and embedded in authentic teaching contexts (Willis et al., 2013). Third, the framework can inform the reconstruction of teacher performance standards and licensing criteria. As measurement becomes central to teacher evaluation, instructional improvement, and accountability, it is imperative that such evaluations reflect deep and ethical measurement knowledge—not merely test scores or compliance with assessment protocols.

The framework also invites a rethinking of assessment practice. Policymakers often assume that more data equals better decisions. Yet, without TML, teachers may lack the capacity to critique, contextualize, or challenge the interpretations imposed by external testing regimes (Brousselle & Buregeya, 2018). A measurement-literate teaching force is therefore essential not only for pedagogical integrity but also for democratic educational governance. Furthermore, the framework aligns with global movements toward assessment for learning and balanced assessment systems (Marion et al., 2024). In these paradigms, teachers are not passive recipients of test results but central players in co-constructing meaningful and equitable evidence of student learning. Finally, as AI and platform technologies increasingly shape how assessments are designed and delivered, the TML framework can serve as a bulwark against over-automation. Teachers must retain epistemic authority in interpreting data and shaping the conditions under which it is generated, used, or rejected.

Limitations and Research Directions

This study has several limitations. First, the initial development of the framework could have been strengthened by involving practitioners such as in-service teachers and current pre-service teachers, which would have enhanced the framework’s relevance to real teaching and assessment practices. Second, to ensure objectivity in data collection during the validation process, we employed dichotomously scored (0/1) items. While this approach enhanced scoring objectivity, it also posed challenges for data analysis and may have led to an underestimation of the framework’s reliability and validity. For example, the knowledge dimension included items with higher difficulty levels, which may have prompted “guessing” behavior among participants, thereby lowering the internal consistency of that dimension. Third, the empirical validation was based on a relatively small sample size and lacked longitudinal data, limiting the generalizability of the findings to a broader population. As a result, further validation of the conceptual framework in diverse and practical settings remains necessary.

These limitations also point to possible directions for future research. The TML conceptual framework can be applied in a wider range of empirical investigations, particularly in cross-cultural contexts, to further validate its structure and explore the development of teachers’ measurement literacy. Such studies could offer more targeted insights into how to support teachers in improving their knowledge, skills, and attitudes related to educational assessment. It is also hoped that more researchers will continue to engage with the topic of TML, conducting deeper and more systematic inquiries that contribute to the ongoing advancement of educational measurement and teacher education.

Conclusion

In redefining Teacher Measurement Literacy, this study advances a vision of teachers as informed, ethical, and reflective practitioners equipped to navigate complex assessment landscapes. Specifically, it makes three core contributions to the fields of teacher education and educational assessment: (1) The proposed TML framework highlights the importance of measurement knowledge, skills, and attitudes, comprising 14 elements. (2) Empirical evidence supports the proposed three-dimensional structure. (3) The proposed framework provides a theoretical foundation to inform teacher education, professional development, and empirical research.

As education systems confront rapid change, the proposed framework offers a comprehensive model that aligns with the realities and possibilities of contemporary assessment. By emphasizing not only what teachers should know and do but also how they should think and feel about measurement, the TML framework lays a strong foundation for cultivating reflective, ethical, and competent educators. Its adoption in research, practice, and policy could contribute meaningfully to more equitable, effective, and data-literate educational systems.

Footnotes

Acknowledgments

The authors would like to express their gratitude to the experts who provided invaluable comments during the research process, and to the participating teachers for their data support for this study. In addition, gratitude is extended to all the researchers for their dedication and support of this study.

ORCID iD

Jianlin Yuan

Ethical Considerations

The design of this study followed the guidelines and regulations of the Declaration of Helsinki and was approved by China Basic Education Quality Monitoring Collaborative Innovation Center with the approval number: 2020-07, dated March 13, 2020.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by Education Bureau of Hunan Province (grant number Z2023045), and Hunan Office of Philosophy and Social Science (grant number 20YBA056).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data can be made available when requested by interested researchers. The data cannot be made public due to ethical and privacy issues.

Appendix

Table A1.

The Illustration of TML Framework

Dimension	Key elements	Description
Knowledge	Fundamental conceptual	Master the fundamental concepts and terminology within the field of measurement, including traditional concepts such as reliability, validity, difficulty, discrimination, bias, scales, and norms, as well as the basic meanings of summative assessment, formative assessment, and value-added assessment, and key concepts in digital assessment contexts such as automated scoring, learning analytics, machine learning, and intelligent assessment.
	Measurement theories	Understanding the major measurement theories and models—such as classical test theory, item response theory, cognitive diagnostic theory, equating theory, and computational psychometrics—including their conceptual foundations, application principles, scopes of use, and respective strengths and limitations.
	Data processing	Mastery of commonly used data processing methods and basic data analysis tools, including an understanding of the principles, meanings, and limitations of concepts such as means, standard deviations, variability, correlation, and growth curves, as well as familiarity with data analysis software and platforms such as Excel, SPSS, and learning analytics systems.
	Test development	Understanding the basic types of tests (e.g., standardized tests, norm-referenced tests, formative and summative assessments, adaptive tests, computer-based and interactive assessments), the fundamental processes of test development, measurement instruments (e.g., scales, test papers, interactive tasks), item formats (constructed-response and selected-response items), and different modes of administration (online vs. offline, group vs. individual), including their respective advantages, limitations, and appropriate contexts of use. This also includes knowledge of the key components of different types of tests and important considerations in test design.
	Applying test	Understanding the meanings represented by results from different types of tests. Knowing how to use test results to inform instructional practice (e.g., diagnosing students’ learning status, providing individualized guidance, and optimizing instructional processes), and being able to apply learning theories (e.g., the zone of proximal development) to dynamically adjust assessment content and methods based on assessment results and students’ actual conditions. This also includes the ability to leverage measurement results from digital platforms to support personalized instruction and tutoring.
Skills	Test selection	Being able to select measurement instruments that are appropriate to students’ ability levels based on instructional content and assessment objectives. Being able to choose suitable test types and determine assessment administration procedures according to the assessment context. This also includes the ability to use intelligent assessment systems to select appropriate sets of measurement tools.
	Test development	Being able to scientifically develop measurement instruments based on instructional content, assessment purposes, and students’ ability levels. This includes designing multiple-choice items, open-ended items, and contextualized tasks; using digital assessment tools to construct test items and intelligent platforms for automated test assembly; and collaborating with technical specialists to develop interactive tests and technology-enhanced items.
	Test administration	Appropriately using measurement instruments, establishing assessment administration procedures, and conducting testing smoothly, while being able to handle unexpected situations to ensure proper implementation. This also includes familiarity with commonly used school-based online assessment systems and the ability to organize students to successfully complete online assessment activities.
	Scoring	Being able to develop scoring rules and criteria, and to assign scores to students’ responses in accordance with established scoring standards while ensuring objectivity and fairness. This also includes the ability to use digital tools to improve the efficiency of scoring.
	Data processing and analysis	Being able to effectively process and analyze collected assessment data in accordance with assessment purposes, or to use digital assessment platforms for data visualization. This also includes integrating assessment data with students’ background information, learner characteristics, and instructional data, and conducting in-depth analyses to generate evidence for improving teaching.
	Data interpretation and application	Understanding the meanings represented by results obtained from different data processing methods and being able to interpret them correctly. Being able to comprehend the various assessment results reported by digital testing platforms. Using diagnostic results to optimize instructional processes, improve instructional design, and develop targeted teaching improvement plans.
Attitudes	Measurement perceptions	Being able to critically recognize the value of educational measurement, understanding its significance for diagnosing student learning and improving teaching, while also being aware of its limitations. Viewing digital and intelligent assessment critically, maintaining the central role of human judgment, and avoiding overreliance on digital or intelligent assessment tools. One should not blindly trust the data.
	Measurement wellness	Willing to expand and update one’s knowledge and skills in educational measurement, proactively exploring and innovating assessment methods (including digital and intelligent assessments) to optimize assessment processes and enhance their functions. Actively seeking ways to apply assessment results to improve teaching.
	Measurement responsibility	Paying attention to the ethical issues of educational measurement, including data security, personal privacy, fairness and validity of assessments, and the ethics of intelligent algorithms. Viewing measurement results comprehensively and objectively, and reporting them responsibly to relevant stakeholders.

References

Aburizaizah

S. J.

(2021). Data-informed educational decision making to improve teaching and learning outcomes of EFL. Journal of Education and Learning, 10(5), 17–29. https://doi.org/10.1111/bjet.13407

Ackerman

T. A.

Bandalos

D. L.

Briggs

D. C.

Everson

H. T.

A. D.

Lottridge

S. M.

Wind

S. A.

Sinharay

Rodriguez

M. C.

Russell

von Davier

A. A.

(2024). Foundational competencies in educational measurement. Educational Measurement: Issues and Practice, 43(3), 7–17. https://doi.org/10.1111/emip.12581

Adams

R. J.

Wilson

Wang

(1997). The multidimentional random coefficients multinominal logit model. Applied Psychological Measurement, 21(1), 1–23. https://doi.org/10.1177/0146621697211001

Alkharusi

Kazem

A. M.

Al-Musawai

(2011). Knowledge, skills, and attitudes of preservice and inservice teachers in educational measurement. Asia-Pacific Journal of Teacher Education, 39(2), 113–123. https://doi.org/10.1080/1359866X.2011.560649

Binkley

Erstad

Herman

Raizen

Ripley

Miller-Ricci

Rumble

(2012). Defining twenty-first-century skills. In: Assessment and Teaching of 21st Century Skills, (pp. 17–66). Springer Science & Business Media. https://doi.org/10.1007/978-94-007-2324-5_2

Bond

T. G.

Fox

C. M.

(2013). Applying the rasch model: Fundamental measurement in the human sciences. Psychology Press. https://doi.org/10.4324/9781410614575

Brookhart

S. M.

(2011). Educational assessment knowledge and skills for teachers. Educational Measurement: Issues and Practice, 30(1), 3–12. https://doi.org/10.1111/j.1745-3992.2010.00195.x

Brookhart

S. M.

(2024). Educational assessment knowledge and skills for teachers revisited. Education Sciences, 14(7), 751. https://doi.org/10.3390/educsci14070751

Brousselle

Buregeya

J. M.

(2018). Theory-based evaluations: Framing the existence of a new theory in evaluation and the rise of the 5th generation. Evaluation, 24(2), 153–168. https://doi.org/10.1177/1356389018765487

10.

Bulut

Beiting-Parrish

Casabianca

J. M.

Slater

S. C.

Jiao

SongMorilova

D. P.

(2024). The rise of artificial intelligence in educational measurement: Opportunities and ethical challenges. arXiv Preprint arXiv:2406.18900. https://doi.org/10.59863/MIQL7785

11.

Butler-Henderson

Crawford

(2020). A systematic review of online examinations: A pedagogical innovation for scalable authentication and integrity. Computers & Education, 159(5), 104024. https://doi.org/10.1016/i.compedu.2020.104024

12.

Celik

Dindar

Muukkonen

Järvelä

(2022). The promises and challenges of artificial intelligence for teachers: A systematic review of research. TechTrends, 66(4), 616–630. https://doi.org/10.1007/s11528-022-00715-y

13.

Chan

K. W.

Ali

Park

Sham

K. S. B.

Tan

E. Y. T.

ChongSze

F. W. C. G. K.

(2025). Automatic item generation in various STEM subjects using large language model prompting. Computers and Education: Artificial Intelligence, 8(6), 100344. https://doi.org/10.1016/j.caeai.2024.100344

14.

Clough

Shehabi

Morgan

(2016). Medical risk assessment in dentistry: Use of the American society of anesthesiologists physical status classification. British Dental Journal, 220(3), 103–108. https://doi.org/10.1038/sj.bdj.2016.87

15.

Comrey

A. L.

Lee

H. B.

(1992). A first course in factor analysis (2nd ed.). Psychology Press. https://doi.org/10.4324/9781315827506

16.

Coombe

Davidson

(2022). Language assessment literacy. In Mohebbi

Coombe

(Eds.), Research questions in language education and applied linguistics: A reference guide (pp. 343–347). Springer. https://doi.org/10.1007/978-3-030-79143-8

17.

Daniel

L. G.

King

D. A.

(1998). Knowledge and use of testing and measurement literacy of elementary and secondary teachers. The Journal of Educational Research, 91(6), 331–344. https://doi.org/10.1080/00220679809597563

18.

Delgado

H. O. K.

de Azevedo Fay

Sebastiany

M. J.

Silva

A. D. C.

(2020). Artificial intelligence adaptive learning tools: The teaching of English in focus. BELT-Brazilian English Language Teaching Journal, 11(2), e38749. https://doi.org/10.15448/2178-3640.2020.2.38749

19.

DeLuca

Chavez

Cao

(2013). Establishing a foundation for valid teacher judgement on student learning: The role of pre-service assessment education. Assessment in Education: Principles, Policy & Practice, 20(1), 107–126. https://doi.org/10.1080/0969594X.2012.668870

20.

DeLuca

Klinger

Searle

Shulha

(2010). Developing a curriculum for assessment education. Assessment Matters, 2, 133–155. https://doi.org/10.3316/INFORMIT.330301976917998

21.

DeLuca

LaPointe-McEwan

Luhanga

(2016). Teacher assessment literacy: A review of international standards and measures. Educational Assessment, Evaluation and Accountability, 28(3), 251–272. https://doi.org/10.1007/s11092-015-9233-6

22.

Duckworth

A. L.

Yeager

D. S.

(2015). Measurement matters: Assessing personal qualities other than cognitive ability for educational purposes. Educational Researcher, 44(4), 237–251. https://doi.org/10.3102/0013189X15584327

23.

Ebel

R. L.

(1961). Improving the competence of teachers in educational measurement. The clearing house. A Journal of Educational Strategies, Issues and Ideas, 36(2), 67–71. https://doi.org/10.1080/00098655.1961.11475810

24.

Eisinga

Grotenhuis

M. T.

Pelzer

(2013). The reliability of a two-item scale: Pearson, cronbach, or spearman-brown? International Journal of Public Health, 58(4), 637–642. https://doi.org/10.1007/s00038-012-0416-3

25.

Estaji

Banitalebi

Brown

G. T.

(2024). The key competencies and components of teacher assessment literacy in digital environments: A scoping review. Teaching and Teacher Education, 141, 104497. https://doi.org/10.1016/j.tate.2024.104497

26.

Foster

Piacentini

(Eds.), (2023). Innovating assessments to measure and support complex skills. OECD Publishing. https://doi.org/10.1787/e5f3e341-en

27. Gotch, C. M. (2012). An investigation of teacher educational measurement literacy. Washington State University. https://www.proquest.com/dissertations-theses/investigation-teacher-educational-measurement/docview/1112071363/se-2

28. Griffin, Care, & Harding, S. M. (2014). Task characteristics and calibration. In Assessment and teaching of 21st century skills: Methods and approach (pp. 133–178). Springer Netherlands. https://doi.org/10.1007/978-94-017-9395-7_7

29. Gummer, E. S., & Mandinach, E. B. (2015). Building a conceptual framework for data literacy. Teachers College Record: The Voice of Scholarship in Education, 117(4), 1–22. https://doi.org/10.1177/016146811511700401

30. Ibna Seraj, P. M., Chakraborty, Mehdi, & Roshid, M. M. (2022). A systematic review on pedagogical trends and assessment practices during the COVID-19 pandemic: Teachers' and students' perspectives. Education Research International, 2022(1), 1534018. https://doi.org/10.1155/2022/1534018

31. Inman, T. F., & Roberts, J. L. (2021). Authentic, formative, and informative: Assessment of advanced learning. In Modern curriculum for gifted and advanced academic students (pp. 205–236). Routledge.

32. Jarr, K. A. (2012). Education practitioners' interpretation and use of assessment results. Doctoral dissertation, University of Iowa. https://doi.org/10.17077/etd.35vh2oc1

33. Kershaw IV (1993). Ohio vocational education teachers' perceived use of student assessment information in educational decision-making. The Ohio State University. https://www.proquest.com/dissertations-theses/ohio-vocational-education-teachers-perceived-use/docview/304060852/se-2

34. Kubiszyn & Borich, G. D. (2024). Educational testing and measurement (pp. 1–23). John Wiley & Sons.

35. Lambert, N. M. (1991). The crisis in measurement literacy in psychology and education. Educational Psychologist, 26(1), 23–35. https://doi.org/10.1207/s15326985ep2601_2

36. Looney, Cumming, van Der Kleij, & Harris (2018). Reconceptualising the role of teachers as assessors: Teacher assessment identity. Assessment in Education: Principles, Policy & Practice, 25(5), 442–467. https://doi.org/10.1080/0969594X.2016.1268090

37. Looney, J. W. (2009). Assessment and innovation in education. OECD Education Working Papers, No. 24. OECD Publishing. https://doi.org/10.1787/222814543073

38. Maity & Deroy (2024). The future of learning in the age of generative AI: Automated question generation and assessment with large language models. arXiv Preprint. https://doi.org/10.48550/arXiv.2410.09576

39. Marion, S. F., Pellegrino, J. W., & Berman, A. I. (2024). Reimagining balanced assessment systems. National Academy of Education.

40. Mertler, C. A. (2003). Preservice versus inservice teachers' assessment literacy: Does classroom experience make a difference? https://policycommons.net/artifacts/14881227/preservice-versus-inservice-teachers-assessment-literacy/15779931/

41. Mertler, C. A. (2004). Secondary teachers' assessment literacy: Does classroom experience make a difference? American Secondary Education, pp. 49–64. https://www.jstor.org/stable/41064623

42. Newton, P. E., & Shaw, S. D. (2014). Validity in educational and psychological assessment. Sage.

43. Pastore & Andrade, H. L. (2019). Teacher assessment literacy: A three-dimensional model. Teaching and Teacher Education, 84, 128–138. https://doi.org/10.1016/j.tate.2019.05.003

44. Plake, B. S., Impara, J. C., & Fager, J. J. (1993). Assessment competencies of teachers: A national survey. Educational Measurement: Issues and Practice, 12(4), 10–12. https://doi.org/10.1111/j.1745-3992.1993.tb00548.x

45. Popham, W. J. (2011). Assessment literacy overlooked: A teacher educator's confession. The Teacher Educator, 46(4), 265–273. https://doi.org/10.1080/08878730.2011.605048

46. Popham, W. J. (2018). Assessment literacy for educators in a hurry. ASCD.

47. Randall, Rios, J. A., & Jung, H. J. (2021). Graduate training in educational measurement and psychometrics: A curriculum review of graduate programs in the US. Practical Assessment, Research & Evaluation, 26, 2. https://doi.org/10.7275/y1v0-wm37

48. Rudner, L. M., & Schafer, W. D. (2002). What teachers need to know about assessment. Washington, DC: National Education Association.

49. Russell, Ludlow, & O'Dwyer (2019). Preparing the next generation of educational measurement specialists: A call for programs with an integrated scope and sequence. Educational Measurement: Issues and Practice, 38(4), 78–86. https://doi.org/10.1111/emip.12285

50. Schafer, W. D. (1991). Essential assessment skills in the professional education of teachers. Educational Measurement: Issues and Practice, 10(1), 3–6. https://doi.org/10.1111/j.1745-3992.1991.tb00170.x

51. Stiggins (1991). Assessment literacy. The Phi Delta Kappan, 72(7), 534–539. https://www.elibrary.ru/item.asp?id=1583759

52. Sudakova, N. E., Savina, T. N., Masalimova, A. R., Mikhaylovsky, M. N., Karandeeva, L. G., & Zhdanov, S. P. (2022). Online formative assessment in higher education: Bibliometric analysis. Education Sciences, 12(3), 209. https://doi.org/10.3390/educsci12030209

53. Taylor (2009). Developing assessment literacy. Annual Review of Applied Linguistics, 29, 21–36. https://doi.org/10.1017/S0267190509090035

54. Timmis, Broadfoot, Sutherland, & Oldfield (2016). Rethinking assessment in a digital age: Opportunities, challenges and risks. British Educational Research Journal, 42(3), 454–476. https://doi.org/10.1002/berj.3215

55. Villegas, A. M., & Irvine, J. J. (2010). Diversifying the teaching force: An examination of major arguments. The Urban Review, 42(3), 175–192. https://doi.org/10.1007/s11256-010-0150-1

56. von Davier, A. A., Mislevy, R. J., & Hao (Eds.). (2022). Computational psychometrics: New methodologies for a new generation of digital learning and assessment. Springer Nature.

57. Willis, Adie, & Klenowski (2013). Conceptualising teachers' assessment literacies in an era of curriculum and assessment reform. The Australian Educational Researcher, 40(2), 241–256. https://doi.org/10.1007/s13384-013-0089-9

58. Woolfolk (2016). Educational psychology. Pearson. https://thuvienso.hoasen.edu.vn/handle/12345678919007

59. Wylie, E. C., & Heritage (2024). Assessment literacy and professional learning. In Marion, Pellegrino, & Berman (Eds.), Reimagining balanced assessment systems. National Academy of Education. https://doi.org/10.31094/2024/1/5

60. Brown, G. T. (2016). Teacher assessment literacy in practice: A reconceptualization. Teaching and Teacher Education, 58, 149–162. https://doi.org/10.1016/j.tate.2016.05.010

61. (2022). Urban public-school teacher assessment literacy in China Zhengzhou city. Education, 51(1), 3–13. https://doi.org/10.1080/03004279.2021.2025128