Abstract
While assessment literacy has long guided teachers’ assessment practice, the rise of digital assessment presents new challenges in educational measurement, necessitating a renewed focus on Teacher Measurement Literacy (TML). This study reconceptualizes TML and develops a validated framework to support teachers in this evolving context. Using the Delphi method, a three-dimensional, 14-element TML framework was refined through two rounds of expert consultation, demonstrating increased consensus. A corresponding instrument was then administered to 306 pre-service teachers. Confirmatory factor analysis supported the proposed three-dimensional structure. Furthermore, likelihood ratio tests comparing nested Rasch models indicated that a three-dimensional model provided a significantly better fit to the data than a unidimensional alternative. Cross-validation with a sample of 297 in-service teachers further supported the robustness of the multidimensional framework. These findings validate the proposed three-dimensional conception of TML and offer empirical grounding to strengthen teachers’ capacity in the context of evolving measurement practices.
Introduction
Although scholars proposed teacher Measurement Literacy at an early stage (Ebel, 1961; Gotch, 2012; Lambert, 1991) to strengthen teachers’ professionalism in assessment practice, existing research has tended to treat measurement-related elements as integral components of assessment literacy (DeLuca et al., 2010; Pastore & Andrade, 2019; Popham, 2018). Assessment literacy is now conceived of as a complex interplay of different components interrelated with social, cultural, policy, professional, and experiential factors (Brookhart, 2024; Pastore & Andrade, 2019). Many definitions encompass not only micro-level measurement knowledge and skills but also macro-level educational value judgments and decision-making (Pastore & Andrade, 2019; Popham, 2018). Such an expansive conceptualization not only blurs the distinction between educational measurement and assessment (Kubiszyn & Borich, 2024) but also overlooks teachers’ knowledge and skills in measurement, thereby diminishing the value of measurement in education.
Promoting teachers’ measurement literacy is essential, as the accuracy and validity of measurement underpin effective assessment and educational decision-making (Newton & Shaw, 2014). Measurement enables teachers to engage with objective evidence, gain deeper insights into student learning, and make informed, equitable decisions (Looney, 2009; Villegas & Irvine, 2010). Teachers with strong measurement literacy can critically evaluate validity assumptions, identify bias, and interpret data contextually (Wylie & Heritage, 2024). Although many studies emphasize these competencies within assessment literacy, we prefer the term measurement literacy, because this knowledge and these skills are specifically required for micro-level measurement practices.
The development of digital assessment highlights the need to reaffirm the value of teachers’ measurement literacy. Digital assessment generates richer assessment data, including process data from interactive assessments of 21st-century skills (Foster & Piacentini, 2023) and real-time learning analytics feedback from digital learning platforms (von Davier et al., 2022), alongside a growing diversity of data analysis methods and models. These data differ substantially from traditional test scores in terms of granularity, timing, and interpretive demands, requiring teachers not only to understand how data are generated, structured, and constrained by measurement models, but also to apply advanced analytical approaches, such as machine learning and data mining, to explore response processes and learning trajectories (Estaji et al., 2024). Such competencies have often been discussed under the umbrella of teacher data literacy (Gummer & Mandinach, 2015). However, data literacy alone does not fully address the design, implementation, administration, and reporting of digital assessments, or the ethical implications embedded in these practices.
In general, measurement knowledge and skills are an essential component of teachers’ professional development (Rudner & Schafer, 2002; Schafer, 1991). As digital technologies reshape educational measurement, studies increasingly emphasize the need to revise teacher assessment literacy frameworks to place greater emphasis on measurement-related knowledge and skills (Coombe & Davidson, 2022; Estaji et al., 2024). Accordingly, we argue that, alongside the concept of assessment literacy, it is necessary to foreground the concept of teachers’ measurement literacy in order to better support teachers in developing measurement-related knowledge and skills, enabling them to make fuller use of digital assessment to improve teaching and student learning.
Literature Review
Teacher Assessment Literacy
Assessment literacy has long been recognized as a core competence for teachers (Brookhart, 2011; Stiggins, 1991). Early definitions emphasized measurement theory (DeLuca et al., 2010; Mertler, 2003) and the practical knowledge, skills, and behaviors needed for effective assessment (Stiggins, 1991; Taylor, 2009). Stiggins (1991), for instance, stated that advanced assessment literates “must also master test development, administration and scoring, and classical and modern psychometric theory, as well as having the experience to create large-scale assessments of demonstrated validity, reliability, and economy.” At this stage, assessment literacy was largely associated with standardized testing and classroom assessment (Pastore & Andrade, 2019). Seeking to integrate macro- and micro-level perspectives, Brookhart (2011) further redefined assessment literacy through the lens of professional teaching standards, identifying 11 statements that delineate the assessment knowledge and skills of teachers. Since then, assessment literacy has evolved into a multifaceted construct embedded in teacher professional standards and preparation programs (DeLuca et al., 2016; Looney et al., 2018).
More recent scholarship situates teacher assessment literacy (TAL) within socio-cultural contexts. Willis et al. (2013) describe TAL as a “dynamic, context-dependent social practice,” in which teachers and learners negotiate classroom and cultural knowledge through assessment to achieve learning goals. This view informs later work (Estaji et al., 2024; Ye, 2022), which integrates socio-cultural factors with teacher education, producing frameworks that connect assessment to teacher preparation. Other scholars, however, maintain a measurement-oriented stance. Popham (2018) defines TAL through six theoretical concepts—including validity, reliability, and fairness—together with practical assessment procedures. Such divergence highlights the plurality of perspectives in the field (Pastore & Andrade, 2019).
Broadly, TAL can be divided into two orientations: one emphasizing measurement knowledge and skills as its core (DeLuca et al., 2010; Mertler, 2003; Popham, 2011; Stiggins, 1991), and another foregrounding the socio-cultural dimensions of assessment, which has become more prominent in recent research.
However, the expansion of online education and digital technologies has generated new demands for teachers’ assessment competencies (Timmis et al., 2016). Teachers are often expected to make assessment-informed decisions based on learning analytics dashboards that integrate both traditional and novel data in dynamic, customizable visualizations, as well as automated scoring outputs and process indicators. Scholars have called for revisions to existing TAL frameworks to better equip teachers to meet the demands of digitally mediated assessment practices and to support their everyday engagement with complex, data-rich measurement environments (Coombe & Davidson, 2022; Estaji et al., 2024). Increasing attention is now given to digital assessment literacy (Butler-Henderson & Crawford, 2020; Ibna Seraj et al., 2022; Sudakova et al., 2022; Xu & Brown, 2016) and data literacy (Gummer & Mandinach, 2015). Nevertheless, the foundational role of educational measurement remains indispensable.
Teacher Measurement Literacy
Compared with assessment literacy, teacher measurement literacy (TML) has received far less attention in educational research. Existing studies are relatively few and often outdated, with limited scholarly contributions over the past decade (Alkharusi et al., 2011; Daniel & King, 1998; Ebel, 1961; Gotch, 2012). Early work associated TML almost exclusively with standardized testing. Ebel (1961) argued that competent teachers must understand both the purposes and limitations of examinations and be able to design test items. Similarly, Gotch (2012) defined TML as the ability to use and interpret standardized tests. Other scholars emphasized teachers’ responsibility to develop, administer, analyze, and apply test data (Ebel, 1961; Lambert, 1991). Beyond these foundations, references to TML in teacher education literature have been sporadic and inconsistent (Woolfolk, 2016).
Some empirical studies have attempted to frame and measure TML. Alkharusi et al. (2011) examined TML across three dimensions—skills, knowledge, and attitudes—highlighting teachers’ ability to design, score, and apply paper-based tests. Gotch (2012) developed the Teacher Educational Measurement Literacy Scale, comprising two subscales: one assessing measurement knowledge, the other self-perceived efficacy. Findings indicate that in-service teachers often possess limited measurement knowledge and rely heavily on experience and intuition when judging student performance (Daniel & King, 1998). However, they tend to outperform pre-service teachers in practical skills and demonstrate more positive attitudes toward measurement (Alkharusi et al., 2011). Recent work has examined educational measurement coursework and related content areas aimed at building teachers’ capacity for measurement (Randall et al., 2021; Russell et al., 2019).
Advances in digital and artificial intelligence technologies have transformed educational measurement while placing greater demands on TML. Computer-based testing, now common in large-scale assessments such as PISA and TIMSS, has extended measurement beyond traditional knowledge and skills (Maity & Deroy, 2024). These tools enable the assessment of complex abilities, such as complex problem solving, collaborative problem solving, and learning in the digital world (Foster & Piacentini, 2023), creating challenges distinct from conventional testing.
Yet, most existing conceptualizations of teacher measurement literacy remain anchored in paper-based testing and traditional score interpretation. They offer limited guidance for helping teachers design digital assessment tasks, handle rich and complex assessment data, understand inference processes based on advanced models, and use digital tools to apply assessment results to improve instruction. This misalignment between current measurement practices and teachers’ professional preparation highlights the need to reconceptualize TML to better reflect the demands of digital and AI-driven assessment environments.
The Present Study
The literature shows extensive research on teacher assessment literacy, whereas work on teacher measurement literacy remains limited and underdeveloped. Although acknowledged since the 1960s (Ebel, 1961), studies of TML are fragmented and lack a cohesive framework. Most conceptualizations are either subsumed under broader assessment literacy or narrowly focused on traditional tasks such as designing and administering standardized tests (Brookhart, 2024; DeLuca et al., 2010; Mertler, 2003; Stiggins, 1991). Meanwhile, technological developments have transformed assessment practices and posed new challenges for teachers in designing new tools, analyzing and interpreting assessment data, and applying findings effectively. Against this backdrop, we argue that, alongside assessment literacy, greater attention should be paid to measurement knowledge and skills at the micro level. We therefore propose a clearer conceptualization of TML to support teachers in adapting to the ongoing transformation of assessment practices. This study aims to: (1) Reconceptualize TML, identify its essential elements, and construct a framework; (2) Validate the proposed TML framework using survey-based empirical data.
Study 1: Development of TML Framework
Participants
We adopted strict criteria for expert selection, requiring substantial expertise in educational measurement or teacher development, active academic influence, and diverse national and professional backgrounds. Based on these criteria, 20 experts were identified through literature and institutional review and invited via email; 16 participated, with 13 in the first round and 12 in the second, including nine Chinese university scholars who took part in both rounds. The panel comprised university scholars and teacher education professionals: 12 from China, one from Australia, two from the United States, and one from Norway. Fourteen experts held doctoral degrees and two held master’s degrees. The panel included six males and 10 females, aged 29–62 years (M = 40.4). Among the 16 experts, 13 specialized in educational measurement and three in teacher professional development.
Constructing the Initial Conceptual Framework
Current research reveals no unified understanding of TML. To address this gap, this study first examined the meanings of key terms such as assessment, educational measurement, and literacy. Based on a comprehensive literature review, we proposed an initial definition: TML refers to the knowledge, skills, and attitudes demonstrated by teachers in selecting, developing, and managing measurement tools; processing and analyzing test data; assessing student performance; and applying results to improve teaching and support student development.
Building on this definition, we integrated insights from traditional measurement literacy frameworks and professional standards (Alkharusi et al., 2011; Gotch, 2012; Randall et al., 2021), together with research on teachers’ assessment literacy in digital environments (Estaji et al., 2024), to identify the key elements of measurement literacy. These include developing, selecting, administering, and managing assessment tools; scoring; processing and analyzing data; and applying results to enhance instructional practice. In terms of framework structure, prior studies commonly emphasized knowledge, skills, and dispositions, which aligns with the KSAVE model (Binkley et al., 2012) that highlights knowledge, skills, attitudes, values, and ethics. Guided by this model, the identified elements of TML were systematically organized into three dimensions: knowledge, skills, and attitudes. This process resulted in an initial conceptual framework comprising three primary dimensions and eleven sub-dimensions.
Expert Consultation
Design and Procedure of Expert Consultation
The expert consultation questionnaire included two components. Experts first reviewed and commented on the initial definition of Teacher Measurement Literacy. They then rated the appropriateness of each framework dimension using a five-point Likert scale (1 = very inappropriate; 5 = very appropriate). Space was provided for open-ended comments and suggestions.
Two rounds of email-based consultation were conducted. In Round 1, experts evaluated the three primary dimensions (knowledge, skills, and attitudes) and their secondary dimensions and provided revision feedback. In Round 2, they re-rated the revised framework. After each round, expert ratings were analyzed to determine whether dimensions met predefined acceptance criteria; dimensions failing to meet these criteria were revised or removed. Appropriateness was evaluated using the mean score, score rate, and full score rate. Inter-expert agreement was assessed using the standard deviation and coefficient of variation (CV), with lower values indicating greater consensus. Consistent with prior research, a minimum mean score of 3.75 and a maximum CV of 0.20 were adopted as decision criteria (Clough et al., 2016). In addition, 57 qualitative comments were coded and integrated to further refine the framework.
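As a minimal sketch of how the consensus indicators described above could be computed for a single dimension, the following Python snippet derives the mean, standard deviation, CV, score rate, and full score rate from a list of ratings. The ratings are hypothetical, and several details are assumptions rather than the documented procedure of this study: the sample standard deviation is used, the score rate is taken as the mean divided by the maximum score, the full score rate is the share of experts giving the maximum rating, and a dimension counts as accepted when its mean is at least 3.75 and its CV is at most 0.20.

```python
from statistics import mean, stdev


def delphi_summary(ratings, max_score=5, mean_cut=3.75, cv_cut=0.20):
    """Summarize expert ratings (1..max_score) for one framework dimension.

    All definitions here are illustrative assumptions, not the study's code.
    """
    m = mean(ratings)
    sd = stdev(ratings)          # sample SD (n - 1 in the denominator)
    cv = sd / m                  # coefficient of variation
    return {
        "mean": m,
        "sd": sd,
        "cv": cv,
        "score_rate": m / max_score,
        "full_score_rate": sum(r == max_score for r in ratings) / len(ratings),
        "accepted": m >= mean_cut and cv <= cv_cut,
    }


# Hypothetical ratings from 13 first-round experts for one sub-dimension.
example = [5, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5]
summary = delphi_summary(example)
print({k: round(v, 3) if isinstance(v, float) else v for k, v in summary.items()})
```

A dimension with a low mean or widely dispersed ratings would return `accepted: False` and, under the procedure described above, would be revised or removed before the next round.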
Results of the First-Round Consultation
First Round Results of Expert Consultation
Based on the statistical analysis results and the 57 expert comments, the framework was revised and refined. A key adjustment was distinguishing classroom assessment from large-scale assessment to clarify the measurement literacy teachers need in different contexts. Within the knowledge dimension, sub-dimension A2 was revised to “Measurement Theory and Statistical Models,” and the descriptions of A1–A5 were refined for greater precision. In the skills dimension, the original sub-dimensions B1–B3 were further disaggregated, leading to three new sub-dimensions: Scoring, Data Processing and Analysis, and Data Interpretation and Application, each representing a distinct aspect of teachers’ competencies. For the attitudes dimension, experts suggested that future research should explore valid and reliable approaches for its assessment.
Results of the Second-Round Consultation
Results of Second Round Expert Consultation
Nevertheless, several experts suggested further refinement of certain sub-dimension definitions. In response, the sub-dimension originally labeled “Measurement theories and statistical models” was renamed “Measurement theories,” and the descriptions of five sub-dimensions (A1, A5, B4, C1, C3) were revised to enhance clarity, precision, and conceptual consistency.
A comparative analysis of the two rounds of consultation revealed improvements across all indicators. The mean values increased from 3.25–4.5 to 4.28–5.0, the standard deviations decreased from 0.707–1.069 to 0–0.951, and CV values declined from 0.168–0.318 to 0–0.222. Collectively, these results indicate reduced dispersion, enhanced consensus, and stronger alignment with the reference standards established earlier (see Section 3.3.1).
Final Conceptual Framework
Given the results of the expert consultation, a third round was deemed unnecessary: the second-round analysis showed clear improvement over the first round, the relevant indicators largely met the established standards, and the revisions suggested by experts required only minor adjustments. Therefore, the version of the conceptual framework revised after the second round of consultation was adopted as the final framework. A visual representation of this framework is shown in Figure 1.
Figure 1. Conceptual framework of TML
The conceptual framework of TML developed in this study consists of three primary dimensions (knowledge, skills, and attitudes) and fourteen sub-dimensions. The framework is structured according to increasing levels of abstraction, progressing from knowledge to skills to attitudes. These dimensions are interdependent. Measurement knowledge underpins the development of skills and attitudes. Skills represent the practical application of knowledge and serve as a channel for shaping attitudes. In turn, attitudes influence the administration of skills and the deepening of knowledge. The framework also includes various forms of educational measurement and covers the full measurement process. It outlines the essential knowledge, skills, and attitudes required of teachers, from selecting measurement types and developing tools to analyzing data and applying results to improve instruction and support student learning. Detailed descriptions of each dimension are presented in Appendix Table 1.
Study 2: Empirical Validation of TML Framework
The TML framework was developed using the Delphi method to synthesize expert consensus. To strengthen empirical support, a TML survey instrument was developed and administered in two studies involving pre-service and in-service teachers in China. Empirical validation aimed to examine the theoretical structure of TML through analyses of the instrument’s psychometric properties, including reliability, validity, and nested model comparisons, providing objective evidence for the proposed framework.
Survey Instrument Design and Refinement
Survey Instrument Design
Based on the proposed conceptual framework of TML, a survey instrument was developed to assess teachers’ TML. Its development drew on several established tools, including Jarr’s (2012) Self-Efficacy in Assessment Survey, Part II of Kershaw’s (1993) Professional Classroom Assessment Scale, Mertler’s (2004) Classroom Assessment Literacy Inventory and its revision, Daniel and King’s (1998) Educational Measurement Literacy Scale, and the Teacher Assessment Literacy Scale by Plake et al. (1993). Items from these scales were adapted to align with the framework’s dimensions.
The instrument was designed with several considerations. First, items were intended to elicit objective responses and included two types: factual knowledge questions and situational judgment items. Prior studies suggest Likert-scale items may be affected by subjective bias (Duckworth & Yeager, 2015); therefore, formats emphasizing objectivity were adopted. Knowledge items assessed measurement knowledge (e.g., “Which of the following concepts best reflects the dispersion of a data set?”). Situational items presented problem scenarios requiring respondents to select the most appropriate response, targeting measurement-related skills and attitudes (e.g., choosing an instructional strategy based on exam score distributions). Second, all items had a single correct answer and were scored dichotomously (0/1), allowing for analysis using both raw scores and item response theory (IRT) models. Third, item development involved extensive team discussions to ensure scenarios authentically reflected teaching and measurement practices. Following this process, the initial version comprised 52 items.
Pre-Testing
A pilot study was conducted with 50 pre-service teachers to examine response patterns and gather feedback. Results indicated that the large number of items and complex wording hindered accurate understanding, producing responses misaligned with participants’ actual situations. Accordingly, 10 items were removed, and language was simplified. For instance, “unsafe” was replaced with “dangerous,” and “incorrect” with “wrong.” Key terms such as “correct” and “wrong” were bolded and highlighted in red to enhance clarity. After revision, the final instrument included 42 items, with each of the 14 second-level dimensions represented by three items.
Instrument Refinement Based on Survey Data
Demographic Characteristics of Pre-Service Teachers
Note. a = Government-funded normal university students who receive full tuition coverage in exchange for teaching service commitments. National Outstanding Normal Students (NONS) selected through competitive evaluation at national/provincial levels, who typically receive enhanced training and career support. Tuition-paying teacher education students in contrast to government-funded counterparts.
Reliability analysis and confirmatory factor analysis (CFA) were conducted to examine internal consistency and structural validity. As all items were dichotomously scored (0/1), McDonald’s ω was employed due to its suitability for binary data (Eisinga et al., 2013). Analyses were performed using SPSS 26 and Mplus 8.0. Iterative testing and item screening were used to identify and remove poorly performing items. The initial TML instrument included 42 items, and preliminary analyses indicated suboptimal psychometric properties. The overall McDonald’s ω was 0.715, reflecting marginally acceptable reliability (Eisinga et al., 2013). Subscale ω coefficients for knowledge, skills, and attitudes were 0.374, 0.565, and 0.677, respectively, indicating inadequate internal consistency. CFA results showed acceptable χ²/df (2.67) and RMSEA (0.059), but poor incremental fit (CFI = 0.689; TLI = 0.585). Factor loadings ranged from 0.011 to 0.815, suggesting weak structural validity.
Accordingly, items with low factor loadings were removed. In total, 17 items were deleted across the knowledge, skills, and attitudes dimensions, resulting in a revised 25-item instrument. Given that subsequent dimensional analyses focused on the three first-level dimensions (measurement knowledge, skills, and attitudes), we ensured that each second-level dimension was represented by at least one item, thereby preserving observational information across all second-level dimensions. Meanwhile, each first-level dimension retained more than three items: 8 items for knowledge, 10 for skills, and 7 for attitudes.
Empirical Validation Based on the Pre-Service Teacher Survey
Using data from 306 pre-service teachers on the 25-item instrument, validation involved reliability analysis and confirmatory factor analysis (CFA), along with model comparisons between a unidimensional Rasch model and a multidimensional within-item MRCML model (Adams et al., 1997), to test whether the proposed three-dimensional structure fit the data better than a unidimensional alternative. The Rasch analyses were conducted using ConQuest 5.
Reliability Test and CFA Analysis Results
McDonald’s ω Coefficient
The overall McDonald’s ω coefficient of the TML survey instrument was 0.903, indicating strong internal consistency and excellent overall reliability across the 25 items. The knowledge dimension yielded a McDonald’s ω of 0.729, suggesting relatively low but acceptable reliability (Eisinga et al., 2013). In contrast, the skills and attitudes dimensions both demonstrated ω coefficients above 0.8, reflecting high internal consistency and good reliability.
CFA Results
Factor Loadings of the 25-Item TMLS
Note. *p < 0.1, **p < 0.05, ***p < 0.01.
Drawing upon results from both reliability analysis and CFA, the TML Survey Instrument demonstrates robust internal consistency and validity. These findings provide strong empirical support for the scientific soundness of the TML conceptual framework and substantiate its applicability in future research contexts.
Rasch Model Fitting Results
Unidimensional Rasch Model Results
Item Fit Information
According to Griffin et al. (2014), the item separation index can be used to assess construct validity, while the person separation index reflects criterion-related validity. Higher separation indices indicate better measurement quality. In terms of item fit, the weighted fit mean square (MNSQ) values ranged from 0.64 to 1.23, falling within the acceptable range of 0.5 to 1.5 (Bond & Fox, 2013). Item difficulty was estimated on the model’s logit scale, which spans from −4 to 4. The item difficulties ranged from −1.943 to 1.461, with an average difficulty of 0.0004, indicating that item difficulties were centered near zero on the logit scale.
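For readers less familiar with the logit scale, the sketch below evaluates the standard dichotomous Rasch model, P(X = 1) = exp(θ − b) / (1 + exp(θ − b)), at the easiest and hardest item difficulties reported above. The person ability θ = 0 is an illustrative assumption and the function name is hypothetical; this is not the ConQuest estimation routine, only the response function that the estimated difficulties feed into.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    theta: person ability in logits; difficulty: item difficulty in logits.
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# Reported extremes of the item difficulty range (logits).
easiest, hardest = -1.943, 1.461

# Assumed ability of a respondent located at the origin of the scale.
theta = 0.0

p_easy = rasch_probability(theta, easiest)
p_hard = rasch_probability(theta, hardest)
print(f"P(correct | easiest item) = {p_easy:.3f}")
print(f"P(correct | hardest item) = {p_hard:.3f}")
```

When ability equals item difficulty, the function returns 0.5, which is why the spread of difficulties relative to the ability distribution matters for interpreting how demanding the instrument is.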
Additionally, the ConQuest software provided item discrimination indices based on Classical Test Theory (CTT). The discrimination values ranged from 0.29 to 0.81. Most items exceeded the commonly accepted threshold of 0.3. Items 1 and 8 had discrimination values of 0.29, which are still close to the acceptable cutoff.
Three-Dimensional Modeling Results
The 25 items were analyzed using a three-dimensional within-item MRCML model, with item-dimension associations aligned with the factor loading structure shown in Table 4. The model yielded a final deviance of 8375.353 and demonstrated a statistically significant fit (χ² = 1011.7, df = 22, p < 0.001). The overall item reliability was 0.980, while the person separation reliabilities for the knowledge, skills, and attitudes dimensions were 0.797, 0.874, and 0.879, respectively, indicating high internal consistency across dimensions. The correlations between dimensions were also relatively strong: 0.730 between knowledge and skills, 0.722 between knowledge and attitudes, and 0.885 between skills and attitudes, suggesting particularly close alignment between skills and attitudes. In terms of item fit, the weighted fit mean square (MNSQ) statistics ranged from 0.65 to 1.51. Except for item x19, all items fell within the acceptable range of 0.5 to 1.5, indicating generally good model-data fit.
Model Comparison Results
The likelihood ratio test (LRT) is a widely employed method for comparing nested models (Adams et al., 1997). In this study, the unidimensional Rasch model is nested within the three-dimensional within-item MRCML model, and an LRT was conducted by calculating the difference in deviance between these models. As reported earlier, the unidimensional model yielded a deviance of 8463.186 with 26 estimated parameters, whereas the multidimensional model produced a lower deviance of 8375.353 with 31 estimated parameters. The resulting χ² statistic was 87.833 (χ²(5) = 87.833, p < .001), indicating that the multidimensional model offers a significantly better fit to the data than the unidimensional model. This finding offers strong empirical support for the three-dimensional conceptual framework of TML.
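The arithmetic behind this comparison can be sketched as follows. The snippet takes the deviances and parameter counts reported above, forms the deviance difference and the difference in parameter counts, and evaluates the upper-tail chi-square probability. The function names are hypothetical, and the chi-square tail is computed with a closed-form expression for integer degrees of freedom using only the Python standard library; this is an illustration, not the ConQuest output.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term = math.exp(-half)
        total = term
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus a finite series of half-integer powers of x.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total


def likelihood_ratio_test(dev_restricted, k_restricted, dev_general, k_general):
    """Compare nested models via their deviances (-2 log-likelihood)."""
    statistic = dev_restricted - dev_general
    df = k_general - k_restricted
    return statistic, df, chi2_sf(statistic, df)


# Values reported for the pre-service sample.
stat, df, p_value = likelihood_ratio_test(8463.186, 26, 8375.353, 31)
print(f"chi-square = {stat:.3f}, df = {df}, p < 0.001: {p_value < 0.001}")
```

With these inputs the statistic is 87.833 on 5 degrees of freedom, matching the values reported in the text.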
Cross-Validation Based on the In-Service Teacher Survey
In-Service Teachers’ Characteristics
McDonald’s ω Coefficient
The overall McDonald’s ω for the TML instrument was 0.901, indicating excellent internal consistency across the 25 items. At the dimensional level, the knowledge dimension showed acceptable reliability (ω = 0.746; Eisinga et al., 2013), while the skills and attitudes dimensions demonstrated good to high reliability, with ω values of 0.852 and 0.814, respectively.
CFA Results
The CFA results indicated a good model fit: χ²/df = 2.07 (<3), RMSEA = 0.060 (<0.08), CFI = 0.911, TLI = 0.900. These indices confirm that the three-dimensional TML model fits the data well. All 25 items had factor loadings above 0.5: 0.506–0.819 for knowledge, 0.653–0.964 for skills, and 0.599–0.962 for attitudes, demonstrating that the TML scale has strong structural validity.
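As a rough consistency check on the reported indices, one common point-estimate formula for RMSEA is sqrt(max(χ² − df, 0) / (df (N − 1))), which can be rewritten in terms of the ratio χ²/df. The sketch below applies it to the reported χ²/df = 2.07 and the in-service sample of 297 teachers. The choice of formula is an assumption: Mplus may compute RMSEA differently depending on the estimator, so the result is indicative only.

```python
import math


def rmsea_from_ratio(chi2_over_df, n):
    """RMSEA point estimate from the chi-square/df ratio and sample size n.

    Uses sqrt(max(chi2 - df, 0) / (df * (n - 1))), which simplifies to
    sqrt(max(chi2/df - 1, 0) / (n - 1)). Assumed formula, for illustration.
    """
    return math.sqrt(max(chi2_over_df - 1.0, 0.0) / (n - 1))


# Reported ratio and sample size for the in-service teacher sample.
rmsea = rmsea_from_ratio(2.07, 297)
print(f"RMSEA = {rmsea:.3f}")
```

Under these assumptions the value rounds to 0.060, in line with the reported RMSEA.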
Model Comparison Results
For the unidimensional model, the final deviance was 4376.663, with a statistically significant fit (χ² = 2537.02, df = 24, p < 0.001). Overall item reliability was 0.989, and person separation reliability was 0.761, indicating high measurement precision. Item fit was acceptable, with weighted mean-square (MNSQ) values ranging from 0.77 to 1.20, within the recommended range of 0.5–1.5 (Bond & Fox, 2013).
For the three-dimensional model, the final deviance was 4334.721, with a statistically significant fit (χ² = 1313.79, df = 22, p < 0.001). Overall item reliability was 0.977, and person separation reliabilities for knowledge, skills, and attitudes were 0.714, 0.738, and 0.687, respectively. With the exception of item X16, weighted mean-square (MNSQ) values ranged from 0.69 to 1.28, indicating good overall model-data fit.
Based on the fit results of the unidimensional and multidimensional models, a likelihood ratio test (LRT) was conducted by comparing their deviance values. The resulting χ² statistic was 41.942 (χ²(5) = 41.942, p < .001), indicating that the multidimensional model fits the data significantly better than the unidimensional model. This provides strong empirical support for the three-dimensional TML framework.
Discussion
Redefining Measurement Literacy for Assessment Transformation
Digital technologies and AI have profoundly transformed educational assessment, requiring teachers to develop more advanced measurement knowledge and skills than those emphasized in paper-based and standardized testing.
Overall, digital assessment requires teachers to develop expanded competencies in designing, implementing, administering, and reporting assessments, and in addressing their ethical implications. These competencies have not been systematically addressed in existing assessment literacy frameworks. Incorporating them is therefore essential for redefining the TML framework and advancing both measurement practice and teacher development.
The TML Framework: A Comprehensive and Practice-Oriented Model
The proposed TML framework includes three primary dimensions—Knowledge, Skills, and Attitudes—encompassing 14 key elements that reflect both enduring principles of measurement and emerging challenges in digital assessment environments.
The knowledge dimension extends beyond basic assessment concepts to include contemporary measurement theories (e.g., IRT, CDM) and computational psychometrics (von Davier et al., 2022). Teachers need to understand not only how to use tests but also how measurement models function, what assumptions they rest on, and how to critically interpret results. Data processing knowledge, including familiarity with software tools such as SPSS or data dashboards, is also essential as data interpretation becomes increasingly central to instructional design (Aburizaizah, 2021). This reflects the fact that today’s measurement-literate teacher must navigate both analog and digital paradigms, moving fluently between theory, algorithmic logic, and classroom pragmatics.
In the skills dimension, the TML framework emphasizes practical skills in selecting, constructing, administering, and analyzing assessments. This extends to interpreting data and applying results to inform teaching and learning (Aburizaizah, 2021; DeLuca et al., 2013). For example, skills in test development and data interpretation empower teachers to move from passive test users to active, responsive professionals who use assessment to adapt instruction and support learning diversity. This skill orientation supports the increasing global emphasis on formative, authentic, and performance-based assessment (Inman & Roberts, 2021). Moreover, the capacity to interpret and apply data aligns with the rise of data-use cultures in schools, where teachers are expected to collaborate around evidence, adjust instruction, and communicate results to stakeholders (Inman & Roberts, 2021).
Beyond knowledge and skills, the framework includes a values-driven component: the attitudes dimension. With the proliferation of algorithmic assessment tools and automated scoring systems, the ethical use of measurement has never been more vital (Ackerman et al., 2024). Teachers must develop a responsible and ethical stance toward measurement, which includes understanding issues such as privacy, fairness in data use, and algorithmic bias (Bulut et al., 2024). Attitudes such as openness to learning (measurement wellness) and a commitment to student-centered interpretation (measurement responsibility) are essential in countering the risks of data misuse or algorithmic bias. By anchoring TML in ethical and reflective orientations, the framework addresses not only what teachers must know and do, but also how they must think and feel about assessment.
Implications for Teacher Development and Assessment Practice
The reconceptualized TML framework has significant implications for teacher education. First, it necessitates a reengineering of pre-service curricula. In the context of digital assessment, teachers require more robust and theoretically grounded training in psychometrics, data analytics, and ethical judgment (Ackerman et al., 2024; Randall et al., 2021; Russell et al., 2019). Second, the framework underscores the need for longitudinal and developmental approaches to professional learning. TML is not static; it evolves with new tools, new data sources, and new societal expectations. Hence, professional development must be iterative, reflective, and embedded in authentic teaching contexts (Willis et al., 2013). Third, the framework can inform the reconstruction of teacher performance standards and licensing criteria. As measurement becomes central to teacher evaluation, instructional improvement, and accountability, it is imperative that such evaluations reflect deep and ethical measurement knowledge—not merely test scores or compliance with assessment protocols.
The framework also invites a rethinking of assessment practice. Policymakers often assume that more data equals better decisions. Yet, without TML, teachers may lack the capacity to critique, contextualize, or challenge the interpretations imposed by external testing regimes (Brousselle & Buregeya, 2018). A measurement-literate teaching force is therefore essential not only for pedagogical integrity but also for democratic educational governance. Furthermore, the framework aligns with global movements toward assessment for learning and balanced assessment systems (Marion et al., 2024). In these paradigms, teachers are not passive recipients of test results but central players in co-constructing meaningful and equitable evidence of student learning. Finally, as AI and platform technologies increasingly shape how assessments are designed and delivered, the TML framework can serve as a bulwark against over-automation. Teachers must retain epistemic authority in interpreting data and shaping the conditions under which it is generated, used, or rejected.
Limitations and Research Directions
This study has several limitations. First, the initial development of the framework could have been strengthened by involving practitioners such as in-service teachers and current pre-service teachers, which would have enhanced the framework’s relevance to real teaching and assessment practices. Second, to ensure objectivity in data collection during the validation process, we employed dichotomously scored (0/1) items. While this approach enhanced scoring objectivity, it also posed challenges for data analysis and may have led to an underestimation of the framework’s reliability and validity. For example, the knowledge dimension included items with higher difficulty levels, which may have prompted “guessing” behavior among participants, thereby lowering the internal consistency of that dimension. Third, the empirical validation was based on a relatively small sample size and lacked longitudinal data, limiting the generalizability of the findings to a broader population. As a result, further validation of the conceptual framework in diverse and practical settings remains necessary.
These limitations also point to possible directions for future research. The TML conceptual framework can be applied in a wider range of empirical investigations, particularly in cross-cultural contexts, to further validate its structure and explore the development of teachers’ measurement literacy. Such studies could offer more targeted insights into how to support teachers in improving their knowledge, skills, and attitudes related to educational assessment. It is also hoped that more researchers will continue to engage with the topic of TML, conducting deeper and more systematic inquiries that contribute to the ongoing advancement of educational measurement and teacher education.
Conclusion
In redefining Teacher Measurement Literacy, this study advances a vision of teachers as informed, ethical, and reflective practitioners equipped to navigate complex assessment landscapes. Specifically, it makes three core contributions to the fields of teacher education and educational assessment: (1) the proposed TML framework highlights the importance of measurement knowledge, skills, and attitudes, comprising 14 elements; (2) empirical evidence supports the proposed three-dimensional structure; and (3) the proposed framework provides a theoretical foundation to inform teacher education, professional development, and empirical research.
As education systems confront rapid change, the proposed framework offers a comprehensive model that aligns with the realities and possibilities of contemporary assessment. By emphasizing not only what teachers should know and do but also how they should think and feel about measurement, the TML framework lays a strong foundation for cultivating reflective, ethical, and competent educators. Its adoption in research, practice, and policy could contribute meaningfully to more equitable, effective, and data-literate educational systems.
Footnotes
Acknowledgments
The authors would like to express their gratitude to the experts who provided invaluable comments during the research process, and to the participating teachers for contributing data to this study. The authors also thank all the researchers involved for their dedication and support.
Ethical Considerations
The design of this study followed the guidelines and regulations of the Declaration of Helsinki and was approved by the China Basic Education Quality Monitoring Collaborative Innovation Center (approval number 2020-07, dated March 13, 2020).
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Education Bureau of Hunan Province (grant number Z2023045) and the Hunan Office of Philosophy and Social Science (grant number 20YBA056).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Data can be made available when requested by interested researchers. The data cannot be made public due to ethical and privacy issues.
Appendix
The Illustration of TML Framework
| Dimension | Key elements | Description |
| --- | --- | --- |
| Knowledge | Fundamental conceptual | Master the fundamental concepts and terminology within the field of measurement, including traditional concepts such as reliability, validity, difficulty, discrimination, bias, scales, and norms, as well as the basic meanings of summative assessment, formative assessment, and value-added assessment, and key concepts in digital assessment contexts such as automated scoring, learning analytics, machine learning, and intelligent assessment. |
| Knowledge | Measurement theories | Understanding the major measurement theories and models (such as classical test theory, item response theory, cognitive diagnostic theory, equating theory, and computational psychometrics), including their conceptual foundations, application principles, scopes of use, and respective strengths and limitations. |
| Knowledge | Data processing | Mastery of commonly used data processing methods and basic data analysis tools, including an understanding of the principles, meanings, and limitations of concepts such as means, standard deviations, variability, correlation, and growth curves, as well as familiarity with data analysis software and platforms such as Excel, SPSS, and learning analytics systems. |
| Knowledge | Test development | Understanding the basic types of tests (e.g., standardized tests, norm-referenced tests, formative and summative assessments, adaptive tests, computer-based and interactive assessments), the fundamental processes of test development, measurement instruments (e.g., scales, test papers, interactive tasks), item formats (constructed-response and selected-response items), and different modes of administration (online vs. offline, group vs. individual), including their respective advantages, limitations, and appropriate contexts of use. This also includes knowledge of the key components of different types of tests and important considerations in test design. |
| Knowledge | Applying test | Understanding the meanings represented by results from different types of tests. Knowing how to use test results to inform instructional practice (e.g., diagnosing students’ learning status, providing individualized guidance, and optimizing instructional processes), and being able to apply learning theories (e.g., the zone of proximal development) to dynamically adjust assessment content and methods based on assessment results and students’ actual conditions. This also includes the ability to leverage measurement results from digital platforms to support personalized instruction and tutoring. |
| Skills | Test selection | Being able to select measurement instruments that are appropriate to students’ ability levels based on instructional content and assessment objectives. Being able to choose suitable test types and determine assessment administration procedures according to the assessment context. This also includes the ability to use intelligent assessment systems to select appropriate sets of measurement tools. |
| Skills | Test development | Being able to scientifically develop measurement instruments based on instructional content, assessment purposes, and students’ ability levels. This includes designing multiple-choice items, open-ended items, and contextualized tasks; using digital assessment tools to construct test items and intelligent platforms for automated test assembly; and collaborating with technical specialists to develop interactive tests and technology-enhanced items. |
| Skills | Test administration | Appropriately using measurement instruments, establishing assessment administration procedures, and conducting testing smoothly, while being able to handle unexpected situations to ensure proper implementation. This also includes familiarity with commonly used school-based online assessment systems and the ability to organize students to successfully complete online assessment activities. |
| Skills | Scoring | Being able to develop scoring rules and criteria, and to assign scores to students’ responses in accordance with established scoring standards while ensuring objectivity and fairness. This also includes the ability to use digital tools to improve the efficiency of scoring. |
| Skills | Data processing and analysis | Being able to effectively process and analyze collected assessment data in accordance with assessment purposes, or to use digital assessment platforms for data visualization. This also includes integrating assessment data with students’ background information, learner characteristics, and instructional data, and conducting in-depth analyses to generate evidence for improving teaching. |
| Skills | Data interpretation and application | Understanding the meanings represented by results obtained from different data processing methods and being able to interpret them correctly. Being able to comprehend the various assessment results reported by digital testing platforms. Using diagnostic results to optimize instructional processes, improve instructional design, and develop targeted teaching improvement plans. |
| Attitudes | Measurement perceptions | Being able to critically recognize the value of educational measurement, understanding its significance for diagnosing student learning and improving teaching, while also being aware of its limitations. Viewing digital and intelligent assessment critically, maintaining the central role of human judgment, and avoiding overreliance on digital or intelligent assessment tools. One should not blindly trust the data. |
| Attitudes | Measurement wellness | Willing to expand and update one’s knowledge and skills in educational measurement, proactively exploring and innovating assessment methods (including digital and intelligent assessments) to optimize assessment processes and enhance their functions. Actively seeking ways to apply assessment results to improve teaching. |
| Attitudes | Measurement responsibility | Paying attention to the ethical issues of educational measurement, including data security, personal privacy, fairness and validity of assessments, and the ethics of intelligent algorithms. Viewing measurement results comprehensively and objectively, and reporting them responsibly to relevant stakeholders. |
