Abstract
Interest in teachers’ assessment literacy (AL) is growing in China. However, research into the AL of senior high school teachers of English as a foreign language (EFL) is generally lacking. The present exploratory research attempts to probe the issue. A total of 257 teacher participants from secondary schools across China responded to a survey of AL predicated on the Standards. Rasch analysis revealed that the teachers met with great challenges in completing the survey. They performed relatively better on items and standards that were closely related to their daily work, for example administering and scoring tests, but worse on items that involved some form of specialized knowledge, for example reliability and performance data interpretation. Overall, teachers performed slightly better on test-related items than on assessment-related items. The association between AL and demographic variables seemed to be weak. The underperformance of the secondary teachers points, on the one hand, to a gap between the classroom assessment knowledge expected of them and the knowledge they actually possessed and, on the other, to a potential mismatch between the Standards and the actual competencies required for effective assessment practice in the Chinese context. The study has implications for secondary teachers’ AL development and training.
Introduction
Assessment literacy (AL) is important professional knowledge and competency that a teacher is expected to develop (Cui et al., 2025; Stiggins, 1991; Weng & Shen, 2022). Adequate AL could help enhance teaching and learning efficiency, align students’ goals and attributes with classroom activity, identify learner difficulty, and improve academic outcome (Gronlund & Linn, 1990). Meanwhile, teachers should also be aware of the possible negative impacts of invalid, inappropriate assessment practice (Stiggins, 1995).
Interest in research into AL is growing in China, which has developed a highly selective educational system that attaches great significance to tests (Qi, 2004). Assessing student performance takes considerable time away from the classroom instruction of teachers of English as a foreign language (EFL). These teachers strongly associated ‘assessment as a means of improving students’ EFL learning’ with ‘test and assessment as a way of cultivating positive moral and personal character contributing to students’ lifelong learning and good citizenship’ (Z. Gan et al., 2018). A good understanding of EFL teachers’ AL is therefore of great significance. However, compared with research in the western world, research in China has paid less attention to seeking empirical evidence and is hence of limited value for the design of AL training programs and educational reform (L. Zhang, 2018). The issue is of particular significance for senior high school teachers, who carry greater accountability because their assessment practice exerts tremendous impact on students engaged in intensive preparation for the college entrance examination (or Gaokao), the most important test in the lives of most senior high learners and their parents. The present exploratory research attempts to investigate the AL of this particular instructor cohort, whose assessment knowledge is under the joint influence of a set of sociocultural factors, including western assessment theories and practice as well as local institutional requirements and educational goals.
Literature Review
Teachers’ AL
AL has been included in the pedagogical content knowledge (PCK) as a component of the instructional knowledge base (Shulman, 1986). Stiggins (1991) maintained that assessment literate stakeholders should know what characterizes proper assessment and apply that knowledge to measures of learner achievement. To fulfill their professional responsibilities, teachers became obligated to perform professional assessment and adhere to the ethical principles of assessment, as mentioned in The Code of Professional Responsibilities in Educational Measurement (National Council on Measurement in Education [NCME] Ad Hoc Committee on the Development of a Code of Ethics, 1995). Assessing student performance takes considerable time away from teachers’ classroom instruction. Stiggins (1991, p.7) pointed out that “teachers spend a third to a half of their professional time involved in assessment-related activities.” As Mertler (2009) put it, “It impacts nearly everything that teachers do” (p.101).
Over time, researchers and educators have argued that the scope of AL research and training needs to be extended from standardized testing to classroom assessment (see DeLuca et al., 2016; Stiggins, 1991). However, studies have persistently revealed that teachers’ AL is insufficient for carrying out proper classroom assessment (e.g. DeLuca et al., 2016; Mertler, 2004; Mertler & Campbell, 2004; Plake, 1993; T.-H. Wang et al., 2008). The gap has prompted much speculation and effort among educators and researchers on how to improve the situation. An essential resource for this body of work has been The Standards for Teacher Competence in Educational Assessment of Students (hereafter the Standards; AFT et al., 1990). The Standards delineates teacher skills in the following seven areas: Choosing and Developing Assessment Methods; Administering, Scoring, and Interpreting Assessment Results; Using Assessment Results for Decision Making and Grading; Communicating Assessment Results; and Recognizing Unethical Assessment Practices.
This set of standards establishes goals for assessment education and remains an important authority and a seminal framework for research into teacher AL as each standard corresponds to a substantial area of assessment processes. It has also provided the basis for the development of important instruments for AL research across various disciplines and educational stages. These tools continue to be most often cited and used internationally (see DeLuca et al., 2016). One important example is the Classroom Assessment Literacy Inventory (CALI), which consisted of 35 content-based items (five per standard). The instrument has been used in AL research for various subjects, including language assessment.
Worldwide, the testing and assessment of language skills (where the language is a second, third, or additional language) have moved from periphery to center stage (Taylor, 2009). Language AL receives more and more attention from language educators and language assessment researchers. The concept of language AL was developed from discussions on AL in educational assessment in the early 1990s (AFT et al., 1990). It is therefore a subset of AL (L. Gan & Lam, 2022) and has received research attention from the field of language testing and assessment. It refers to the skills, knowledge, methods, techniques and principles needed by various stakeholders in language assessment to design and conduct proper assessment tasks and to make informed decisions according to assessment data (e.g. Fulcher, 2012; Inbar-Lourie, 2013; Taylor, 2009).
More than a decade ago, scholars identified the gap in research on language AL, noting that the field was still “in its infancy” (Fulcher, 2012). Research into language AL has been growing, and the topic is now probably past its infancy (Tajeddin et al., 2022). Notwithstanding the improvement, AL remains an area in need of further investigation, especially from a sociocultural perspective, which suggests that AL is a dynamic, developing process, mediated by interactions with other people and by the culture that teachers work in, and affected by deeply rooted beliefs and attitudes (Meijer et al., 2023).
To find out the areas in need of improvement in instructors’ AL, previous research mainly used questionnaire surveys with Likert-scale items eliciting respondents’ perceived importance of, or agreement with, a certain statement. The problem with this approach is that it tends to produce similar responses from participants. Specifically, respondents are inclined to say that everything presented in the questionnaire items is important or agreeable, resulting in little variation (Fulcher, 2012). The use of surveys eliciting teachers’ self-perception is also problematic, as the responses may be influenced by teachers’ over-confidence in their own assessment literacy (Lin & Su, 2015). To avoid these tendencies, researchers have chosen to use objective surveys, for example CALI, which is an important survey based on the Standards, as mentioned above, and is also used in the current research.
It should be noted that although the instrument was developed in a western context, its use in the Chinese context is feasible for the following reasons. The assessment theories of the west are well received in China. Specifically, the AL theories that originated in the US carry important implications for AL research in China (Li, 2015; Zhou & Lin, 2025). These theories and the related educational practice in the west, including the Standards, are frequently reviewed and drawn on for the development of theoretical frameworks in most, if not all, dissertations submitted for master’s (e.g. Shao, 2015) and doctoral degrees (e.g. Xu, 2017; Zhao, 2014) in the fields of education and applied linguistics in China. These studies share the goal of improving the assessment literacy of Chinese teachers and educators, for whom assessment forms an essential part of their entire course of professional development (Li, 2015; L. Zhang, 2018). More specifically, the Standards, and the instruments based on it, provide the basis for the development of various research instruments, including questionnaires, surveys, and interview protocols, in many academic publications in China. Use of the instruments by Chinese scholars has been reported in PhD dissertations (e.g. Xu, 2017) and published journal articles (e.g. Jiang, 2019; Zheng, 2010) in which Chinese school teachers were recruited to respond to the survey items. The adapted use of other AL questionnaire surveys developed out of western assessment theories has also been reported in research into English teachers’ AL in China (e.g. Z. Gan et al., 2018). Nevertheless, although applying western AL standards to Chinese teachers is justifiably feasible, potential tensions between the east and the west in theoretical frameworks and educational values are hard to avoid.
The above consideration reflects how contextualized the Chinese English teachers’ AL knowledge base needs to be under the influence of the different cultural contexts for assessment, which meshes with the recent conceptualizations of AL as socio-culturally situated practice. Researchers following this line of thoughts maintain that teachers’ language AL development could be reconceptualized as a sociocultural process of language assessment concept formation (Ngo, 2025), and a constant negotiation between teachers’ conceptions of assessment and the macro sociocultural, micro institutional contexts and expected knowledge base (Xu & Brown, 2016).
Studies of Teachers’ AL
Research has been conducted to examine teachers’ levels of mastery of assessment knowledge and the associations with their experience and background, and the effectiveness of AL training program.
Assessment of Teachers’ AL
In order to identify teachers’ strengths and weaknesses in AL, and then devise AL training plans based on the findings, researchers either probe teachers’ perceptions through self-report data or use objective surveys. The major advantage of the objective instruments is that, as mentioned above, they can give a clear view of the level of the candidates’ assessment knowledge, telling the researchers where the respondents are and where they should head. For example, in Plake’s (1993) study, the teacher participants answered more than 23 out of 35 items correctly on average in the Teacher Assessment Literacy Questionnaire (TALQ), which was also predicated on the Standards. They performed highest on Standard 3 – Administering, Scoring, and Interpreting the Results of Assessments, and lowest on Standard 6 – Communicating Assessment Results. Hardly 30% of the respondents answered correctly on five items, two of which came from Standard 5 – Developing Valid Grading Procedures. About 13% answered correctly an item addressing steps related to the reliability of test scores. The two remaining items with low performance came from Standard 7 – Recognizing Unethical or Illegal Practices.
The results of previous studies usually suggest that teachers’ AL was underdeveloped and that their skills in both assessment and testing were limited, including a lack of expertise in test construction (Mertler, 2004), inadequacy in defining criteria and giving feedback (Hasselgreen et al., 2004), and so forth. Williams (2015), for example, investigated the AL of 22 primary school educators with a questionnaire survey developed mainly within the framework of the Standards, which revealed these educators’ weakness in designing tests and in achieving consistency in achievement grading.
In China, Lin and Su (2015) investigated the AL of Chinese secondary English teachers and found that this population lacked sufficient assessment knowledge, for example test authenticity and self-assessment. Z. Gan et al. (2018) investigated the Chinese EFL teachers’ conceptions of assessment and their classroom assessment practices. According to the results, although some level of alignment of teaching and assessment existed in the classroom, EFL teachers considered traditional assessment practices, instead of alternative assessment (which also has western origin with the main theme of Assessment for Learning (Lee, 2007; Liu & Xu, 2017)) such as student self-assessment, as contributing to students’ learning and making individual and school accountable. This, according to the researchers, constituted a matter of concern if innovative assessment reform discouraging the use of mandated external examinations was to be held in the Chinese EFL classroom.
Effects of Experience and Background
Research has been conducted to examine the relation between AL and teacher membership defined by individual characteristics, for example career stages (pre/in-service teacher), teaching experience, and gender, and generated mixed findings.
Mertler (2003) found that his pre-service teacher participants answered 19 items correctly on average, while the in-service teachers averaged 22 items, which suggests that the needs of pre-service teachers were the more urgent to address. According to Mertler (2004), secondary in-service teachers outperformed secondary pre-service teachers on almost every subscale of the Standards. The observation led the researcher to question the effectiveness of assessment training in pre-service teacher education programs.
Xu (2017) adapted TALQ (Plake, 1993) and administered the survey to teachers of college English in China. The results showed that the teachers’ AL performance seemed to be unrelated to their demographic variables, for example gender, years of teaching, and professional title. This contradicted the findings of Jiang (2019), who used CALI, which consists of the same 35 items as TALQ (Mertler, 2004) with minor changes, to examine the AL of Chinese college teachers. The difference might be attributed to the instrument, which was used differently across the two studies: Xu (2017) retained 21 items from the original survey, while Jiang (2019) used all 35 items. The inconclusive findings suggest that more research, with consistent use of research tools, is needed to explore the relation between demographic variables and language teachers’ AL. Moreover, in Xu’s (2017) study, there was no connection between teachers’ previous assessment training and their AL level, a finding that differs from those of studies conducted in the western context (see the following section). The difference might be ascribed to, in addition to the inconsistency mentioned above, the immediacy and specificity of training. Xu (2017) used a questionnaire to explore the relation to prior assessment training without specifying any particular training, while the western studies examined the effects immediately after a particular training program or course (e.g. Mertler & Campbell, 2004, 2005; Plake, 1993). The contradiction might also be associated with cultural issues. As CALI was predicated on the Standards, which originated in the United States, it may cause misunderstanding when answered by people from a different culture, for example China, especially for items that touch on tensions between western and Chinese educational values and practices.
Therefore, due care should be taken and scrutiny exercised when adapting the instrument items for use to avoid tensions associated with cultural differences.
Effect of AL Training Program
Some AL research evaluated the influence of treatments or intervening variables, where a training course or program was usually involved. For example, in Li et al.’s (2023) research, the results provided evidence for the success of a professional development program in developing primary teachers’ formative assessment literacy in Hong Kong. O’Sullivan and Johnson (1993) evaluated the effect of a graduate-level course in measurement on teachers’ assessment competencies according to the Standards. A statistically significant increase was demonstrated in the teachers’ scores on a measure of assessment competencies. The researchers concluded that the performance-based nature of the course contributed to this score gain in assessment competence.
Similarly, in the Plake (1993) study, in-service teachers who had received coursework/training performed better on a test of AL than those who had not had such experience. The difference was statistically significant, notwithstanding it being less than one score point. In Mertler’s (2009) study on the effect of training, the Assessment Literacy Inventory (ALI) was used to pre-test and post-test teachers, who were also asked to keep reflective journals for recording their experiences. The training proved to be effective for the teachers, as reflected through a dramatic increase in post-test scores over pre-test scores and critical examination of their reflective journals.
Mertler and Campbell (2004, 2005) assessed the AL of pre-service teachers who had just finished a course in classroom assessment. These teachers answered approximately 68% of the ALI items correctly, higher than the figures recorded in previous studies. However, considering their recent participation in the coursework, their performance was deemed to be far less satisfactory than might otherwise have been anticipated. The researchers attributed the gap between the anticipated and actual AL performance of the teacher participants to their limited classroom experience. Doubt was also cast on some TESOL programs for pre-service teachers because of their insufficient language assessment content (Jeong, 2013).
Similar concern was expressed for in-service teachers. In a large-scale survey in which in-service teachers were asked about their perceived level of preparedness to evaluate student learning that could be specifically ascribed to their teacher preparation programs, more than 85 percent of the teachers reported that they were unprepared (Mertler, 1999). However, when asked about their current level of preparedness, over half of the respondents believed that they were adequately prepared to assess student learning achievement. Crusan et al. (2016) found that their in-service teacher participants had received only limited language assessment training opportunities. While illustrating the ineffectiveness of teacher preparation programs on the one hand, these results potentially suggested an on-the-job influence on teachers’ development of assessment skills, as opposed to the influence of structured environments such as training courses or programs. This suggestion echoes the inconsistent findings in the previous section regarding the effects of years of teaching and professional titles, which are closely related to on-the-job influence. Given the inconclusiveness of the findings and of researchers’ conjectures in this regard, more research efforts are needed. This became part of the objective of the present study.
In contrast to the volume of studies in the west, research into foreign language teachers’ AL is an emerging area in China, and there are very few published studies employing objective surveys to examine the issue. To our knowledge, no research of this kind has been conducted with senior high school English teachers in China. The need to address this important void becomes acute against the backdrop of the new reform and development in foreign language education in China, which attaches great significance to formative assessment (Liu, 2017, 2018; Q. Wang, 2015; S. Wu, 2019). The present exploratory research attempts to fill the gap by examining the AL of senior high school teachers of English. Previous research has indicated that teachers’ assessment practice and training needs can be influenced by the teachers’ individual characteristics, including career stage, and by their local instructional context, which can in turn be influenced by institutional mandates and educational policies about teaching and assessment (Yan et al., 2018). Thus, the following two questions are framed to guide the research:
(1) What are the characteristics of Chinese senior high school English teachers’ AL?
(2) What is the relationship between teachers’ AL and their individual characteristics, specifically, career stage and local instructional context?
Methods
Participants
A group of 257 teachers from different schools across China responded to the survey. The participants can be regarded as a convenience and snowball sample, as they were recruited through the help of friends and friends’ friends. First, the author contacted an associate of his who was working with senior high school English teachers in a city in northern China. She helped, and asked for help from her associates, to distribute the online survey to the 257 teacher participants, who gave their responses within 10 days, mostly by mobile phone. Oral consent to participate in the research was obtained from the participants, and due care was taken to protect their privacy and confidentiality. Teachers’ career stage and local institutional context mentioned in RQ2 were operationalized as job title and school type. Of this participant group, two teachers held an advanced senior job title and one was from a foreign language school. These numbers are too small to form demographic subgroups for statistical analysis. Therefore, the three teachers were regarded as “outliers” in the grouping variables of job title and school type and were excluded from the follow-up data analysis. The demographic information of the participants is set out in the following two tables (Tables 1 and 2).
Job Title Descriptive Statistics.
Note. N = 254.
School Type Descriptive Statistics.
Note. N = 254.
Instrument
The CALI
The study used the CALI (Mertler, 2003, 2004) to measure what teachers knew about the prescribed competencies and to identify their strengths and weaknesses in AL. In spite of the limited evidence of psychometric property of CALI (Fulcher, 2012), as well as the other available AL measures (Gotch & French, 2014), the CALI is by far the most widely used AL measure. Thus, at present the CALI might be the best instrument for the current research. Developed as application-type questions that are “realistic and meaningful” to teachers’ actual practices (see DeLuca et al., 2016; Mertler, 2004, p.53), the survey items have gone through extensive content validation and pilot testing (Mertler & Campbell, 2004). The survey was translated into Chinese by using back-translations as a way to achieve the accuracy of translations of the English statements into the participants’ mother tongue. Then it was distributed through Wen Juan Xing (问卷星), a popular electronic platform that specializes in questionnaire survey and offers convenient operation on mobile phone in China.
Instrument Validation
To enhance applicability in the Chinese context, the survey items were checked for content validity through review by a panel of four members, including the author and three language assessment experts from Foreign Language Teaching and Research Press, a prestigious publishing company in China. The resulting inventory contained 32 items after 3 items were excluded due to their inappropriateness related to differences in educational and political system between China and the west. For example, in China it is the responsibility of school authority and governmental agencies to take care of the test material, while in America it is kept by testing companies. As a result, item No. 31 was removed. Please see Table 3 for the survey structure.
Alignment of the Standards with Respective CALI Items.
Rasch analysis of the CALI data produced statistical evidence for the validity and reliability of the measure (Table 4) (Dawson et al., 2024; Mendoza et al., 2022). The separation index is an indicator of the validity of the measure, with values higher than 2 regarded as acceptable (Linacre, 2011). The value for the CALI survey is 6.44, meaning that the seven standards can be separated into about six levels of difficulty. This is reasonable and desirable, as each of these standards measures a distinct dimension of AL with a different number of items, and it indicates that the survey can place respondents into six groups according to their levels of mastery of AL knowledge.
Standards Measurement Report.
Note. Separation: 6.44; Reliability: 0.98; Fixed chi-square: 246.6, df: 6, significance: .00.
The reliability index is 0.98, which suggests that the standards are reliably distinguished across different levels of difficulty. The difference between the difficulty of these seven standards is statistically significant (χ2 = 246.6, df = 6, p < .01). The fit indices for the seven standards are within the range of good fit (mean ± 2SD) as proposed by McNamara (1996). They closely cluster around the expected value of 1 within a range of 0.12. These indicate that (1) the rating patterns for each of the seven standards are close to those expected by the Rasch model; (2) in terms of the measurement dimension constructed by the analysis, it makes sense to add the scores from the different standards together; and (3) scores in the standards are making independent contributions to the underlying measurement dimension. In that sense the survey can be said to have been validated (Bond et al., 2020; McNamara, 1996).
Research Design
The review of related literature suggests that most previous AL research employed classical test theory for data analysis. Very few studies have applied modern test theory such as item response theory (IRT), which conceptualizes the expected performance of individuals on a test item as a function of their ability and the difficulty of the item. This may serve as an advantage over classical test theory and offer more insight into AL investigation. As a result, the data were analyzed based on an IRT model. Specifically, a many-facet Rasch measurement (MFRM) approach using FACETS 3.58 was employed for analyzing the participants’ AL. Four facets were included: participants, job titles, school types, and assessment literacy. The participant facet included the 254 participants. The title facet consisted of three job ranks: entry-level, intermediate, and advanced title. The school facet included the two types of school where participants were working: ordinary school and key school. For the assessment literacy facet, either the survey items or the standards went into the analysis. FACETS calibrated the participants, titles, schools, and assessment on the same equal-interval scale (i.e. the logit scale), where higher Rasch measures indicated that participants had higher ability and that the AL items/standards were more difficult. All the items were treated as dichotomous, with 1 for a correct answer and 0 for an incorrect response. Each subscale of the standards had a full score of 3 to 5 points (Table 3).
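For reference, the dichotomous Rasch model that underlies this kind of analysis can be written as follows. This is the standard textbook form rather than a specification reported by the study; the symbols \theta_n (ability of participant n) and b_i (difficulty of item i) are notation assumed here for illustration. A many-facet model adds further terms to the logit for the additional facets, and the exact FACETS specification used is not given in the text.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
\qquad
\ln \frac{P(X_{ni} = 1)}{P(X_{ni} = 0)} = \theta_n - b_i
```

Both \theta_n and b_i are expressed in logits, so a participant whose ability equals the difficulty of an item has a 50% chance of answering it correctly.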
Results
The Participants’ Performance on the Survey
The descriptive data on the participants’ performance on the CALI survey are summarized in the following table (Table 5), accompanied by a histogram (Figure 1). The straightforward message is that the participants answered approximately 14 items correctly on average.
Descriptive Statistics for CALI Performance Data.

Histogram of teachers’ CALI performance.
MFRM analysis results show that the average person ability was −0.40 logits (SE = 0.41), while the average item difficulty was 0.00 logits (SE = 0.15). On average, then, person ability was about 0.40 logits below item difficulty, which indicated a poor match between the participants’ ability and the questions. Thus, overall, the items were more difficult than the participants could handle.
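To give a sense of what a 0.40-logit gap means, the sketch below evaluates the standard dichotomous Rasch formula at the two averages reported above. It is an illustration only: the function name `rasch_probability` is hypothetical, and the calculation ignores the other facets in the study's many-facet model.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch model.

    Both arguments are in logits. This is the textbook formula, not the
    study's full many-facet specification.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


mean_ability = -0.40     # average person measure reported in the text
mean_difficulty = 0.00   # average item measure reported in the text

p_correct = rasch_probability(mean_ability, mean_difficulty)
print(f"Expected probability of success: {p_correct:.2f}")  # prints 0.40
```

Under these assumptions, a participant of average ability facing an item of average difficulty would be expected to answer correctly about 40% of the time.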
The ability of the participants ranged from −2.40 logits (SE = 0.57) to 0.81 logits (SE = 0.42), a span of 3.21 logits. The separation index was 1, and the reliability for that separation was 0.50, meaning that the participants belonged to the same group. The chi-square test was statistically significant (χ2 = 453.8, df = 253, p < .001), which means that although the participants belonged to the same ability group, they were at different levels of AL.
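The separation and reliability values reported in this paper are linked by the standard Rasch relationship R = G² / (1 + G²), where G is the separation index. The sketch below applies that relationship to the two separation values given in the text; the function name is hypothetical, and FACETS itself computes these indices from the estimated measures and their standard errors.

```python
def reliability_from_separation(separation: float) -> float:
    """Convert a Rasch separation index G into a reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


# Separation values reported in the text
person_reliability = reliability_from_separation(1.0)      # participants
standards_reliability = reliability_from_separation(6.44)  # seven standards

print(round(person_reliability, 2), round(standards_reliability, 2))  # 0.5 0.98
```

Both results agree with the reported reliabilities of 0.50 for the participants and 0.98 for the standards.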
The difficulty of the items covered a 5.03-logit span. The two least difficult items were the first item in the inventory, with a measure of −2.29 logits (SE = 0.18), and the 12th item, with a measure of −2.02 logits (SE = 0.18), while the 27th and the 7th were the most difficult, with measures of 2.74 logits (SE = 0.30) and 2.30 logits (SE = 0.25), respectively. The reliability index was 0.98, and the chi-square test was statistically significant (χ2 = 1342.6, df = 31, p < .001), indicating that the items differed significantly in difficulty.
The participants’ performance on the standards was also estimated. As the seven standards did not have the same number of categories, the participants’ performance is presented as the percentage of correct responses, shown in Table 6 and graphically in Figure 2. The standards measurements estimated by FACETS are displayed in Table 4 above, which corroborates the information presented in Figure 2. They show that the participants performed best on Standard 3 and worst on Standard 6.
Participants’ Performance Across the Standards.

Participants’ performance across the standards.
Effects of Demographic Variables
FACETS yielded results on the effect of the demographic variables, job title and school type, on CALI performance. As Table 7 shows, the participants holding an intermediate title had the lowest measure (−0.72 logits), while those with an entry-level title had the highest measure (−0.24 logits). Participants with an advanced title had a middle measure (−0.36 logits). The reliability of the measures was 0.00, and the chi-square test was not significant (χ2 = 1.4, df = 2, p = .50), suggesting that there was little difference among the measures. In other words, job title exerted little effect on participants’ performance in the AL survey.
Job title Measurement Report.
Note. Separation: 0.00; Reliability: 0.00; Fixed chi-square: 1.4, df: 2, significance: .50.
School type also made no significant difference to CALI performance, as revealed by the analysis results (Table 8), though the ordinary schools (0.07 logits, SE = 0.30) had a greater measure than the key schools (−0.07 logits, SE = 0.22). The reliability was 0.00, and the chi-square test was not significant (χ2 = 0.2, df = 1, p = .69).
School Type Measurement Report.
Note. Separation: 0.00; Reliability: 0.00; Fixed chi-square: 0.2, df: 1, significance: .69.
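The significance levels in these two reports can be approximated from the printed chi-square statistics. The sketch below uses closed-form survival functions that hold only for 1 and 2 degrees of freedom, so it needs nothing beyond the Python standard library; the function name is hypothetical. Because the statistics are printed to one decimal place, small differences from the reported values are to be expected.

```python
import math


def chi2_p_value(statistic: float, df: int) -> float:
    """Upper-tail p-value of a chi-square statistic for df = 1 or df = 2 only."""
    if df == 1:
        # For df = 1, P(X > x) = erfc(sqrt(x / 2))
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        # For df = 2, P(X > x) = exp(-x / 2)
        return math.exp(-statistic / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


p_job_title = chi2_p_value(1.4, 2)    # job title report: chi-square 1.4, df 2
p_school_type = chi2_p_value(0.2, 1)  # school type report: chi-square 0.2, df 1

print(f"job title: p = {p_job_title:.2f}")
print(f"school type: p = {p_school_type:.2f}")
```

The job title result reproduces the reported .50. The school type result comes out somewhat below the reported .69, which may simply reflect the statistic having been rounded to 0.2; in either case the test is far from significant.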
Discussion
RQ1: What Are the Characteristics of Chinese Senior High School English Teachers’ AL?
Overall Performance
Overall, the survey results suggest that, similar to the findings of previous research (e.g. Lin & Su, 2015), there is much room for improvement in the assessment knowledge of Chinese school teachers of English. On the AL survey, which had a full score of 32 points, the participants’ average score was far below the conventional cut-off of 60% of the full score, 19.2 points in this case. Not surprisingly, Rasch analysis revealed a poor match between participant ability and item difficulty. Thus, this cohort of school teachers seemed to possess inadequate knowledge about the seven assessment standards described in the survey, a persistent pattern observed in previous research (e.g. Jiang, 2019; Mertler, 2004; Plake, 1993; Williams, 2015; Xu, 2017). What’s more, the performance of Chinese school teachers in the AL survey was much lower than that of their Western counterparts, who responded to the English version of the 35-item survey and had been described as “not adequately prepared” (Mertler, 2009, p.103). However, it should be pointed out that the low scores observed in the present research may not exclusively reflect genuine deficits in the participants’ competency. Such scores could also be attributed to participant homogeneity arising from the convenience sampling. The performance may also have been affected by potential limitations in the measurement instrument pertaining to the nature, content, and style of the survey, which was developed based on the standards of a culture whose educational and sociopolitical system is distinctly different from China’s, despite the justification for its use given in the literature review section.
Areas Posing More Challenges
The participants met with great challenges on some items and standards in the survey. The 27th item proved to be the most difficult. Out of the 257 participants, only 12 (4.67%) gave the correct response. Option D was a strong distractor, attracting 158 responses (61.48%). The reason may be that, apart from summative assessment, which is most often referred to in the Chinese educational context, the other assessment types, for example assessment for learning (Lee, 2007), are seldom heard of among school teachers, a point already put forward by researchers (e.g. Z. Gan et al., 2018). The findings related to the relative difficulties of the questionnaire items are supported by the early observation of Rogers (1991), in which teachers indicated that they cared more about the day-to-day issues related to the application of assessment processes and less about fundamental measurement principles. Without formal instruction in the related assessment knowledge, school teachers would be unlikely to understand the differences among these fundamental concepts.
Item No. 7 was the second hardest, with only 18 (7%) correct responses. The item aimed at evaluating knowledge about how to improve test reliability. Teachers’ performance on this item was worse than that of their counterparts in Plake’s (1993) research, where 13% answered correctly. According to Davies et al. (2002, p.168), reliability refers to “the actual level of agreement between the results of one test with itself or with another test.” One of the (three) methods of enhancing reliability is lengthening the test. Most participants (187) thought A was the correct answer: Use a blueprint to develop the test questions. The reason might be that most participants tended to develop tests from available sources and had little experience of writing items. They may have thought that developing items from a blueprint was a rule of thumb, although a blueprint is a method of improving validity rather than reliability. Teachers may have a vague understanding of validity and reliability as well as of the difference between them. Although validity and reliability are important dimensions of language assessment, both concepts are difficult to comprehend without sufficient instruction and training.
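The response percentages for the two hardest items follow directly from the reported counts and the sample size of 257. The short sketch below recomputes them; the labels and the helper name are illustrative assumptions, and only the counts come from the text.

```python
TOTAL_PARTICIPANTS = 257  # sample size reported in the study

def percent(count: int, total: int = TOTAL_PARTICIPANTS) -> float:
    """Share of participants, as a percentage of the full sample."""
    return 100.0 * count / total

# Response counts reported for the two most difficult items.
response_counts = {
    "Item 27, correct response": 12,
    "Item 27, option D (distractor)": 158,
    "Item 7, correct response": 18,
    "Item 7, option A (distractor)": 187,
}

for label, count in response_counts.items():
    print(f"{label}: {count}/{TOTAL_PARTICIPANTS} = {percent(count):.2f}%")
```

In both items the most popular distractor attracted well over half of the sample, while the key was chosen by fewer than one participant in ten.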
Standard-level analysis indicates that participants met with great difficulty on Standard 6 and Standard 5 (see Table 4 and Figure 2), similar to Mertler’s (2004) observation. Moreover, both Plake (1993) and Campbell et al. (2002) found that teachers scored lowest on Standard 6, which focuses on communicating assessment results. Poor performance on this standard suggests that teachers may have a vague understanding of scores. They tended to take scores at “face value” while ignoring the meaning behind them, a persistent pattern observed early by Stiggins (1991), who wrote: “assessment illiterates accept achievement data at face value and can easily be intimidated by apparently technical information and by a complicated presentation of test scores” (p.535). The point can be illustrated by their responses to item 26 (0.27 logits), item 28 (0.77 logits), and item 30 (1.26 logits). Most participants (50.19%) chose B as the correct answer for item 26, which showed that teachers tended to believe higher scores equaled higher ability. They neglected students’ rank in a group, and may have had difficulty in comparing different test modules and in linking rank with score. Our findings also suggest that teachers tended to downplay the impact of social factors on students’ scores, for example socioeconomic status, minority groups, etc. (item 27), which reflected the central concern of teachers, administrators, parents, and students about learning outcomes.
The finding concerning the relative difficulty of Standard 6 over Standard 3 corroborated the results of prior studies (e.g. Campbell et al., 2002; Plake, 1993) and Brookhart’s (2001) synthesis of previous research, which suggested that teachers performed better at classroom applications than at interpreting test results, an outcome that was likely connected with the nature of their work. The finding also recalls the results of the EALTA survey (Hasselgreen, 2013) in relation to the inferred difficulty of teachers in giving feedback, which often entailed communicating assessment results to stakeholders.
Standard 5 dealt with developing grading procedures. For the hardest item (No. 22, 1.85 logits), the largest share of teachers chose D (39.3%), while A was the key. As the item was concerned with equating measures of assessment, the result seemed to indicate that teachers had difficulty in understanding what “weight” meant in assessment. The participants’ responses were scattered over the four options of item 21 (0.34 logits), which involved developing a reliable grading method. Overall, the difficulty that the participants met on Standard 5 echoed Brookhart’s (2001) and Mertler’s (2004) observation that teachers lack expertise in test construction. Similar findings also came out of Hasselgreen et al.’s (2004) investigation, in which defining criteria was identified as one of the most difficult areas even for teachers who were supposed to be language specialists.
Areas Posing Fewer Challenges
The participants met with fewer challenges on item 1 (−2.29 logits, SE = 0.18) and item 12 (−2.02 logits, SE = 0.18), which were concerned with selecting appropriate assessments and scoring methods, and which more than 80% of teachers answered correctly. The finding was in keeping with Plake’s (1993) observation as described in the literature review section. The reason these two items were easy might be that they were closely related to teachers’ routine work. For item 1, it might have been common practice for teachers to put the students at the center in designing the assessment methods. Moreover, the distractors were weak, as they contained words like “ease” and “administration,” which are often associated with “negative” meaning in Chinese. For item 12, as a reflection of the prominent test culture in China, it may have been usual practice for Chinese school teachers to grade essays against an “authoritative” scoring rubric, which, in most cases, was a slight variation of the one used in Gaokao.
Participants encountered the “least” challenge on Standard 3 (−0.62 logits), which contained the least difficult item No. 12. Standard 3 reads: The teacher should be skilled in administering, scoring and interpreting the results of both externally-produced and teacher-produced assessment methods. Again, the statement was a realistic reflection of school teachers’ routine tasks. Standard 3 was also scored the highest by the teacher participants in Plake’s (1993) study and the in-service teachers in the Mertler (2004) study, suggesting a common routine practice in the daily work of teachers across the two cultures. Plake (1993) observed that teachers spent up to 50% of their time on assessment-related activities. While this amount is likely to have increased since then, Chinese school teachers of English may spend most of that time administering, scoring, and interpreting test results, given the strong test culture in the country. Standard 7 was the second least difficult; its items were also related to the nature of teachers’ work and some common-sense knowledge. For example, item No. 32 was concerned with learner privacy, and item No. 34 was about cheating as inappropriate assessment practice. Hence it was natural that this standard did not pose much of an impediment to the participants.
To sum up, the discussion of the challenges posed to the participants needs to be couched in a theoretical view of assessment literacy as situated practice rather than universal knowledge. The above patterns reflected the realities of Chinese teachers’ assessment responsibilities within a test-driven system, where certain standards aligned with institutional demands while others remained peripheral to daily practice. What’s more, the patterns need to be reconsidered and reanalyzed within a theoretical framework, currently unavailable, that specifies whether all seven standards should be weighted equally in the Chinese context or whether some represent more critical competencies than others given local institutional requirements and educational goals, while taking into consideration the shift from summative to formative assessment in the current educational reform in China.
RQ2: What is the Relationship Between Teachers’ AL and Their Individual Characteristics, Specifically, Career Stage and Local Instructional Context?
FACETS reports of the title and school measures suggested that, in keeping with the findings of the Xu (2017) study, neither the participants’ job title nor their school type had an effect on their CALI performance, as the results of the chi-square tests were not statistically significant (Tables 7 and 8).
The observation that the type of school, key or ordinary, where teachers were working did not influence their general AL performance might arise because, under a centralized system, Chinese schools, regardless of their status, share similar characteristics with regard to structured environments such as course syllabi and in-service training and workshops. This finding differs from previous research that observed variations across teacher groups from different institutional contexts at tertiary level (Jiang, 2019; Yan et al., 2018). One possible cause of this difference is educational level. Compared with middle schools, universities and colleges differ more in institutional context, which can be affected by distinct campus culture, educational mandates, and institutional policies about teaching and assessment.
Job title generally relates to years of service and experience, which are often assumed to relate positively to assessment competence; that is, the more years and experience teachers have in teaching, the better their AL would be. However, this assumption found little support in our findings, which suggested that assessment knowledge did not correlate with working experience or professional seniority for school teachers of English. To some extent, this finding was in keeping with some studies (e.g. Lin & Su, 2015; Xu, 2017), but it contradicted the proposal of some researchers (e.g. Mertler, 1999; Z. Zhang & Burry-Stock, 1997) that teachers tend to develop assessment knowledge on the job, as well as previous findings of relatively stronger performance by in-service teachers than by pre-service teachers (e.g. Campbell et al., 2002 vs. Plake, 1993). Thus, similar to previous studies (e.g., Crusan et al., 2016), our finding casts some doubt on the in-service training that the secondary teachers receive. Another possible cause for the absence of effect across the demographic variables on teachers’ performance might be that the 32 items were treated as dichotomous items, which may have resulted in little variation in the survey responses.
Conclusion
The present research examines the AL of Chinese senior high school English teachers, a “high-stakes population” who prepare students for Gaokao. The results revealed that Chinese school teachers answered only a small number of items correctly in the survey. They performed relatively better on items and standards that were closely related to their daily work than on those that involved some form of specialized knowledge, for example reliability, performance data interpretation, and so forth. Overall, teachers performed slightly better on test-related items than on assessment-related items. On the one hand, the findings indicated deficits in AL competency; on the other, they might reflect a mismatch between the Standards and the actual competencies required for effective assessment practice in Chinese secondary schools. The selected demographic variables seemed to have little bearing on the participants’ performance on the survey.
Implications
Informed by recent conceptualizations of AL as socio-culturally situated practice, the current research adapted the CALI for use in the Chinese context. With empirical support from FACETS analysis for its validity and reliability, the adapted version of the survey constitutes a potential theoretical contribution to the field. Future research may use this survey to gain more insights into teachers’ AL development for more informed educational and institutional decisions. For the present study, findings gleaned through this adapted instrument hold implications in the following respects. Given the research findings, it is proposed that school teachers reexamine their beliefs and practices and understand that their own deeply rooted conceptions of assessment, developed from their experiences with classroom assessment, which usually involves standardized tests used to evaluate students rather than to inform teaching and learning (e.g. Gunn & Gilmore, 2014; Smith et al., 2014), may constitute a significant barrier to developing AL (Quilter & Gallini, 2000). School teachers should be directed to understand that assessment is an integral part of teaching and learning (McMillan, 2001) and to see the vital connection between them, making assessment more applicable to their daily work and views of education. Training in the theories, concepts, and techniques of classroom assessment is essential; its effects have been demonstrated in previous research as reviewed in the present study. The essential assessment knowledge should be communicated to teachers through careful and thoughtful design of courses, programs, and even research projects in which teachers’ participation is encouraged (Lin, 2019; Z. Wu, 2021). At regular intervals, examination and research should be conducted into how well the assessment knowledge and skills are developed, to help teachers better assume their responsibilities for classroom teaching and learning before and after they work with their students.
The limited AL knowledge of teachers revealed in the current research and various other studies prompts us to examine what is appropriate and essential for the AL development of language teachers. It should be noted that school teachers are not supposed to be language assessment experts, nor are they expected to play such a role, especially given the myriad commitments they have to meet with respect to teaching and training. As such, they should not be expected to reach the AL level required of assessment experts or to keep abreast of the increasing professionalization of the field, which has led to the continuous generation of standards, ethical codes, and guidelines for good testing practice (Taylor, 2009). Therefore, before designing teacher training programs and evaluating teachers’ assessment knowledge, we should first reflect on how to set an appropriate standard of knowledge and provide sufficient scaffolding that matches teachers’ practical needs in AL, based on the specific instructional requirements and educational contexts.
The finding that no significant difference was observed among teachers across different professional titles and different types of school merits attention. On-the-job influence should have been present, as conjectured by researchers (e.g. Crusan et al., 2016). Its absence signifies a weak association between AL and EFL teachers’ professional title, which is problematic, as teachers holding advanced professional titles are supposed to have a higher level of classroom assessment knowledge. Given the importance of AL for teaching, it may be reasonable to suggest that AL competency be evaluated when teachers apply for a promotion in professional title. The same holds true for school types, as key schools generally have higher requirements for the recruitment and evaluation of teachers than ordinary schools. In order to justify the attention and resources that the government and society devote to the key schools, it is suggested that the authorities concerned invest more to improve the AL of the EFL teachers in key schools.
Limitations and Directions for Future Research
The study is exploratory in nature, and caution should therefore be exercised in interpreting and generalizing its findings. The present research suffers from several limitations that leave much to be improved in future endeavors. The first limitation pertains to convenience sampling. The sampling relied heavily on the researcher’s social network, which concentrated the sample in some places while leaving other regions and rural areas in China underrepresented. This sampling method also engendered network homogeneity bias. Participants were connected via friends and likely shared similar teaching philosophies, resource access, and assessment literacy competence, which excluded teachers outside this circle. As a consequence, the representativeness of the participants was severely restricted and the sample size was rather small, which limits confidence in the specific statistical results and restricts the generalizability of the research findings. For example, the entry-level group contained only 59 teachers, raising questions about the stability of estimates and the statistical power to detect genuine effects. These flaws mean that the findings cannot be generalized to all Chinese teachers, nor even to broader senior high school English teacher groups beyond the northern network. Findings and implications derived from this restricted sample risk misaligning with teachers’ needs at a more general level. Given that China is a vast country, it is suggested that future research use purposive sampling, so that the sample has balanced coverage of the major regions within the country, which may help improve the generalizability of the research.
Second, only two individual characteristics (career stage and school type) were selected for the analysis of their relations to AL knowledge, which, again, led to under-representation of the variables of interest and limited the significance of the study. Future research may take into consideration other individual characteristics, preferably psychological characteristics such as goal orientation and personality, to delve further into the topic. Third, the study relied solely on one survey followed by quantitative analysis to seek evidence, which limited the reliability of the findings and narrowed the scope of the research. Future research may go beyond survey-only methodology and employ mixed-methods approaches with, for example, observations, interviews, document analyses, or ethnography to obtain data from diversified sources for more sophisticated analysis.
Footnotes
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
