Abstract
The reliability and validity of three curriculum-based measures as indicators of learning English as a foreign language were examined. Participants were 260 Dutch students in Grades 8 and 9 who were receiving English-language instruction. Predictor measures were maze-selection, Dutch-to-English word translation, and English-to-Dutch word translation. Criterion variables were years of English instruction, school level, course grades, and scores on a standardized reading test. Different scoring procedures and time frames were compared. Alternate-form reliabilities ranged from .44 to .88. Significant differences in maze scores were found between school levels but not between years of English-language instruction. Correlations between predictor and criterion variables ranged from .19 to .79. A regression analysis revealed that a combination of maze and English-to-Dutch translation predicted English course grades better than a single measure alone.
In foreign language (FL) learning, a new or different language is being learned in a context where the mother tongue is regularly spoken (Verspoor, De Bot, & Van Rein, 2010). Learning a FL is obligatory in many European countries (Eurydice, 2001), where nearly all students, including those with learning disabilities (LD), complete at least one language course prior to graduation from high school. Although in American schools, students with LD often can substitute a FL course with a nonlanguage course (Sparks, Javorsky, & Philips, 2005), these students may once again face difficulties in college where FL learning often is compulsory (Skinner & Smith, 2011).
Learning a FL can present significant challenges for students with LD because their difficulties typically are language based (Hallahan, Lloyd, Kauffman, Weiss, & Martinez, 2005), affecting performance in reading, writing, listening, and spelling (Hallahan et al., 2005). Given adequate help, however, students with LD often can succeed in FL learning (Sparks, 2006). The design of specialized FL programs for students with LD would be enhanced if they were to include methods for evaluating the effects of the program on individual student learning. One such method is Curriculum-Based Measurement (CBM).
CBM is a progress-monitoring system designed to be used by educators for monitoring student progress and judging the effectiveness of instructional programs (Deno, 1985). To date, there has been little to no work conducted on the development of CBM progress measures in FL learning. A small number of studies have examined the technical adequacy of CBM for students who are learning a second language (for an overview, see Sandberg & Reschly, 2011), but these studies have focused mainly on bilingual students who are learning English as a second language at the elementary school level and who are living in an English-speaking country. The results of these studies show promising possibilities for bilingual students (Sandberg & Reschly, 2011); however, these positive results cannot be generalized to students learning a FL. Students learning a FL do not necessarily have the advantage of living in an environment where the FL is regularly spoken and they usually learn the language in secondary school or higher education.
The goal of the present study was to develop CBM progress measures for FL learning for secondary school students. Because little previous research exists to guide the selection and development of measures for FL progress monitoring, we draw from research on the assessment of FL skills and research on CBM progress monitoring in native language reading.
Research on Assessment of FL Learning
Research on the assessment of FL learning has focused on various language domains such as vocabulary/word knowledge, reading, listening, speaking, writing, and grammar (Alderson & Banerjee, 2002). Most research on FL assessment has focused on the domains of vocabulary/word knowledge and reading. Other areas of language proficiency have been researched less often, perhaps because it is more difficult to determine relevant features of those domains (Alderson & Banerjee, 2002).
One of the initial steps in learning a FL is the development of word knowledge (Wang, 2011). Word-knowledge proficiency has been assessed in several different ways. One type of measure is a word-translation measure, where students translate words from the native language to the FL or the FL to the native language. Translation tasks have been found to be more meaningful than other types of vocabulary tasks in the initial period of FL learning because the learner understands foreign words better as translations than as synonyms or descriptions in the FL (Nation, 1982).
Once students have a core of vocabulary words available in the FL, they can begin reading text (Wallace, 2008). Opinions differ as to precisely how students learn to read a FL (for an overview, see Alderson & Banerjee, 2002) and several different measures have been used to assess FL reading skill. One of these measures is a cloze test (Alderson & Banerjee, 2002). In a cloze test, a fixed number of words are deleted and replaced by a blank, and students fill in the blanks as they read (Taylor, 1956). In a modified-cloze test, the deleted word is replaced with a multiple-choice item instead of a blank, and students select the correct word as they read. Each multiple-choice item consists of the correct word and a number of distracters. The distracters are semantically or lexically comparable with the target word (Cranney, 1972–1973). The modified-cloze test has been found to have good reliability and validity as a measure of reading comprehension in English as a FL across nine different language groups (Hale et al., 1989). Reliability coefficients ranged from .79 to .89 (although the authors did not specify what type of reliability was calculated). Correlations between the modified-cloze score and scores on the Reading Comprehension subtest of the TOEFL (Test of English as a Foreign Language) ranged from r = .67 to .78.
Research on CBM Progress Monitoring in Reading
In the CBM research, a measure similar to a modified-cloze has been used to measure reading progress in students’ native language. This measure typically is referred to as a maze or maze-selection task (see Fuchs & Fuchs, 1992; Wayman, Wallace, Wiley, Tichá, & Espin, 2007). Although the maze is similar to the modified-cloze used in FL assessment, the two measures differ in two ways. First, the CBM maze-selection task is timed. Students work for a fixed amount of time (e.g., 2 min) and the number of correct answers selected in that time is scored. Second, the distracters in the CBM maze-selection task are designed to be clearly incorrect, and are semantically and syntactically different from (rather than similar to, as in the modified-cloze) the correct choice (Fuchs & Fuchs, 1992).
Research has supported the use of a maze-selection measure as an indicator of general reading proficiency in the native language (Wayman et al., 2007). At the secondary school level, alternate-form reliabilities for maze-selection ranged from r = .75 to .96, and correlations with other measures of reading performance ranged from r = .76 to .88 (Espin, Wallace, Lembke, Campbell, & Long, 2010; Tichá, Espin, & Wayman, 2009). In addition, maze-selection has been found to produce reliable and valid growth trajectories over time (Espin et al., 2010; Tichá et al., 2009; Tolar et al., 2012).
Selection of Potential FL Progress-Monitoring Measures
Based on the types of measures typically used in the assessment of FL proficiency and in CBM reading progress measurement, we decided to examine maze-selection and word-translation tasks as potential progress measures for FL learning. Because we were interested in developing ongoing progress-monitoring measures, we adopted the maze-selection task used in CBM research rather than the modified-cloze approach used in FL assessment research.
For the word-translation tasks, we included both foreign-to-native and native-to-foreign language translation tasks. A foreign-to-native language translation task requires students to recognize the FL term, and then produce it in their native language. One might expect this task to be easier than the native-to-foreign language task, which requires students to recall and produce the foreign language terms. The two might function differently at different levels of language proficiency. Given that little research has been done on the development of FL progress measures, we examined differences in reliability and validity for various time frames and scoring procedures.
Purpose and Research Questions
The purpose of this study was to examine the reliability and validity of three potential CBM measures as indicators of FL proficiency: maze-selection, Dutch-to-English word translation, and English-to-Dutch word translation. Because this study was one of the first to examine CBM measures for FL learning, we cast our net wide in our first set of analyses, and examined the reliability of different time frames and scoring procedures for each measure. To reduce the overall number of analyses for the subsequent validity analysis, we selected the time frames and scoring procedures that best met the requirements of reliability and efficiency. In our final analyses, we examined whether a combination of maze and word-translation measures predicted English FL proficiency better than a single measure. The following research questions were addressed in the study.
Research Question 1: What is the alternate-form reliability of maze-selection and word-translation tasks, and does alternate-form reliability differ for different scoring procedures and time frames?
Research Question 2: What is the validity of the maze-selection and word-translation tasks as indicators of general FL proficiency in English?
Research Question 3: Does a combination of the maze-selection and word-translation tasks improve the prediction of general FL proficiency in English over the use of either measure alone?
Method
Participants
Participants were 260 (112 male) students in Grades 8 (36.9%) and 9 (63.1%) from a secondary school in an urban city in the Netherlands. The mean age of the students was 15 (SD = 1.21; range = 12–19). The birthplace of participants was The Netherlands (61.8%), Morocco (7.7%), Turkey (7.3%), Suriname (6.9%), and other countries (16.3%). Place of birth for the parents of the participants was The Netherlands (9.1%), Morocco (44.7%), Turkey (20.3%), Suriname (8.6%), Netherlands Antilles (5.1%), and other countries or a different country for each parent (12.2%).
Participants were recruited from English FL courses, with a total of 20 classrooms and 20 teachers. Students in eighth grade were in their 2nd year and students in ninth grade in their 3rd year of English-language instruction. Secondary schools in the Netherlands are organized into different educational levels, from practical (lowest level) to preuniversity (highest level). Different school levels often are housed in different buildings, although it is possible for two to three levels to be combined in one building. Assignment of students to school level is based primarily on students’ grades during their elementary school years and on scores on a test given to students at the end of sixth grade. Participants in our study were from four school levels: very low (27.7%), low (32.3%), intermediate (22.7%), and high (17.3%). The level of instruction in English differed per grade and school level.
Predictor Variables
Predictor variables were the three potential FL progress-monitoring measures: maze-selection, Dutch-to-English word translation, and English-to-Dutch word translation.
Maze task
The maze task was a reading passage in which the first sentence was left intact and thereafter every seventh word was replaced by a multiple-choice item with one correct answer and two distracters (Fuchs & Fuchs, 1992). The maze tasks used in this study were selected from EdCheckup (2010), a system designed to monitor progress in reading for English-speaking elementary school children. The maze passages were developed from narrative texts approximately 265 words in length. Two texts were selected for the study, with topics that were not country specific and content that was appropriate for secondary school students. The passages were selected from fourth-grade level material because it was thought to be easy enough for beginning language learners, but difficult enough to be sensitive to growth. Each passage contained 28 multiple-choice items. Distracters for the passages were modified if necessary so that they were approximately the same length (plus or minus one letter) as the correct word. Median alternate-form reliabilities for the Grade 4 maze passages of EdCheckup (2010) were reported to be above .70. Correlations between the maze-selection tasks and scores on the Measures of Academic Progress (MAP; Northwest Evaluation Association [NWEA], 2003) were .57 (construct validity) and .53 (predictive validity).
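The construction rule described above (first sentence intact, every seventh word thereafter turned into a three-option item) can be sketched in a few lines. This is only an illustration under stated assumptions, not the EdCheckup procedure: the function name `build_maze` and the caller-supplied `pick_distracters` helper are hypothetical, punctuation attached to a word is left in place, and the options are not shuffled.

```python
import re

def build_maze(passage, pick_distracters, step=7):
    """Return (first_sentence, items).

    `items` holds plain words, except that every `step`-th word after the
    first sentence is a tuple (correct_word, distracter_1, distracter_2).
    `pick_distracters` is a hypothetical caller-supplied function that
    returns two distracters for a given word.
    """
    # Split off the first sentence so it can be left intact.
    parts = re.split(r"(?<=[.!?])\s+", passage.strip(), maxsplit=1)
    first = parts[0]
    rest = parts[1] if len(parts) > 1 else ""

    items = []
    for position, word in enumerate(rest.split(), start=1):
        if position % step == 0:
            d1, d2 = pick_distracters(word)
            items.append((word, d1, d2))  # first element is the correct choice
        else:
            items.append(word)
    return first, items
```

In practice the distracters would be chosen by hand or from a word list so that they are clearly incorrect and close in length to the target word, as described in the text.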
Maze scoring
We examined the effects of different time frames and scoring procedures for the maze task. With regard to time frames, scores for both 1 and 2 min were examined. With regard to scoring procedures, four different procedures were examined, each combining two methods used to control for error due to guessing in scoring. The first method was a “rule” versus “no-rule” comparison. In the “rule” condition, a counting rule employed in previous maze-selection research (e.g., Espin et al., 2010; Tichá et al., 2009) was used; after three consecutive incorrect choices, scoring was stopped. In the “no-rule” condition, this rule was not applied and scores included all correct choices that the student made. The second method was a “correct” versus “correct − incorrect” comparison. In the “correct” condition, only the number of correct selections was counted. In the “correct − incorrect” condition, the number of incorrect answers was subtracted from the number of correct answers to derive the score. These two approaches were crossed to create four scoring procedures: (a) rule and correct choices, (b) rule and correct minus incorrect choices, (c) no-rule and correct choices, and (d) no-rule and correct minus incorrect choices.
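The four maze scoring procedures can be illustrated with a small function that crosses the two options. This is a minimal sketch, not the scoring protocol used in the study: the function name `score_maze` is hypothetical, and it assumes that under the rule nothing after the third consecutive error is counted, while the three errors themselves still count as incorrect choices.

```python
def score_maze(choices, rule=True, subtract_incorrect=False):
    """Score one maze passage.

    choices: list of booleans, one per item the student answered within the
    time limit, in passage order (True = correct selection).
    rule: apply the three-consecutive-errors stopping rule.
    subtract_incorrect: return correct minus incorrect instead of correct.
    """
    correct = incorrect = streak = 0
    for is_correct in choices:
        if is_correct:
            correct += 1
            streak = 0
        else:
            incorrect += 1
            streak += 1
            # Assumed stopping rule: ignore everything after the third
            # consecutive incorrect choice.
            if rule and streak == 3:
                break
    return correct - incorrect if subtract_incorrect else correct


# Hypothetical response pattern: two correct, three errors, two correct.
answers = [True, True, False, False, False, True, True]
for use_rule in (True, False):
    for minus in (False, True):
        print(use_rule, minus, score_maze(answers, use_rule, minus))
```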
Word-translation tasks
The word-translation tasks required students to translate words from English to Dutch or from Dutch to English. The words represented all parts of speech, although nouns were overrepresented. Fifty words were presented on the left side of a page. Next to each word was a blank for students to write the translated word. Two parallel probes were created for each type of task (English to Dutch or Dutch to English translation). Words for the probes were randomly selected without replacement from the English-language curriculum used in the school. Two levels of the word-translation tasks were created: intermediate and high. Levels were assigned to classes based on teacher recommendation. The eighth-grade very-low school-level students did not complete the word-translation tasks. An overview of task assignments is presented in Table 1.
Table 1. Overview of Tasks and English Reading Test Levels Administered to Students Broken Down Per Grade and School Level.
Note. “x” indicates what measures the students received. Empty cells indicate that students did not complete the measure.
Word-translation scoring
As with the maze, we examined different time frames and scoring procedures for the word-translation tasks. With regard to time, scores for both 1 and 2 min were examined. With regard to scoring procedures, four different procedures were examined, each combining two scoring methods. The first method was “spelling” versus “no spelling.” For the “spelling” method, the translated word had to be spelled correctly to be counted as correct. For the “no-spelling” method, the word had to approximate the correct spelling of the word to be counted as correct. If the word was read aloud and it sounded like the correct word, it was marked as correct. The second method was a “correct” versus “correct − incorrect” comparison. In the “correct” condition, only the number of correct translations was counted. In the “correct − incorrect” condition, the number of incorrect translations was subtracted from the number correct to derive the score. These two methods were crossed to create four scoring procedures: (a) spelling and correct translations, (b) spelling and correct minus incorrect translations, (c) no-spelling and correct translations, and (d) no-spelling and correct minus incorrect translations.
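The four word-translation scores can be derived from a single pass of human judgments. The sketch below assumes that the scorer labels each attempted word as `'exact'`, `'approx'`, or `'wrong'` (hypothetical labels, not taken from the study), that unattempted words are not included, and that a misspelled translation counts as incorrect under the spelling method.

```python
def score_translations(judgments, spelling=True, subtract_incorrect=False):
    """Score one word-translation probe.

    judgments: one label per attempted word, assigned by a human scorer:
      'exact'  = correct translation, correctly spelled
      'approx' = correct translation, misspelled but sounds right when read aloud
      'wrong'  = incorrect translation
    spelling: if True, only correctly spelled translations count as correct.
    subtract_incorrect: return correct minus incorrect instead of correct.
    """
    accepted = {"exact"} if spelling else {"exact", "approx"}
    correct = sum(1 for label in judgments if label in accepted)
    incorrect = len(judgments) - correct  # assumption: all other attempts are errors
    return correct - incorrect if subtract_incorrect else correct
```

Keeping the sound-alike judgment as an input, rather than automating it, reflects the fact that the no-spelling method relies on scorer judgment.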
Criterion Variables
Years of English instruction and school level
Students were either in their 2nd (Grade 8) or 3rd (Grade 9) year of English-as-a-foreign-language instruction, and were in one of four different school levels. School level was on an ordinal scale ranging from very low to high. Students with more years of English instruction or in a higher school level were expected to be, on average, more proficient in English than students with fewer years of English instruction or in a lower school level.
Final English course grades
Course grades were assigned by the English teacher at the end of the school year. Grades were based on performance on various elements of English-language learning: reading, vocabulary, grammar, writing, listening, and pronunciation. Grades ranged from 1 to 10, with 1 being the lowest grade, 6 representing a passing mark, and 10 being the highest grade. Decimal points were possible for grades (e.g., 6.7). Analyses with course grades were conducted within school year (8th and 9th grade) and school level.
Scores on standardized English reading test
At the end of the school year, the English reading subtest of a standardized achievement test (Cito, 2010) was administered by the school. The subtest consisted of short expository passages and multiple-choice questions with three, four, or five possible answers. For each passage, one or two questions were asked. Administration time was approximately 90 min with 40 multiple-choice questions in total.
The technical adequacy of the English reading subtest was available only for an earlier version of the test. (Research on the technical adequacy of the newer version was underway at the time of the study.) The newer version was similar to the older version, differing primarily in content. The internal-consistency reliability of the earlier version was reported as high with Cronbach’s alpha ranging from .87 to .95. Validity was established by examining differences across grades and school levels. Mean scores were found to increase both by grade level within school and by school level within grade (Nederlands Instituut van Psychologen, 2011).
The English reading subtest consisted of three different levels for each grade. Scores on the English reading subtest were provided to schools as both standard and percentile scores. Only the percentile scores were available from the school; thus, comparisons were done within English reading test level and grade. An overview of the English reading test levels for the students who completed the standardized English reading test is provided in Table 1.
Procedure
The maze and word-translation tasks were administered in March in a group setting by the classroom teachers, who received training in using the correct administration procedures. The training was approximately 1.5 hr, and included a description of the theoretical background of CBM and the progress measures, and practice in implementation. Some teachers administered both the maze and word-translation tasks in one session; others administered the maze in one session and the word translation in another; however, all tasks were administered within the same week. The duration of the administration was approximately 15 to 20 min per classroom.
All students first completed the two maze tasks. For each task, they silently read the text, circled choices during reading, made a slash at 1 min to mark their progress, and stopped working at 2 min. In the same or second session, the students completed four word-translation tasks. For the first two probes (English to Dutch), students translated words from English to Dutch. For the following two probes (Dutch to English), students translated words from Dutch to English. For all four word-translation probes, students circled the last word they had read after 1 min, and stopped working after 2 min. All tasks were preceded by an example exercise. All students received the same maze tasks, but received the word-translation tasks at either the intermediate or high difficulty level (see Table 1). The order for parallel forms of each task was counterbalanced across classrooms.
To check fidelity in administration, teachers’ first administration of all tasks was observed by trained graduate students. Training of the observers took approximately an hour and consisted of a description and demonstration of the administration procedures and a practice observation with the trainer. The students then observed each other administering the task.
In all observed classes, the instructions were read clearly by the teacher and students were judged by the observers to understand the instructions. Five classrooms (20%) were removed from the sample because of incorrect timing. Demographic information for the students and scores on criterion variables were obtained from the school at the end of the school year.
Scoring Accuracy
The maze and word-translation tasks were scored by graduate students who were provided approximately 1.5 hr of training. For each type of measure, one probe was scored together, then two probes were scored individually and discussed. Scoring accuracy for the individually scored probes had to be above 90% for the maze and above 80% for the word-translation tasks before scorers could continue scoring. All scorers reached this level of accuracy on their first attempt.
During scoring, every 20th maze task for each scorer was double scored. Scoring accuracy for the maze was 99.93% (range = 94.1%–100%). For the word-translation task, every 10th task for each scorer was double scored. Scoring accuracy for the English-to-Dutch translation was 97.71% (range = 89%–100%) with spelling and 97.43% (range = 85%–100%) without spelling. For Dutch-to-English translation, the scoring accuracy was 96.79% (range = 89%–100%) with spelling and 96.13% (range = 82%–100%) without spelling.
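The text does not state how scoring accuracy was computed. One common approach, shown here purely as an assumption, is item-level percent agreement between the original scorer and the second scorer on the double-scored tasks. The function names are hypothetical.

```python
def double_scored_sample(task_ids, every=20):
    """Pick every `every`-th task (the 20th, 40th, ...) for double scoring."""
    return task_ids[every - 1::every]


def percent_agreement(scorer_a, scorer_b):
    """Percentage of items on which two scorers gave the same mark.

    Assumption: accuracy is item-level agreement; the source does not
    specify the formula that was used.
    """
    if len(scorer_a) != len(scorer_b) or not scorer_a:
        raise ValueError("both scorers must mark the same, non-empty set of items")
    matches = sum(a == b for a, b in zip(scorer_a, scorer_b))
    return 100.0 * matches / len(scorer_a)
```

For the word-translation tasks, the same sampling helper would be called with `every=10`.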
Data Analyses
To address alternate-form reliability, Pearson correlations were calculated between the two parallel forms for each measure, time frame, and scoring procedure. For each measure, the time frame and scoring procedure with the highest reliability were used in subsequent validity analysis.
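As an illustration of the alternate-form reliability computation, the Pearson correlation between scores on two parallel forms can be calculated as follows. The scores in the example are made up for demonstration and are not data from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of five students on Form A and Form B of one measure.
form_a = [10, 14, 8, 20, 16]
form_b = [11, 13, 9, 18, 17]
print(round(pearson_r(form_a, form_b), 2))
```

In the study this calculation would be repeated for each measure, time frame, and scoring procedure, yielding one coefficient per combination.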
To determine validity, two types of analyses were conducted. First, for the maze, where the same maze passages were used for all participants, a two-way ANOVA was conducted to examine differences in grade and school level on the maze task. Differences in mean score by grade and school level would support the validity of the measure as an indicator of performance. Second, for all measures, Pearson correlations between progress measures and English course grades and the standardized English reading test were calculated within subgroups. Given that sample sizes were small for this analysis, we viewed the analysis as exploratory, and focused on the general pattern of results and the magnitude of the correlations. Finally, two forward stepwise regression analyses were executed on two subsamples to determine whether the combination of the maze and word-translation tasks would improve the prediction of English reading and language proficiency over the use of a single measure alone.
Results
For the maze task (see Table 2), means and standard deviations were similar across Forms A and B. Students made approximately 9 to 9.5 choices per passage in the first minute, and 7 to 8 choices in the second minute. The mean scores were higher for correct than for correct minus incorrect scores, with an average of 2.5 to 3 incorrect choices per passage in 2 min in the rule condition and 5 to 5.5 in the no-rule condition. Within the correct and correct minus incorrect scoring methods, mean scores for the rule versus no-rule scoring approaches were similar, suggesting that students did not do much guessing while completing the maze task. Ceiling effects were found for ninth-grade students in the high-level school.
Table 2. Means and SDs for the Maze Scores, Form A, Form B, and Combined Score (Sum of Forms A and B).
Note. “C” is the number of correct choices. “C-I” is the number of correct minus incorrect choices.
Results for the word-translation tasks are reported in Table 3 (intermediate difficulty level) and Table 4 (high difficulty level). For the intermediate level, scores were somewhat different for Forms A and B, with consistently lower scores seen for Form B. Examination of the combined scores across Forms A and B (last column) reveals that students tended to make more correct translations in the first minute (approximately 11 to 14) than in the second (approximately 7.5 to 9). Mean scores for the spelling and no-spelling rules were similar in the English-to-Dutch version, revealing that students made few spelling mistakes in their native language; however, in the Dutch-to-English version, observed scores for the spelling and no-spelling methods were different with students making on average 4.65 spelling errors across the two probes in 2 min (mean score of 22.83 for no spelling vs. 18.18 for spelling). The number of correct translations in 2 min was larger for the English-to-Dutch version than for the Dutch-to-English version (approximately 22 vs. 18), but only when spelling was taken into account. Scores for correct − incorrect were substantially lower than for correct only.
Table 3. Means and SDs for the Word-Translation Scores, Intermediate Difficulty Level, Form A, Form B, and Combined Scores (Sum of Forms A and B).
Note. “C” is the number of correct translations. “C-I” is the number of correct minus incorrect translations.
Table 4. Means and SDs for the Word-Translation Scores, High Difficulty Level, Form A, Form B, and Combined Scores (Sum of Forms A and B).
Note. “C” is the number of correct translations. “C-I” is the number of correct minus incorrect translations.
For the high-difficulty level probe, scores tended to be similar across Forms A and B (see Table 4). As with the intermediate level, examination of the combined scores across Forms A and B (last column) reveals that students tended to make more correct translations in the first minute (approximately 7.5 to 13) than in the second (approximately 6 to 7.5). Mean scores tended to be somewhat lower for the spelling than the no-spelling rule, especially for the Dutch-to-English version, where, similar to the intermediate level, students made on average 4.68 spelling errors across the two probes in 2 min (mean score of 20.80 for no spelling vs. 16.12 for spelling). The number of correct translations was lower for the English-to-Dutch version (approximately 13.5 with spelling vs. 16 without spelling) than for the Dutch-to-English version (approximately 16 with spelling vs. 21 without spelling). As with the intermediate level, scores for correct − incorrect were substantially lower than for correct only.
Alternate-Form Reliability
Alternate-form reliability coefficients are reported in Table 5. Correlations ranged from r = .44 to .88. Across all measures and scoring approaches, reliability increased with probe duration. Thus, in discussing the table, we focus on the coefficients for the 2-min probes. For the maze tasks, reliability coefficients tended to be higher for the correct minus incorrect than correct-only scoring procedures, but were similar for the rule versus no-rule scoring procedures. Coefficients for correct minus incorrect were .78 for both rule and no-rule methods.
Table 5. Alternate-Form Reliability of Forms A and B of Maze, Dutch-to-English, and English-to-Dutch Translation.
Note. All correlations were significant at the p < .001 level. “C” is the number of correct choices on the maze or translations on the word-translation task. “C-I” is the number of correct minus incorrect choices on the maze or translations on the word-translation task.
For the word-translation tasks, similar patterns were found across the two difficulty levels. For both levels, scoring the number of correct translations resulted in stronger reliabilities than scoring correct minus incorrect. Within the correct scoring method, few differences were seen for spelling versus no-spelling rules. Finally, reliability coefficients tended to be higher for the Dutch-to-English (r = .65–.88) than the English-to-Dutch versions (r = .59–.80), but differences were more evident for the intermediate- than for the high-difficulty level tasks.
To reduce the overall number of statistical tests for the subsequent validity analysis, the scoring procedures and duration with the highest reliabilities for each measure were selected to carry forward. If reliabilities were similar across scoring procedures and duration, then ease, efficiency, and previous research were taken into consideration in making the selection.
For the maze task, the number of correct minus incorrect choices made in 2 min was selected. Reliability coefficients were the same across the rule and no-rule conditions. Although the maze is easier to score without the use of a three-in-a-row rule, in this case, we elected to use the scoring rule because in most previous maze research, this rule has been implemented. Use of the rule allowed us to compare our results with those of previous research.
For the word-translation tasks, the number of correctly spelled translations in 2 min was selected. Although reliabilities tended to be somewhat higher for the Dutch-to-English than the English-to-Dutch version, we maintained both versions for the validity analysis because these measures have not been considered in previous research as CBM progress measures. Reliabilities for spelling and no-spelling scoring methods were similar, but scoring correctly spelled translations involves less judgment and is therefore less time-consuming; thus, we selected scoring correctly spelled translations.
Validity
To examine the validity of the measures, correlations between the potential progress measures and English course grades and percentile scores on the standardized English reading test were examined. In addition, for the maze-selection task (which was the same across all participants) differences in mean scores for students in their 2nd versus 3rd year of English and across very-low to high school levels were examined. Combined scores across Forms A and B were used for all analyses.
Mean differences in maze-selection
Mean score differences in maze performance between more and less proficient groups (in our case, students with more or fewer years of English-language instruction or students in higher and lower school levels) would support the validity of the maze measure as an indicator of general FL proficiency. We conducted a two-way ANOVA to determine whether there were significant differences by grade and school level. Recall that students in eighth grade were in their 2nd year and students in ninth grade in their 3rd year of English FL instruction. Means and standard deviations broken down by grade and school level are presented on the left side of Table 6. A main effect was found for school level, F(3, 252) = 24.38, p < .001, but not for grade, F(1, 252) = 1.67, p = .20. There was no interaction effect, F(3, 252) = .15, p = .93. Post hoc analyses of school-level differences revealed that there were significant differences between each adjacent school level, except between the intermediate and high levels. Regarding grade, although differences were not statistically significant, mean scores tended to be higher for ninth- than for eighth-grade students.
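The degrees of freedom reported for the F tests can be checked against the design. The sketch below assumes a full-factorial two-way ANOVA with an interaction term, all eight grade-by-level cells populated, and all 260 participants included; the function name is hypothetical.

```python
def two_way_anova_df(n_total, levels_a, levels_b):
    """Degrees of freedom for a full-factorial two-way ANOVA.

    Assumes every cell of the levels_a x levels_b design contains data.
    """
    return {
        "A": levels_a - 1,
        "B": levels_b - 1,
        "AxB": (levels_a - 1) * (levels_b - 1),
        "error": n_total - levels_a * levels_b,
    }


# 260 students, 2 grades (factor A), 4 school levels (factor B).
print(two_way_anova_df(260, 2, 4))
```

Under these assumptions the error term has 260 − 8 = 252 degrees of freedom, which matches the values reported for grade, F(1, 252), and for school level and the interaction, F(3, 252).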
Means and SDs Second Minute Scores of Maze, Dutch-to-English, and English-to-Dutch Translation Broken Down Per Grade and School Level.
Note. Empty cells indicate that students did not complete the measure.
Word translation: Intermediate difficulty level.
Word translation: High difficulty level.
Correlations with English course grades
In the second set of analyses, scores on the maze-selection and word-translations tasks were correlated with end-of-the-year English course grades. This analysis was conducted within grade and school level because the relative meaning of course grades differs across grade and school levels. Splitting the sample by grade and school level resulted in small samples, ranging from 14 to 48 participants per group.
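The within-group approach described above can be sketched as follows. This is an illustration only: the record layout, the function names, and the toy values are hypothetical and are not the study data. It shows a Pearson correlation computed separately for each combination of grade and school level.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return sxy / sqrt(sxx * syy)


def correlations_by_group(records):
    """records: iterable of (grade, school_level, cbm_score, course_grade)."""
    groups = defaultdict(lambda: ([], []))
    for grade, level, cbm_score, course_grade in records:
        scores, grades = groups[(grade, level)]
        scores.append(cbm_score)
        grades.append(course_grade)
    return {key: pearson_r(scores, grades) for key, (scores, grades) in groups.items()}


# Hypothetical toy records, NOT data from the study.
toy_records = [
    (8, "low", 10, 5.0), (8, "low", 12, 6.0), (8, "low", 14, 7.0), (8, "low", 16, 8.0),
    (9, "low", 10, 6.0), (9, "low", 12, 5.5), (9, "low", 14, 7.0), (9, "low", 16, 7.5),
]
for group, r in correlations_by_group(toy_records).items():
    print(group, round(r, 2))
```

Computing the coefficient within each group avoids mixing course grades whose meaning differs across grades and school levels, at the cost of the small group sizes noted above.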
Means and standard deviations for the maze and word-translations tasks broken down by grade and school level are presented in Table 6. Means and standard deviations for the criterion variables broken down by grade and school level or reading test level are reported in Table 7. English course grades ranged from 3.0 to 8.8 with a mean score of 6.29. Percentile scores on the reading test ranged from 0 to 100, with mean scores per test level ranging from approximately 19.5 to 66.5 percentile.
Means and SDs of English Course Grades and Percentile Scores on English Reading Test, Broken Down in Grade- and School Level or in English Reading Test Level.
Note. — = no information was available for reading test level.
Pearson correlations between each progress measure and English course grades are reported in Table 8. Correlations between the maze tasks and course grades ranged from r = .20 to .79, with all but one correlation above .40. Correlations between the word-translation tasks and course grades ranged from r = .44 to .77. Correlation coefficients tended to be similar across the three measures, although patterns varied somewhat by school level and grade.
Correlations Between English Course Grades and Maze and Word-Translation Tasks.
Word translation: Intermediate difficulty level.
Word translation: High difficulty level.
Percentile scores on the standardized English reading test
In the third set of analyses, correlations were calculated between scores on the progress measures and percentile scores on a standardized English reading test. The English reading test consisted of six possible reading test levels. At eighth grade, very-low schools administered Level 1, low and intermediate schools administered Level 2, and high schools administered Level 3. At ninth grade, very-low schools administered Level 4, low and intermediate schools administered Level 5, and high schools administered Level 6. The scores for ninth-grade very-low schools were not available. Because of the difference in test levels, and because only percentile scores (and not standard scores) were available from the school, comparisons could only be made within reading test level, resulting in small samples ranging from 20 to 27 participants per group. The results are presented in Table 9.
Correlations Between Percentile Score on English Reading Test, Per Reading Test Level, and Maze and Word-Translation Tasks.
Note. — = no information was available for either reading test level or progress measure.
Word translation: Intermediate difficulty level.
Word translation: High difficulty level.
Correlations between the maze scores and the reading test ranged from r = .40 to .78. Correlations between the reading test and the Dutch-to-English word-translation task ranged from r = .37 to .65 and for English-to-Dutch translation, from r = .19 to .75. Correlations tended to be higher for the maze task than for either of the word-translation tasks.
Regression analysis
In the final set of analyses, a regression analysis was conducted to examine whether a combination of maze and word-translation tasks would improve the prediction of English course grades over using either measure alone. If measures were to be given at the beginning of the school year to identify students who might experience difficulties in language learning, then it would be important to know whether two measures would improve prediction over one measure alone. Students from ninth-grade very-low (n = 43) and low (n = 48) school levels were selected for the regression analyses. These groups had the largest sample size, and represented both levels of the word-translation task (intermediate level for the very-low group and high level for the low group).
A separate forward stepwise regression was conducted for each group in which the three predictor measures were entered into the equation (maze, English-to-Dutch translation, and Dutch-to-English translation). Given the necessity of doing the analysis with small samples, we consider the analysis exploratory. Results can be used to guide future analyses with larger sample sizes.
Results of the regression analyses are presented in Table 10. For the very-low group, maze entered first into the equation and was a significant predictor of English course grades, explaining 60% of the variance in course grades. Adding the English-to-Dutch word-translation task significantly added to the prediction of course grades over the maze task alone, with the explained variance increasing to 66%. The Dutch-to-English word-translation task did not enter into the equation.
Relative Contributions of Maze and Word-Translation Tasks in Stepwise Regression Analyses on English Course Grades, n = 43 (Ninth-Grade, Very-Low School-Level Students) and n = 48 (Ninth-Grade, Low School-Level Students).
Note. WT En-Du = Word-Translation English to Dutch; WT Du-En = Word-Translation Dutch to English.
Word translation: Intermediate difficulty level.
Word translation: High difficulty level.
For the low group, the English-to-Dutch translation task entered first into the equation, and was a significant predictor of English course grades, explaining 41% of the variance. Adding the maze task significantly improved the prediction of course grades over the word-translation task alone, with the explained variance increasing to 49%. The Dutch-to-English word-translation task did not enter into the equation.
In sum, the results of the regression revealed that for both the very-low and low groups, the use of two measures improved prediction over the use of one measure alone. The pattern of results was similar for the two groups, with maze and English-to-Dutch translation contributing to the prediction of course grades, but the order of entry differed, with maze entering first for the very-low group and English-to-Dutch word translation entering first for the low group. The overall predictive power of the two measures was lower for the low group than for the very-low group.
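The gain from adding a second predictor can be illustrated with the standard F test for a change in R² between nested regression models. The sketch below uses only the rounded percentages and group sizes given in the text, so the resulting values are approximations for illustration and are not the statistics reported in Table 10. The function name is hypothetical.

```python
def r2_change_f(r2_reduced, r2_full, n, k_full, k_added=1):
    """F statistic for the increase in R^2 when k_added predictors are added.

    n is the sample size and k_full the number of predictors in the larger model.
    Returns (F, df_numerator, df_denominator).
    """
    if not 0 <= r2_reduced <= r2_full < 1:
        raise ValueError("expected 0 <= r2_reduced <= r2_full < 1")
    df_error = n - k_full - 1
    if df_error <= 0:
        raise ValueError("sample too small for the requested model")
    f_value = ((r2_full - r2_reduced) / k_added) / ((1 - r2_full) / df_error)
    return f_value, k_added, df_error


# Rounded values taken from the text (assumption: rounding error is ignored).
very_low = r2_change_f(r2_reduced=0.60, r2_full=0.66, n=43, k_full=2)
low = r2_change_f(r2_reduced=0.41, r2_full=0.49, n=48, k_full=2)
print("very-low: F(%d, %d) = %.2f" % (very_low[1], very_low[2], very_low[0]))
print("low:      F(%d, %d) = %.2f" % (low[1], low[2], low[0]))
```

With these rounded inputs, both increments give an F of about 7.06 (on 1 and 40, and 1 and 45, degrees of freedom), which is consistent with the statement that the second measure added significantly to the prediction in each group.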
Discussion
The purpose of this study was to examine the technical adequacy of potential progress-monitoring measures for FL learning. Maze-selection and word-translation tasks were examined. Both emerged as potential measures for FL progress monitoring.
The first research question addressed the alternate-form reliability of the measures. If CBM measures are to be administered on a repeated basis to reflect growth, a necessary (but not sufficient) requirement is that alternate forms be reliable. For each measure, differences in reliabilities for various scoring procedures and time frames were investigated. For maze-selection, four scoring procedures were examined, formed by crossing two approaches: rule versus no rule, and correct versus correct minus incorrect. In addition, 1- and 2-min time frames were compared. Results of the alternate-form reliability analysis supported the use of a 2-min probe scored for the number of correct minus incorrect choices. Few differences in reliability coefficients were seen between the rule and no-rule approaches.
The cessation of scoring following three incorrect choices (rule condition) is done to help to control for guessing; however, as evidenced by the similarities in mean correct scores between the rule and no-rule scoring conditions (see Table 2), it would seem that students in our sample did not do much guessing. Given that scoring is easier without the use of a three-in-a-row rule, we could suggest scoring without use of a scoring rule; however, there are two points to consider before abandoning the rule completely. First, our study is one of the first to examine CBM progress measures in FL learning. It would be wise to replicate these findings before abandoning the scoring rule altogether. Second, it may be that the scoring rule is important for a small number of students who struggle with FL learning or who do not always complete the probes carefully. Such small numbers may not affect group mean scores or alternate-form reliability results, but would affect the individual growth rates produced via repeated administration of the measures. Thus, at this moment, we suggest that the scoring rule be maintained until further research is conducted.
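The four maze scoring procedures discussed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the scoring protocol used in the study: it assumes that, under the rule, scoring stops immediately after the third consecutive incorrect choice, that those three choices still count as incorrect, and that all later choices are ignored. The function and parameter names are hypothetical.

```python
def score_maze(choices, use_rule=True, max_consecutive_errors=3):
    """Score a maze probe.

    choices: booleans in the order attempted (True = correct choice).
    use_rule: if True, stop scoring after max_consecutive_errors incorrect
              choices in a row (assumed form of the three-in-a-row rule).
    """
    correct = 0
    incorrect = 0
    streak = 0  # current run of consecutive incorrect choices
    for is_correct in choices:
        if is_correct:
            correct += 1
            streak = 0
        else:
            incorrect += 1
            streak += 1
            if use_rule and streak == max_consecutive_errors:
                break  # later choices are not scored
    return {
        "correct": correct,
        "incorrect": incorrect,
        "correct_minus_incorrect": correct - incorrect,
    }


# Hypothetical response pattern, NOT data from the study.
responses = [True, True, False, True, False, False, False, True, True]
with_rule = score_maze(responses, use_rule=True)
without_rule = score_maze(responses, use_rule=False)
print(with_rule)     # {'correct': 3, 'incorrect': 4, 'correct_minus_incorrect': -1}
print(without_rule)  # {'correct': 5, 'incorrect': 4, 'correct_minus_incorrect': 1}
```

For a student who never makes three consecutive errors, the two conditions give identical scores, which is in line with the similar group means reported for the rule and no-rule conditions. The rule matters only for response patterns like the hypothetical one above.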
Similar to previous research on the maze in native language at the secondary school level, reliabilities in our study increased with time (Espin et al., 2010; Tichá et al., 2009); however, different from this previous research, our reliability coefficients were consistently below r = .80 (range of r = .69–.78 compared with r = .79–.90), and reliabilities for correct minus incorrect were larger than for correct only. These differences are most likely related to the fact that previous research was conducted with students reading the maze tasks in their native language rather than in a FL. The average number of errors found in maze research in the native language is approximately one error in 2 min (Espin et al., 2010), whereas in our study, students made on average 2.5 to 3 errors (per passage) in 2 min. It would seem that not only fluency but also accuracy contributes to the stability of measurement in FL learning.
The reliability coefficients are lower than those typically found for CBM measures, where coefficients typically exceed r = .80. Given that a consistent finding in previous research on maze is that reliabilities increase with probe duration, it would be prudent in future research to examine reliabilities for a 3-min probe. It is especially important that reliability be higher if the measures are used as screening measures to identify students who might experience difficulties with FL learning. If administering repeated measures over time, the current reliabilities might be acceptable, especially if slopes are drawn after collection of 10 to 12 data points. In addition, in other content areas such as science (see Espin et al., 2013), there is evidence that reliability of word-knowledge measures increases as knowledge in the content area increases. Thus, it may be that if measures were to be administered over time, alternate-form reliabilities would increase.
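As an illustration of how a growth slope might be drawn through repeated probe scores, the sketch below fits an ordinary least-squares line to scores from equally spaced administrations. Both the equal spacing and the choice of least squares are assumptions made for this example; the text does not specify how slopes would be computed. The function name and the scores are hypothetical.

```python
def growth_slope(scores):
    """Least-squares slope of scores over equally spaced administrations (1, 2, ..., n)."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two data points are needed to estimate a slope")
    occasions = range(1, n + 1)
    mean_t = (n + 1) / 2
    mean_y = sum(scores) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(occasions, scores))
    sxx = sum((t - mean_t) ** 2 for t in occasions)
    return sxy / sxx


# Hypothetical scores from 10 administrations, NOT data from the study.
probe_scores = [10, 12, 11, 13, 14, 13, 15, 16, 16, 18]
print(round(growth_slope(probe_scores), 2))
```

Because the slope pools information across all administrations, error in any single probe score has less influence on the estimate as the number of data points grows, which is the reasoning behind waiting for 10 to 12 data points.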
For the word-translation tasks, four scoring procedures were also examined, formed by crossing two scoring approaches: spelling versus no spelling, and correct versus incorrect. In addition, 1- and 2-min time frames were compared. Finally, two versions of the word-translation measure were compared: Dutch-to-English and English-to-Dutch. Results of the alternate-form reliability analysis supported a 2-min Dutch-to-English translation probe scored for the number of correctly spelled translations.
The second research question addressed the validity of the measures. To reduce the overall number of analyses, only a select number of measures were considered for the validity analysis. For the maze-selection, the number of correct minus incorrect choices made in 2 min was examined. Based on the fact that previous research typically has made use of a three-in-a-row scoring rule in maze, scores calculated with the scoring rule were used for the validity analysis to allow for comparison with previous research. For the word-translations task, the number of correctly spelled translations in 2 min was examined. Both the Dutch-to-English and English-to-Dutch versions were considered in the validity analysis.
The patterns that emerged from the validity analysis provide tentative support for both the maze and word-translation tasks as indicators of performance, especially for students in their 3rd year of English (i.e., ninth-grade students). The majority of correlations for students in their 3rd year of English ranged from r = .60 to .78. For students in their 2nd year of English (i.e., eighth-grade students), correlations tended to be lower, ranging in general from r = .42 to .79. This pattern was evident for both the criterion variables of course grades and standardized test scores. There was no consistent pattern across groups with regard to the relative strength of the correlations for maze versus word translations or for Dutch-to-English versus English-to-Dutch translation. Patterns differed by group, and given the fact that groups were small, it is difficult to draw conclusions about the relative validity of one measure versus the other.
For the maze-selection task, it was possible to compare mean scores across years of English-language instruction and across school levels. Mean scores on the maze task were higher for students in their 3rd year of English (ninth grade) than in their 2nd year of English (eighth grade), but these differences were not significant. However, significant differences were found between school levels in the expected direction, with students in higher school levels scoring higher on the maze task than students in lower school levels.
In sum, results of mean difference scores among groups and of correlations between the CBM and criterion measures provide tentative support for the use of CBM measures as general indicators of FL performance. Data were more positive for students in their 3rd year of FL learning than in their 2nd year. It may be that performance becomes more stable in the 3rd year of learning a FL, and is therefore easier to sample with a brief probe of performance. Also, the criterion variables used in this study, course grades and the standardized reading test, had unknown technical adequacy. In any case, before recommending use of such measures for beginners in language learning, it is necessary to replicate the results with larger samples and with criterion variables with good technical adequacy for the validity analysis.
In our final set of analyses, we examined whether combining two measures predicted performance in English FL learning better than a single measure. The sample sizes were small, so we view the results as suggestive only, and recommend that they be followed up in future research. In general, results revealed that a combination of measures predicted better than a single measure, and that both a reading-related task (maze) and a word-knowledge task (word translation) contributed to the prediction of course grades, with the addition of a second variable accounting for an additional 6% to 8% of the variance. Furthermore, for both groups, the English-to-Dutch word-translation task contributed to the prediction of course grades, but not the Dutch-to-English task, although in a different order with maze entering the equation first for the very-low group and second for the low group.
Both the maze and English-to-Dutch word-translation tasks require recognition rather than production of English words. It may be that at lower levels of learning a FL (as was the case in our two samples), recognition tasks differentiate students better than production tasks. If we had been able to conduct a regression with higher performing students, we might have obtained a different pattern of results. It is also important to keep in mind that the English-to-Dutch version of the word-translation task had lower alternate-form reliabilities than did the Dutch-to-English version, especially for the intermediate-level task. Thus, at this point, we cannot recommend one translation task over the other.
Conclusion
In conclusion, our results provide tentative support for the use of three potential measures for progress monitoring in FL learning: the number of correct minus incorrect choices in 2 min on a maze-selection task and the number of correctly spelled translations in 2 min on each of two word-translation tasks (English to Dutch and Dutch to English). For screening purposes, using both a reading and a word-translation measure resulted in better prediction than either measure alone. We found no clear pattern of differences between the Dutch-to-English and English-to-Dutch translation tasks and recommend that this be explored further.
This study is one of the first to explore the development of CBM progress measures for FL learning. Many questions remain unanswered. First, it is important that the results be replicated. Second, it might be interesting to investigate the technical adequacy of the progress measures with criterion variables in which different aspects of language proficiency (e.g., vocabulary/word knowledge, reading, listening, speaking, writing, and grammar) are represented. Third, once reliable and valid measures are identified, it is important that their technical characteristics as growth measures be examined. Most important is to examine whether teachers use progress data to inform instructional decision making and whether data use leads to improved student achievement in FL learning.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partly supported by a Subsidiëring Landelijke Onderwijsondersteunende Activiteiten (SLOA; National Education Funding) grant (2010–2013) from the Voortgezet Onderwijs Raad, Netherlands.
