Abstract
Vocabulary skills are important for overall reading competence, but vocabulary assessment approaches that inform instructional decision-making and are sensitive to improvement are limited. This article describes a process for developing vocabulary measures designed to facilitate data-driven decision-making for kindergarten and first-grade students who are at risk in vocabulary. A pilot study suggested the measures could be administered and scored with fidelity, and also produced promising data for indices of reliability, criterion-related validity, and sensitivity to growth, particularly for a rating-based scoring metric. Implications and considerations for developing instructionally relevant vocabulary measures are discussed.
A strong vocabulary is an important part of school success (Lesaux et al., 2010; Quinn et al., 2015; Stahl & Nagy, 2006), but too few students have sufficient vocabulary skills. At-risk students typically begin school knowing fewer words and tend to increase their vocabulary at slower rates than students with fewer risk factors (Anderson & Nagy, 1992). They are also less likely to benefit from evidence-based interventions (Marulis & Neuman, 2010).
Fundamental to more effective vocabulary intervention is the need for instructionally relevant and technically adequate measures of vocabulary. Vocabulary researchers recognize that, “If effective instruction is to become commonplace, we must first address the vexing question of how we assess vocabulary knowledge and growth” (Pearson et al., 2012, p. 231). Knowing how well students respond to effective instruction is a precondition to making beneficial adjustments to instruction, but vocabulary can be assessed in many ways that do not inform how well students respond to instruction. For example, standardized norm-referenced tests of vocabulary are popular, well-constructed, and can be very good overall indicators of vocabulary knowledge (e.g., Peabody Picture Vocabulary Test; PPVT). Such tests provide meaningful point-in-time comparisons of performance among students of the same age, but tend to be so broad that they are not sensitive to the impact of vocabulary intervention (National Reading Panel, National Institute of Child Health and Human Development, 2000).
By contrast, measures from a curriculum-based measurement (CBM) paradigm have an extensive research base that includes progress-monitoring to determine response to intervention (Deno, 2003). However, while CBMs have been created to assess the vocabulary of older students in content domains like science (e.g., Espin et al., 2013), few exist to directly assess the vocabulary improvement of young at-risk students. One challenge in applying conventional CBM approaches to vocabulary assessment is that the content domain is less constrained and less amenable to rate-based administration approaches compared with other CBMs of academic skills (Silberglitt, Parker, & Muyskens, 2016). For young students developing knowledge within a large corpus of concepts and ideas, it may therefore be necessary for vocabulary assessments to focus on the words targeted during instruction or intervention (Coyne et al., 2015). In existing research, such an approach was more sensitive to intervention effects than standardized, norm-referenced measures (Elleman et al., 2009; Marulis & Neuman, 2010), but few published studies describe the development and technical adequacy of approaches to assessing vocabulary for this purpose, particularly with young at-risk students receiving intervention. In this article, we describe a process to develop feasible, instructionally relevant vocabulary measures and present a pilot study of the resulting measures’ technical adequacy.
Measure Development Process
Context and Considerations
Our measure development process was organized around educational assessment frameworks that recognize the developmental nature of student knowledge, the need to ensure measurement practices are feasible and aligned with instruction, and the need for high-quality data (e.g., Wilson, 2009). With respect to the developmental nature of student knowledge, we recognized that our measures were intended to inform relatively low-stakes educational decisions about whether at-risk kindergarten and first-grade students acquire vocabulary taught during an instructional or intervention context. Other interpretations and uses of vocabulary data, such as determining overall vocabulary competence or making high-stakes decisions about needing more specialized instruction, were outside the scope and purpose of this measure development process.
With respect to feasibility, the measures needed to be appropriate for students in kindergarten and first grade. Children in early elementary grades have limited attention spans as well as limited experience with educational assessments. Thus, the measures needed to be relatively brief and use formats that would elicit vocabulary knowledge from kindergarten and first-grade students. Previous research used expressive, sentence-based response formats that prompted measurable responses and demonstrated utility for screening and documenting growth over time (Boomer & Stewart, 2016; Kaminski et al., 2007). Such research formed the basis for the current measure. Similarly, extant research was used to determine that assessment would occur in a one-on-one setting to obtain and hold the attention of students and maintain their engagement, increasing the likelihood of more accurate scores (Snow, 2011).
The measures also needed to be feasibly administered by assessors from various backgrounds—including teaching assistants, paraprofessionals, and volunteers (Parkes, 2013). As a result, administration procedures needed to be simple, unambiguous, and standardized. Scoring rules also needed to be well-defined with clear scoring protocols that included examples of student responses. Such administration characteristics were designed to improve administration feasibility, and by extension increase the likelihood of stronger overall fidelity across administrators.
With respect to technical adequacy, the measures needed to provide data that accurately and meaningfully measured student knowledge. Consistent with a classical test theory perspective, the measures needed to assess student knowledge of word meanings in a way that produced defensibly reliable and valid data (Salvia et al., 2017). They also needed to be sensitive to learning gains (Polikoff, 2010), because sensitivity to improvement is critical if the measures are to be used to inform decisions about vocabulary improvement.
Selection of Formats
We selected formats based on prior empirical work in vocabulary measure development. There are a variety of ways in which children can demonstrate knowledge of a word (Pearson et al., 2007). Within our specific context and considerations, we decided to assess vocabulary by having students use target words in a sentence. In previous research, at-risk kindergarten and first-grade students were able to provide measurable responses on a Word Use Fluency (WUF; Good et al., 2004) task in which the students were given a word and asked to use it in a sentence. Using the construct of a sentence enabled students to provide a succinct response, which kept administration brief and made scoring easier. The resulting data correlated moderately to highly (.55 < r < .71) with other general measures of language development (e.g., the Test of Language Development, language sample) and had moderate-to-strong alternate-form reliability (.65 < r < .90). In this research, sensitivity to growth was observed as a statistically significant increase of 0.46 words per week (Good et al., 2004). Subsequent research incorporated a rating metric to assess student understanding, in which full and coherent responses were given complete value, partial responses were assigned partial value, and entirely incoherent responses were assigned no value (Kaminski et al., 2013). Corresponding analyses showed the resulting data correlated moderately to highly (.44 < r < .71) with the Test of Language Development and the Clinical Evaluation of Language Fundamentals and demonstrated growth across the school year for students in kindergarten through third grade.
These findings provided evidence that a format based on sentences holds promise for the purposes of assessing kindergarten and first-grade vocabulary skills. The final format and administration directions for our first-grade measure, Words in Sentences (WIS), were based largely on a revised form of the WUF measure (Kaminski et al., 2013). Yet, while WUF demonstrated technical adequacy for use with young elementary school students, previous research identified potential floor effects with kindergarten students at the beginning of the year (Kaminski et al., 2007, 2013). As a result, although the two measures were very similar (see Supplemental Material), the kindergarten assessment, Tell Me, was based on a simpler open-ended prompt that found promising results when used with 5-year-old preschool students in the spring (Boomer & Stewart, 2016; Kaminski et al., 2014).
Selection of Words
Given the purpose of the assessment, the process of selecting assessment words began with the vocabulary intervention that provided the context for this project. The intervention employed repeated read-aloud procedures to teach preselected words from at least 20 age-appropriate books per grade (Biemiller & Boote, 2006). Criteria for selecting books included being written for an early elementary audience, readability within 10 to 20 min, inclusion of a fictional or nonfictional storyline (to ensure a variety of words), and subjects or authors from different cultures. Intervention target words within these books were identified based on their utility for understanding the story and on their potential for improving vocabulary knowledge overall (Biemiller, 2005). For kindergarten, a total pool of 240 words was taught during intervention, including 12 words from each of 20 books. For first grade, 260 words were taught during intervention, consisting of 10 words from each of 26 books.
Given the number of intervention words, it was not practical to assess students’ knowledge of all words. Instead, we used a sampling procedure that involved assigning a prevalence level to each word from the total pool of intervention words. To establish prevalence levels, each word in the total word pool was cross-referenced with 10 word lists that curated words for early learning according to preestablished criteria (see Supplemental Material). If a word was on a word list, it was assigned a point for prevalence. The total prevalence score ranged from 0 to 10 based on the number of lists on which the word appeared, and the prevalence scores were then organized into three strata.
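The prevalence scoring described above can be sketched in a few lines of Python. This is an illustration only: the three small reference lists stand in for the 10 lists named in the Supplemental Material, and the stratum cut points (`low_max`, `medium_max`) are hypothetical, because the cut points used to form the three strata are not reported here.

```python
def prevalence_score(word, reference_lists):
    """Number of reference lists on which the word appears (one point per list)."""
    target = word.lower()
    return sum(1 for word_list in reference_lists if target in word_list)


def assign_stratum(score, low_max=2, medium_max=5):
    """Map a prevalence score to a stratum; both cut points are hypothetical."""
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    return "high"


# Placeholder lists; the study cross-referenced 10 curated word lists.
reference_lists = [
    {"drink", "ride", "crawl"},
    {"drink", "reach"},
    {"drink", "ride", "mistake"},
]

drink_score = prevalence_score("drink", reference_lists)
container_score = prevalence_score("container", reference_lists)
```

With 10 lists instead of three, the same function yields the 0 to 10 score range described in the text.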
Word selection for inclusion on the measures proceeded by randomly sampling words within the three prevalence strata, ensuring approximately 60% of the words were from the low-prevalence stratum and at least 20% were from the medium-prevalence stratum. This ratio was selected to increase the probability of a high number of unknown words at baseline while minimizing student frustration. Random selection of words was limited to two words per book. If a third word from a given book was randomly identified, an alternate word was randomly selected from the remaining books. Each of the first two words for Tell Me and WIS was intentionally selected from the high-frequency band to increase student engagement with the assessment. The final list of words for each measure was then reviewed to remove and replace words that were likely to produce spurious responding in the decontextualized assessment context, such as with homophones (sore vs. soar). The resulting word lists, definitions, and prevalence strata are presented in the Supplemental Material.
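A minimal sketch of this kind of stratified random sampling with a two-word-per-book cap follows. It is not the procedure as implemented in the study: the pool is synthetic, the 60/20/20 split assumes the remainder came from the high-prevalence stratum, alternates are drawn from within the same stratum, and the intentional placement of two high-prevalence words at the start of each measure is not modeled.

```python
import random


def sample_words(pool, n_items, shares, max_per_book=2, seed=0):
    """Randomly draw words per stratum while capping the words taken per book.

    pool: list of (word, book, stratum) tuples.
    shares: mapping of stratum name to the proportion of items drawn from it.
    """
    rng = random.Random(seed)
    per_book = {}
    chosen = []
    for stratum, share in shares.items():
        quota = round(n_items * share)
        candidates = [item for item in pool if item[2] == stratum]
        rng.shuffle(candidates)
        taken = 0
        for word, book, _ in candidates:
            if taken == quota:
                break
            if per_book.get(book, 0) >= max_per_book:
                continue  # book already at its cap; move on to an alternate word
            chosen.append((word, book, stratum))
            per_book[book] = per_book.get(book, 0) + 1
            taken += 1
    return chosen


# Synthetic pool: 20 books x 12 placeholder words, strata assigned in rotation.
STRATA = ("low", "medium", "high")
pool = [(f"word_{b}_{i}", f"book_{b}", STRATA[i % 3])
        for b in range(20) for i in range(12)]

# Assumed split; the text specifies only the low and medium proportions.
selected = sample_words(pool, n_items=20,
                        shares={"low": 0.6, "medium": 0.2, "high": 0.2})
```

Fixing the seed makes the draw reproducible, which is useful when a word list needs to be regenerated after a replacement.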
Selection of Scoring Metrics
Depending on the approach to scoring, vocabulary assessments for early elementary school age students can provide information on vocabulary knowledge as well as insight into other oral language skills such as syntax and morphology, which also have been found to be associated with later reading skills (Lervåg et al., 2018). Such metrics range from relatively simple to highly complex. Developmental Sentence Scoring (DSS; Lee & Canter, 1971), for example, is a complex scoring procedure for evaluating language development. A relatively simple alternative metric for scoring vocabulary knowledge is a 0- to 2-point scale, which is commonly used in standardized norm-referenced vocabulary measures (Wechsler, 2014). Because one of the goals of the current research was feasibility, we chose to adopt a 0- to 2-point scale where a 0-point score indicated no understanding, a 1-point score indicated partial understanding, and a 2-point score indicated complete understanding. Practically, assessors focused on identifying whether a response was completely irrelevant or incorrect (0-point response) or complete and accurate (2-point response), which by default allowed clearer identification of responses that reflected partial understanding (1-point response).
The 0- to 2-point scoring procedure did not consider grammar and syntax, but given both measures elicited short samples of oral language, we also explored a simple metric to serve as a proxy for oral language difficulties. For this purpose, we decided to also score the number of words in a given response. Although sentence or utterance length is not the only determinant of syntactic complexity, research suggests that longer sentences are typically more complex (Balthazar & Scott, 2015). To account for more coherent or accurate statements, we also developed a third metric in which the number of words was multiplied by the assigned rating.
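The three scoring metrics can be illustrated as follows. The sketch assumes that words are counted by splitting the transcribed response on whitespace, which is a simplification; the actual counting rules are defined in the scoring protocol. The example response is invented for illustration.

```python
def score_response(response, rating):
    """Return the three metrics for one item: rating, word count, adjusted count."""
    if rating not in (0, 1, 2):
        raise ValueError("rating must be 0, 1, or 2")
    # Assumption: whitespace tokenization stands in for the protocol's counting rules.
    word_count = len(response.split())
    return {
        "rating": rating,
        "word_count": word_count,
        "adjusted": word_count * rating,  # word count weighted by the rating
    }


# Invented example: a complete, accurate seven-word response rated 2.
example = score_response("I drink water when I am thirsty", 2)
```

Note that a response rated 0 always yields an adjusted count of 0, regardless of its length.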
Initial Field Test: User-Based Revisions
After identifying formats, words, and scoring procedures, preliminary versions of the measures were field tested in school settings with approximately 15 kindergarten and 15 first-grade students who had demographic characteristics similar to the participants of the planned pilot study. The field test included full administration of the measures, but did not include data collection. The field test identified several refinements to both the words used and the administration and scoring procedures for both Tell Me and WIS.
First, minor changes in the words were made. Words identified as phonetically similar but not identical were sometimes mistaken by younger children (e.g., three vs. tree). Although such mistakes are substantive from a language development perspective, phonological mistakes were considered to detract from the capacity to assess word knowledge as targeted. Phonetically similar words were replaced according to the random word sampling procedure described above. Second, word order within the list was modified. Words with similar meaning were separated, and in kindergarten, the order was changed to place more high- and medium-prevalence words at the beginning of the measure to prevent frustration. Third, administration directions for the kindergarten measure were modified to include two practice items, simplify the verbal instructions, and permit discontinuation of the response prompt after students understood the directions. Fourth, the scoring instructions were refined for each grade, using word count and qualitative scores with similar scoring directions for both measures. This process resulted in measures that were administratively similar across the two grades with minor variations to account for developmental differences between kindergarten and first-grade students.
Pilot Study Research Questions
After incorporating field test revisions, the Tell Me and WIS measures were used in a pilot study to evaluate their technical characteristics. The following research questions framed the pilot study:
Method
Participants and Context
Forty-two kindergarten and 43 first-grade students participated in the pilot study. For kindergarten, 48% of the students were male and 52% female. In first grade, 40% of the students were male and 60% female. With respect to race, the majority of the participants in both kindergarten and first grade were Black (68% and 63%, respectively) followed by Hispanic/Latino (14% and 19%, respectively). Between 0% and 9% of the sample were American Indian/Alaskan Native, Asian, Multi-racial, and/or White. The proportion of students whose first language was not English was 40% in kindergarten and 37% in first grade.
Participants were drawn from a total of eight kindergarten and first-grade classrooms in one public (K–8) and one charter (K–5) school within a large metropolitan area in the Upper Midwest. Demographic data for the two schools were similar. In both schools, approximately 30% of students were designated English Learners, 84% were eligible for free or reduced-price lunch, 14% received special education services, and 90% were from culturally diverse (non-White) backgrounds. Consistent with the purpose of developing a measure for use in assessing the vocabulary skills of young, at-risk students, data for this pilot study were collected from students who were eligible for a school-based vocabulary intervention based on performance below cut scores on grade-specific criterion measures.
Participants in this study were randomly assigned to intervention (19 in kindergarten; 21 in first grade) or control conditions. Vocabulary performance across groups was equivalent at pretest for both grades, but intervention provided approximately 65 min of support per week over 25 weeks, in which target words were taught in the context of repeated read-aloud texts (Biemiller & Boote, 2006). In this pilot study, only the sensitivity to growth analyses were potentially impacted by the dissimilar instructional experiences across groups. Thus, all data were combined across conditions for analyses, with the exception of exploratory analyses for sensitivity to growth that accounted for condition.
Measures
Researcher-developed measures
The final measure for kindergarten was referred to as “Tell Me,” and consisted of 22 words. Two practice items were administered, in which the assessor modeled the response and the student had the opportunity to practice responding. Using standardized directions, the assessor then verbally presented each of the first two test items, instructing the student, “Tell me everything you know about _____. What does the word _____ mean?” The measure took approximately 8 min per student and produced three scores for each student: (a) the 0 to 2 rating of understanding, (b) the simple response word count, and (c) the adjusted word count in which the simple count was weighted according to the understanding rating. The final measure for first grade was referred to as WIS, and it consisted of 20 words administered and scored similarly to Tell Me, with the exception of the response prompt that instructed students to, “Tell me the best sentence you can using the word _____. Tell me a sentence with the word _____.”
Criterion measures
The criterion measure for kindergarten was the Individual Growth and Development Indicators version 2.0, Picture Naming (IGDIs-PN). This measure is designed to provide information about vocabulary risk for students of age 3 to 5 years, and included fall and spring forms, each with 15 pictures that were selected via an item response theory framework. This study used the higher, more inclusive, IGDIs-PN risk level (scores of 10 or lower) for eligibility. The IGDIs-PN was administered in a one-on-one setting, in which the administrator used an easel picture display and asked students to verbally identify each picture. Administration time was typically less than 1 min. Item-level technical characteristics were adequate and criterion-related validity coefficients were 0.70 to 0.72 with broad tests of vocabulary like the PPVT (Bradfield et al., 2014).
The criterion measure for first grade was the 4,000 Word Listening Test. This measure is designed for first- through fourth-grade students and included 40 panels of four black and white line drawings of vocabulary concepts, one of which served as the target concept. Students identified the target word stated verbally by the adult administrator and recorded responses in a booklet, in which each page had two sets of panels of four pictures. This study used a publisher-recommended, normative risk level (score of 24 or lower) for eligibility. Administration occurred in groups of four to eight students, although some individual administration occurred for absent students, and took less than 30 min on average. Coefficient alpha was .79, and criterion-related validity was 0.63 with the Gates-MacGinitie Vocabulary Test (Graves & Sales, 2009).
Procedure and Assessment Administration Integrity
Data collection for all measures occurred at the beginning and end of the school year, with additional criterion measure data collection in January. Data collectors were research staff with graduate degrees in educational psychology trained by the second author. For each measure, administration fidelity was assessed by having a trained secondary assessor observe at least 30% of the administrations for each primary assessor using an assessment fidelity checklist. Across all co-observed administrations, the percentage of correctly administered steps was 99.2% for kindergarten and 99.8% for first grade.
Results
Descriptive statistics on experimental and criterion measures for kindergarten and first-grade students are presented in Tables 1 and 2. On the criterion measures, fall pretest mean scores were 7.24 (eligibility cut score was 10) for kindergarten students and 20.56 (eligibility cut score was 26) for first-grade students, indicating that, on average, students were below the measure-specific cut scores and therefore at risk for poor vocabulary outcomes as expected per the pilot study selection criteria. Subsequent mean scores on the criterion measures were higher than the pretest mean scores for both grades, although kindergarten mean scores decreased slightly from the January testing period to spring posttest.
Table 1. Descriptive Statistics for IGDIs and Tell Me for Kindergarten Cohort.
Note. N = 42. IGDIs = Individual Growth and Development Indicators.
Posttest scores that are significantly higher than pretest scores at p < .05.
Table 2. Descriptive Statistics for 4,000 Words and WIS for First-Grade Cohort.
Note. N = 43. WIS = Words into Sentences.
Posttest scores that are significantly higher than pretest scores at p < .05.
Inter-Rater Reliability
Inter-rater reliability data were collected at pretest and posttest by having two independent assessors score each protocol. Agreement for word count was greater than 99% across all administrations. For the 0 to 2 qualitative ratings, the scores assigned to each item by the independent rater were coded as either an agreement or disagreement to produce a percent agreement index. Percentage agreement for pretest ratings was 95% for kindergarten ratings and 92% for first-grade ratings. For posttest ratings, percentage agreement was 88% and 86%, respectively, for kindergarten and first grade. In addition, two independent graduate students in school psychology scored each item and their responses were significantly correlated with the study data collector ratings, r = .96 to .98 (p < .001).
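The percent agreement index described above reduces to a simple item-by-item comparison. The sketch below uses invented ratings for ten items; it is not the study data.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of items on which two raters assigned the same 0 to 2 rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * agreements / len(rater_a)


# Invented ratings for ten items; the raters disagree on the fifth item only.
primary = [2, 1, 0, 2, 1, 0, 0, 2, 1, 1]
secondary = [2, 1, 0, 2, 0, 0, 0, 2, 1, 1]
agreement = percent_agreement(primary, secondary)
```

Percent agreement does not correct for chance agreement, which is one reason the correlational check reported in the same paragraph is a useful complement.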
Internal Consistency
Internal consistency of pretest rating and word count metrics was analyzed for each measure using Cronbach’s (1951) alpha coefficient. Results are not reported for the adjusted word count because those data were a direct transformation of the word count metric based on ratings. Cronbach’s alpha coefficients for the entire set of items were .87 and .94 for the Tell Me rating and word count metrics, respectively. For WIS, the rating and word count alpha coefficients were .84 and .92, respectively. Item-total correlations tended to vary by item, but were consistently higher for the word count metric. For both Tell Me and WIS, two items had zero or very low item-total correlations (r < .05). Cronbach’s alpha did not improve by more than .03 when removing items with lower correlations. Posttest internal consistency results are not reported due to the potential influence of intervention conditions.
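For readers less familiar with the coefficient, Cronbach's alpha for k items is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the total scores. The sketch below implements that standard formula on an invented matrix of ratings (five students by four items); it does not reproduce the study data or the software used for the reported analyses.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student score lists (one score per item)."""
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("each student needs scores on the same two or more items")
    item_columns = list(zip(*scores))
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Invented 0 to 2 ratings: rows are students, columns are items.
ratings = [
    [2, 1, 2, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 2],
    [2, 2, 2, 1],
    [0, 1, 0, 0],
]
alpha_demo = cronbach_alpha(ratings)
```

Sample variances are used throughout; using population variances for both the items and the totals would give the same coefficient because the ratio is unchanged.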
Sensitivity to Growth
As shown in Tables 1 and 2, kindergarten and first-grade mean vocabulary scores increased descriptively across assessment periods. The increases between pretest and posttest periods on each measure were analyzed using within-subjects analyses of variance (ANOVAs) for the three scoring metrics within each grade. Alpha levels for statistical significance were adjusted for the presence of multiple dependent variables using the Bonferroni procedure (p = .05/3 = .017). Results for Tell Me indicated that changes in within-subject scores across time periods were positive and statistically significant for rating, F(1, 40) = 82.08, p < .01; total words, F(1, 40) = 17.91, p < .01; and words adjusted, F(1, 40) = 27.10, p < .01. For WIS, within-subject scores improved for rating, F(1, 41) = 15.15, p < .01, and words adjusted, F(1, 41) = 9.12, p < .01, but did not significantly change for total words, F(1, 41) = 5.73, p = .02.
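The Bonferroni adjustment explains why the WIS total words result (p = .02) is treated as nonsignificant even though it falls below the conventional .05 level. The arithmetic can be sketched as follows; the function name is illustrative, and the only values taken from the text are the .05 family-wise level, the three metrics, and the reported p value.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance threshold under the Bonferroni procedure."""
    return family_alpha / n_tests


# Three scoring metrics were tested within each grade.
threshold = bonferroni_threshold(0.05, 3)

# p value reported in the text for the WIS total words metric.
wis_total_words_p = 0.02
wis_total_words_significant = wis_total_words_p < threshold
```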
Given the presence of treatment conditions, in which a subset of students in both grades received supplemental vocabulary instruction, we conducted exploratory analyses to determine the degree to which growth differed across groups. These analyses added an interaction term of treatment by time to the ANOVAs. For Tell Me, a significant interaction effect was observed for the rating metric, F(1, 40) = 10.52, p < .01, but not for total words or words adjusted. No significant interactions were observed for WIS.
In addition to analyses of growth, we examined word-level data at pretest to understand the extent to which sampling across prevalence bands promoted variation in student scores with minimal floor or ceiling effects. Of the 22 words on Tell Me, only one word, “drink,” received a pretest rating of 2 by over half of the students. Three words, “ride,” “mistake,” and “container” received a rating of 2 by at least 25% of students. The remaining 18 words received a rating of 2 by fewer than 25% of the students, supporting our word selection and sampling procedure. On the WIS measure, seven of the 20 words received a pretest rating of 2 by at least 25% of the students, with two words (“crawl” and “reach”) receiving a rating of 2 by at least 50% of the students. This means that 13 of the 20 words received a rating of 2 by fewer than 25% of the students. Thus, both measures appeared to have minimal floor or ceiling effects at pretest.
Criterion-Related Validity
Criterion-related validity of Tell Me and WIS was examined by correlating each of the scoring metrics (Rating, Total Words, and Words Adjusted at pretest and at posttest) with the criterion measures administered at pretest, midyear, and posttest. Correlations between Tell Me, WIS, and the grade-specific criterion measures are provided in Table 3. For Tell Me, the strongest correlations with the IGDIs-PN, across all time periods, were for the 0 to 2 rating score. The Tell Me rating correlated moderately with IGDIs-PN administered concurrently (r = .46 at pretest; r = .51 at posttest). Correlations between the Tell Me rating at pretest and the IGDIs-PN measure administered at midyear and at posttest also were moderate (r = .50 and .53, respectively). Correlations for the Words Adjusted score with IGDIs-PN were significant, but low across time points (.34 < r < .39). The correlation for the Total Words score with the IGDIs-PN was not significant at pretest (r = .22) and was significant but low at posttest (r = .32).
Table 3. Correlational Validity Coefficients for Tell Me and WIS With Criterion Measures.
Note. WIS = Words into Sentences. IGDIs = Individual Growth and Development Indicators. *p < .05. **p < .01.
A similar pattern was evident for WIS. Although nonsignificant-to-weak correlations were observed for the word-count metric, stronger correlations were found between the 4,000 Words measure and the 0 to 2 rating score. Correlations were not significant with the 4,000 Words at pretest; however, the correlations of the pretest rating score with the 4,000 Words at midyear and at posttest were moderate (r = .42 and .53, respectively). The correlation between the spring rating score and the 4,000 Words at posttest was strong (r = .73). Correlations between the fall 4,000 Words scores and the winter and spring 4,000 Words scores were not significant, but the correlations between the winter and spring 4,000 Words scores were moderate (r = .49, p = .001).
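The validity coefficients in this section are correlations between a scoring metric and a criterion score for the same students. Assuming these are Pearson product-moment correlations (the text reports r but does not name the coefficient), the computation can be sketched as below. The paired scores are invented and chosen only to make the behavior of the function easy to check.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired lists with at least two scores each")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Invented paired scores: summed ratings and criterion scores for five students.
rating_totals = [4, 9, 12, 15, 20]
criterion_scores = [5, 6, 9, 8, 12]
r_demo = pearson_r(rating_totals, criterion_scores)
```

The function raises a division error if either list has no variance, which in practice signals a floor or ceiling effect on that measure.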
Discussion
Implications for Developing Instructionally Relevant Vocabulary Measures
A prerequisite for using data to inform educational decisions is that the assessments must produce technically adequate information (Salvia et al., 2017). Overall, the technical characteristics of the researcher-created vocabulary measures in this study were promising despite several areas for improvement. Data from both Tell Me and WIS demonstrated significant growth across scoring metrics, acceptable internal consistency, and significant correlations with criterion measures of vocabulary. Among scoring procedures, count-based metrics tended to produce stronger internal consistency but correlated less strongly with validated measures of vocabulary. By contrast, with the exception of first-grade concurrent criterion-related validity coefficients, the rating metric correlated better with other vocabulary measures, indicated significant growth, and had adequate internal consistency.
With respect to feasibility, it appeared that data from this pilot study could be accurately collected and scored for kindergarten and first-grade students identified as eligible for a vocabulary intervention, at least for the sample of students in our study. This suggests some merit in our considerations for the context of assessment, the students to be assessed, the assessors themselves, and the resulting decisions regarding the format, content, and scoring metrics. For example, from a logistical perspective, it is likely important within school settings to consider the typical length of administration time, which was about 8 min for each measure. Additional administration time might be necessary for assessing additional facets of vocabulary, and therefore, the potential instructional utility—and technical adequacy—of the resulting data would be an important consideration. Relatedly, our decision to slightly vary the format between the two measures to facilitate responding for beginning kindergarten students was supported at least in part by the feasibility and technical adequacy data. However, given no alternative administration approaches were tested, additional research could investigate the impact of various formats and scoring approaches on the practical and technical characteristics of the assessments. Similar considerations of the assessment context and students also would be necessary for different age groups and for assessing additional aspects of the broader vocabulary construct, such as assessing aspects of language development with prekindergarten students (Bornstein & Haynes, 1998).
Another consideration for future measure development would be to continue to refine the content selection for similar approaches to vocabulary assessment. Generalization is an important issue in vocabulary assessment, and evidence should ideally show that student vocabulary skills improve beyond instructed words (Nelson & Stage, 2007). In the context of the current measures, one approach to address generalized word learning could be to assess words from the books for which the children did not receive explicit instruction. Inclusion of these words might provide an indication of what words children learn from simple exposure to the word in a story as opposed to explicit instruction (Kelley & Goldstein, 2015). This approach could inform changes to the intervention that might promote a general interest in words and their meanings, or what has been termed “word consciousness” (Anderson & Nagy, 1992).
Regarding scoring procedures, one potential explanation for the relatively more promising data from rating metrics is that ratings better assess conceptual understanding, which is a defining feature of the vocabulary construct (Stahl & Nagy, 2006). Despite that potential advantage, the rating score produced slightly lower internal consistency relative to the word count metric and also had limited range. Ratings also do not assess other elements of language. Future research could therefore continue to explore the value of the count-based or other language-based scoring metrics. Such metrics serve as indicators of students’ syntactic skill and could serve as a basic indicator of growth in general word consciousness. Perhaps, over a longer period of study, these metrics would be better suited to evaluating the hypothesis that as children become more interested in words and word meanings, they may become more comfortable and fluent in talking about words and word meanings (Anderson & Nagy, 1992; Beck & McKeown, 2007).
Limitations and Future Research
It is important to acknowledge that the data in each grade were collected from two groups of students, one that received intervention and another that did not. Pretest data on all vocabulary measures are unaffected by this issue, and data for inter-rater reliability are unlikely to be meaningfully impacted. Exploratory analyses indicated greater improvement between testing periods on the kindergarten rating metric for treatment students relative to control students, but otherwise nonsignificant differences were observed between groups. Given these findings for growth, it can be concluded that these assessments have promise with respect to a subset of conventional technical adequacy indicators, including sensitivity to growth overall, but future research should investigate the sensitivity of these assessments to differences in improvement within and across distinct groups of students and instructional contexts.
An additional set of limitations relates to the characteristics and purpose of the Tell Me and WIS measures themselves. First, the characteristics of the initial pool of words, and aspects of the words selected for assessment, such as their difficulty or prevalence, could impact their technical characteristics. For example, the practical decision to place more prevalent words at the beginning of the assessment likely impacted split-half reliability. In addition, this study does not inform how these measures would perform for other purposes of assessment, such as monitoring shorter-term progress or screening for vocabulary risk (Coyne et al., 2015). Additional research would be necessary to understand the technical characteristics of more frequent assessments using alternative forms (Fuchs, 2004). Regarding screening, data from the current measures require an interpretive framework. Either normative or criterion-based referents would need to be developed and tested to inform the degree to which students are at risk (Glover & Albers, 2007). The way such frameworks apply to local contexts would also need to be explored, following researcher recommendations for other types of local benchmarking (e.g., Patton et al., 2014).
Other limitations relate to the fact that the pilot study employed a relatively small sample with a high proportion of English Learner (EL) students. Given the sample size of the pilot study, we were unable to report disaggregated data for EL and non-EL students, and the findings must be interpreted in light of the fact that a considerable portion of the sample had low vocabulary skills and were also identified as needing English learning services. Reading research has found that effective interventions for EL students have similar instructional characteristics to those used with the general population of learners (Cheung & Slavin, 2012), yet researchers recognize that EL students develop vocabulary skills in a more complex language context (Hoff et al., 2012). Knowing the degree to which at-risk native and non-native English-speaking students learn words during vocabulary intervention is undoubtedly useful information for educators, but additional research is necessary to determine the role of language within vocabulary intervention.
Other limitations to the technical adequacy results concern the criterion measures themselves. Both measures were used at the edge of their intended age ranges. Although descriptive data suggested minimal ceiling or floor effects, the 4,000 Words test may have been difficult for at-risk urban first-grade students assessed immediately in the fall of the school year. For example, high guess rates during fall administration could explain why fall scores on the 4,000 Words test were not correlated with scores from later administrations of the measure.
In addition, the 4,000 Words test and the IGDIs Picture Naming test assess limited aspects of vocabulary—receptive understanding and expressive labeling, respectively. The researcher-developed measures in this study produced significant correlations with these assessments, but using the measures to make interpretations about general vocabulary knowledge is inappropriate because of their inherent proximity to the treatment (Slavin & Madden, 2011). Such a limitation was acknowledged a priori in establishing the purpose of the measures, which focused on facilitating better practical decisions about young student vocabulary acquisition during intervention. However, additional research could be conducted to understand how, if at all, data from measures that are proximal to an intervention permit broader interpretations of overall vocabulary.
From a practice perspective, our measure development process required extensive time and resources, which may limit the degree to which educators can follow a similar process within their specific contexts. Researchers may be better able to engage in a thorough measure development process; however, some aspects of the measure development process could be streamlined. For instance, identifying word prevalence was largely an objective process, as words were either on lists or not. Establishing word prevalence could thus be efficiently accomplished for similar-aged students. In addition, the administration and scoring procedures are consistent with, and contribute to, an existing research base (Boomer & Stewart, 2016; Kaminski et al., 2013, 2014). Some consensus seems to exist regarding the administration and scoring procedures designed to assess vocabulary skills with younger students and for similar purposes. Thus, limitations with respect to the feasibility of our measure development process may be most relevant for older age groups and if educators are interested in assessing alternative facets of the vocabulary construct (Pearson et al., 2012).
Conclusion
Given the importance of vocabulary to overall reading proficiency (Lesaux et al., 2010; Stahl & Nagy, 2006), it is imperative to have better assessment data to inform instructional decisions (Pearson et al., 2012). Documenting the development, feasibility, and technical adequacy of instructionally relevant vocabulary measures has the potential to advance both research and practice. This study found some promise for a measure development process that considered format, word selection, and scoring metrics. Metrics that accounted for conceptual understanding produced relatively more promising data than count-based metrics, yet given the constraints of the measures themselves, additional research is warranted.
Supplemental Material
Supplemental material, Appendices_1 for Development and Technical Adequacy of Instructionally Relevant Vocabulary Measures for Young Students by David C. Parker, Lisa H. Stewart, Susan Thomson and Ruth A. Kaminski in Assessment for Effective Intervention
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was completed with support from a grant from the Social Innovation Fund of the Corporation for National and Community Service.