Abstract
Spelling is a foundational literacy skill that supports both word reading and written expression. For students with or at risk of a learning disability (LD), difficulties in spelling often constrain the fluency and complexity of writing, making effective interventions essential. Yet, the conclusions drawn about intervention efficacy depend heavily on how outcomes are measured. This review synthesizes outcome measurement practices across 59 spelling intervention studies conducted over the past five decades. All outcome measures (n = 228) were coded by type (researcher-developed vs. norm-referenced) and by linguistic level (sublexical, lexical, sentence, discourse) using the Interactive Dynamic Literacy (IDL) framework. Descriptive analyses revealed that roughly seven in ten outcomes were lexical, most often researcher-developed lexical-level spelling probes, with comparatively few outcomes at the sentence or discourse levels. Standardized assessments were similarly concentrated at the word level, with the Wide Range Achievement Test–Spelling subtest and Test of Written Spelling most commonly used. Finally, the pairing of proximal and standardized outcomes was inconsistent, particularly among group designs. Taken together, findings highlight a measurement bottleneck: spelling interventions are evaluated primarily through lexical-level accuracy, offering limited insight into whether gains transfer to higher-level writing processes for students with or at risk for LD.
Despite the availability of spell-check and predictive text, spelling remains a critical academic skill that underpins literacy development. Accurate and automatic spelling reduces transcriptional demands, which allows students to devote more cognitive resources to planning, idea generation, and organization in writing (Berninger et al., 2002; Graham et al., 2016). Well-specified word forms also contribute to efficient word recognition, linking spelling to broader literacy outcomes (Ehri, 2000). For students with or at risk for learning disabilities (LD), difficulties in spelling often constrain the length and complexity of their written products, which underscores the importance of effective spelling instruction and assessment (Graham, 1997). Recent meta-analytic evidence indicates that spelling interventions produce meaningful gains, with moderate effects on both spelling (g = 0.33) and word reading (g = 0.28) for students with or at risk of LD (Chandler et al., 2025). However, questions remain about how spelling performance is assessed. The present review addresses this issue by examining how outcome measures in spelling intervention research capture different levels of language and the extent to which current assessment practices reflect models of literacy development.
Theoretical Foundations to Guide Spelling Assessment
Spelling is best understood not as a single skill but as an integration of phonological, orthographic, and morphological knowledge (Apel & Masterson, 2011). From this perspective, assessment should move beyond global accuracy scores (i.e., a simple count or percentage of words spelled correctly) to capture the linguistic strategies students use when spelling. A global score offers no insight into why an error occurred: whether the student can represent phonemes with graphemes, apply orthographic conventions, or use morphological relationships to generate correct spellings. Theory-aligned assessment is important not only for documenting growth but also for guiding instruction, because educators need to know which linguistic components are changing in order to target instruction effectively. This approach aligns with broader models of literacy development.
The interactive dynamic literacy (IDL) model (Kim, 2020), for example, organizes literacy according to linguistic grain size—sublexical, lexical, sentence, and discourse—and posits hierarchical, interactive, and dynamic relations among levels. Within this model, spelling is situated at the lexical level, constrained by sublexical processes (e.g., phoneme–grapheme mapping, handwriting fluency) and expected to exert upward influence on sentence fluency, discourse-level writing, and reading comprehension. Critically, the IDL underscores that observation of these relations depends on how outcomes are measured—different task formats and scoring dimensions can differentially tap component skills. To test this predicted cascade, intervention studies require outcome measures that are both theory-guided and transfer-sensitive (sentence and discourse-level writing). In this way, theory-driven assessment becomes the bridge between intervention design and instructional decision-making. It tells us not only whether interventions improve spelling of taught sublexical skills or words, but whether those improvements generalize into broader literacy outcomes.
The Challenge of Skill Transfer for Students With LD
For students with LD, the issue of generalization and transfer is especially relevant. Decades of research in special education have documented that these learners often demonstrate improvement on skills taught in isolation but struggle to apply those skills flexibly across tasks, contexts, or levels of language (Gersten et al., 2000; Swanson, 2014). In spelling, this means that gains observed on proximal tasks, such as researcher-developed word lists, may not translate into improved sentence construction or discourse-level writing. Insights from learning sciences similarly emphasize that transfer requires instruction and assessment conditions that mirror the complexity of authentic tasks (Bransford & Schwartz, 1999). In other words, if the goal is for students to apply spelling knowledge in sentence construction, content-area writing, or extended composition, then our outcome measures must likewise move beyond isolated word lists to capture performance within connected, meaningful writing tasks. If outcome measurement in intervention studies remains narrowly focused on lexical-level accuracy, we risk overstating the practical impact of interventions for learners who need the most support. For these students, theory-guided and transfer-sensitive assessments are essential to determine not just whether spelling instruction works, but whether it meaningfully improves their capacity to produce fluent, accurate, and complex writing.
Spelling Outcome Measurement
Despite nearly five decades of spelling intervention research, we know relatively little about whether spelling interventions produce gains that generalize beyond the tasks on which students are directly taught and tested. Most studies report outcomes on proximal lexical-level measures (e.g., Bazis et al., 2022; Darch et al., 2000; Graham & Freeman, 1986), which are sensitive to instruction but cannot establish whether improvements transfer to broader literacy tasks such as sentence construction or discourse-level writing. Standardized assessments, when used, are similarly concentrated at the word level and rarely capture writing performance (e.g., Ehri et al., 2009; D. C. Simmons et al., 2007). Moreover, it remains unclear how often researchers combine proximal and standardized outcomes to evaluate both sensitivity and generalizability. This narrow approach to outcome measurement creates a critical blind spot in spelling intervention research. Positive intervention effects may reflect student performance on tightly aligned lexical-level tasks, but do these interventions support broader skill development and improvements in the authentic writing abilities that matter most for students with LD?
Purpose of the Present Review
The purpose of this review is to examine how outcome measures have been used in spelling intervention research over the past five decades. Specifically, we ask: What are the characteristics of outcome measures used in spelling intervention studies? To address this question, we consider (a) the overall landscape of assessment practices; (b) the standardized, norm-referenced assessments most frequently used across this body of research; and (c) the extent to which studies report both proximal and standardized measures to evaluate intervention effects. We situate this review within Apel and Masterson’s (2011) call for theory-guided spelling assessment and Kim’s (2020) IDL model, recognizing that assessments must reflect the multidimensional nature of spelling and capture how improvements in spelling transfer across multiple levels of language.
In this manuscript, we use the terms sublexical and lexical to refer to sub-word-level and word-level processes, respectively, consistent with the terminology of the IDL model. When discussing outcome measures, these terms describe the linguistic grain size targeted by an assessment rather than the format or complexity of the task.
Method
The present review reports a secondary descriptive analysis of outcome measurement practices drawn from a previously published meta-analysis of spelling interventions for students with or at risk for LD (Chandler et al., 2025). That meta-analysis synthesized intervention effects across spelling, reading, and writing outcomes. In contrast, the current manuscript focuses specifically on how outcomes were measured, including the types of assessments used and the linguistic levels they targeted. No new studies were identified for the present analysis, and no additional data extraction beyond outcome measurement characteristics was conducted.
Literature Search and Study Identification Procedures
The dataset for this review was derived from a comprehensive systematic search conducted for the original meta-analysis. Studies were identified through electronic database searches of PsycINFO, ERIC, Academic Search Complete, and Education Source for research (including dissertations and theses) published in English between 1975 and December 2022. This date range was selected to capture all spelling intervention research conducted since the passage of the Education for All Handicapped Children Act of 1975 (now the Individuals with Disabilities Education Act [IDEA]).
Search terms were developed with assistance from a university-based librarian. The four electronic databases (i.e., PsycINFO, ERIC, Academic Search Complete, Education Source) were searched with the following terms: spelling or orthograph* AND “learning disab*” OR “learning disorder*” OR “learning diff*” OR “reading disability*” OR “reading disorder*” OR “reading diff*” OR dyslexi* OR (read* n1 strugg*) OR (read* n1 slow) OR (read* n1 delay*) OR “writing disabil*” OR “writing disorder*” OR “writing diff*” OR dysgraphia OR “special needs” OR “students with disabilities” OR “mild handicap*” OR “poor spellers” AND intervention OR strategy OR strategies OR treatment OR approach OR supplemental OR “pull out” OR “small group*” OR remedia* OR differentiat* OR program OR curricul* OR lesson OR teaching method OR instruction OR training. This initial search, conducted in January 2023, returned 3,916 results.
Screening and Eligibility Criteria
After removal of duplicate records (n = 926), 3,012 abstracts were screened using the Covidence systematic review platform. Abstract screening was completed independently by the first author and a doctoral student, with 95% agreement and discrepancies resolved through discussion. Studies were eligible for inclusion if they met the following criteria:
At least 50% of the participants were K–12 students (i.e., ages 5–18) with or at risk of LD. Students with or at risk of LD had to be identified either through researcher-administered screening procedures or school-based identification procedures (e.g., identified as a student with a disability under IDEA, or having an Individualized Education Program [IEP]).
The study reported at least one primary outcome related to spelling or reading (e.g., word reading, oral reading fluency, orthographic learning).
The intervention had to primarily target spelling proficiency, with at least 50% of the activities dedicated to spelling instruction or practice in at least one treatment condition. This criterion was assessed based on dosage, as reported by the authors in terms of time allocated to spelling tasks.
The intervention was delivered in English.
The study employed a randomized controlled trial (RCT), quasi-experimental design (QED), or single case design (SCD). We defined an RCT as any study in which participants were randomly assigned to one of at least two groups, with at least one receiving an intervention and one serving as a contrasting comparison condition (i.e., control). A QED was defined as any study comparing outcomes between at least one group of participants who received an intervention and at least one group serving as a contrasting control condition; however, group membership did not need to be determined through random assignment. For an SCD to be included, it had to have at least three opportunities for a demonstration of intervention effect at a minimum of three different points in time between adjacent conditions. Accepted SCDs included withdrawal/reversal designs (i.e., ABAB) and multiple baseline or probe designs. SCDs with fewer than three demonstrations of effect (e.g., AB, ABA designs) were not included. Alternating treatment, adapted alternating treatment, changing criterion, and multielement SCDs were excluded unless there was a true baseline prior to intervention and a control condition throughout the experiment. Without such features, these designs cannot yet be used to calculate SCD effect sizes.
The study included sufficient data to calculate effect sizes (e.g., means and standard deviations, F, or t values).
Backward and forward reference searches, ancestral reviews of prior syntheses, and targeted hand searches of prominent special education and literacy journals were also conducted as part of the original review process.
Full-Text Review
We retrieved full-text articles for all studies deemed eligible following abstract screening (n = 229). Five studies could not be retrieved despite attempts to contact authors, resulting in 224 studies advancing to full-text review; all unretrievable studies were dissertations. Prior to full-text review, author team members completed a 2-hour training focused on inclusion criteria and sample studies. Each article was then independently reviewed by two team members.
To promote consistency and reliability, the team met after reviewing the first 20 articles and again after the next 100 articles to resolve discrepancies and refine inclusion criteria language. These refinements clarified what constituted an explicit spelling intervention and distinctions among intervention types, without altering substantive eligibility criteria. Following full-text review, disagreements were resolved through discussion until consensus was reached. Interrater reliability at this stage was high (93%).
Of the studies reviewed, 165 were excluded for failing to meet one or more eligibility criteria, including ineligible study design (n = 66), fewer than 50% of participants identified with or at risk for LD (n = 37), intervention delivered in a language other than English (n = 22), intervention not primarily focused on spelling (n = 20), insufficient data for effect size calculation (n = 10), wrong publication type (n = 6), publication in a non-English language (n = 2), or absence of a relevant spelling or reading outcome (n = 2). A total of 59 studies met all inclusion criteria and were retained for analysis.
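The full-text review counts reported above can be cross-checked with simple arithmetic. The following is only a bookkeeping sketch that uses the figures stated in the text; the variable names and category labels are illustrative and are not drawn from the original review's materials.

```python
# Counts reported for the full-text review stage (taken from the text above).
full_text_sought = 229
not_retrieved = 5

# Exclusion reasons and their reported counts.
exclusions = {
    "ineligible study design": 66,
    "fewer than 50% of participants with or at risk for LD": 37,
    "intervention not delivered in English": 22,
    "intervention not primarily focused on spelling": 20,
    "insufficient data for effect size calculation": 10,
    "wrong publication type": 6,
    "published in a non-English language": 2,
    "no relevant spelling or reading outcome": 2,
}

full_text_reviewed = full_text_sought - not_retrieved
total_excluded = sum(exclusions.values())
included = full_text_reviewed - total_excluded

print(full_text_reviewed, total_excluded, included)  # 224 165 59
```

The itemized exclusion reasons sum to the reported 165 exclusions, and the remainder matches the 59 included studies.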
Outcome Measure Coding for the Present Review
All outcome measures reported in the 59 included studies were coded using a structured codebook developed for the original meta-analysis. For the purposes of the present review, coding focused specifically on measurement characteristics, including type of measure and linguistic level assessed. First, outcomes were categorized by type of measure as either researcher-developed (proximal) assessments or norm-referenced (standardized) assessments with published reliability and validity evidence. When a study included both types, outcomes were coded accordingly. Second, outcomes were coded by linguistic level, guided by the IDL model (Kim, 2020). Sublexical outcomes included measures of phonological awareness, sound–spelling correspondences, orthographic knowledge, handwriting, or morphological awareness. Lexical outcomes captured word-level spelling, word reading, pseudoword decoding, and vocabulary. Sentence-level outcomes required application of spelling within connected text (e.g., sentence dictation or sentence writing), and discourse-level outcomes included paragraph- or passage-level writing, oral reading fluency, and reading comprehension.
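The two coding dimensions described above can be pictured as a small lookup structure. The sketch below paraphrases the construct-to-level assignments from this paragraph; the `Level` enum, the `CONSTRUCT_LEVEL` dictionary, and the `code_outcome` function are hypothetical names introduced for illustration and do not reproduce the actual codebook.

```python
from enum import Enum


class Level(Enum):
    """Linguistic grain sizes named in the IDL model."""
    SUBLEXICAL = "sublexical"
    LEXICAL = "lexical"
    SENTENCE = "sentence"
    DISCOURSE = "discourse"


# Construct-to-level mapping paraphrased from the coding description.
# Keys are illustrative labels, not entries from the original codebook.
CONSTRUCT_LEVEL = {
    "phonological awareness": Level.SUBLEXICAL,
    "sound-spelling correspondences": Level.SUBLEXICAL,
    "orthographic knowledge": Level.SUBLEXICAL,
    "handwriting": Level.SUBLEXICAL,
    "morphological awareness": Level.SUBLEXICAL,
    "word-level spelling": Level.LEXICAL,
    "word reading": Level.LEXICAL,
    "pseudoword decoding": Level.LEXICAL,
    "vocabulary": Level.LEXICAL,
    "sentence dictation": Level.SENTENCE,
    "sentence writing": Level.SENTENCE,
    "paragraph or passage writing": Level.DISCOURSE,
    "oral reading fluency": Level.DISCOURSE,
    "reading comprehension": Level.DISCOURSE,
}


def code_outcome(construct: str, norm_referenced: bool) -> tuple[str, str]:
    """Return (measure type, linguistic level) for one outcome measure."""
    measure_type = "norm-referenced" if norm_referenced else "researcher-developed"
    return measure_type, CONSTRUCT_LEVEL[construct].value


print(code_outcome("sentence dictation", norm_referenced=False))
# ('researcher-developed', 'sentence')
```

Keeping measure type and linguistic level as separate fields mirrors the description above, in which each outcome receives one code on each dimension.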
Coding was completed by trained doctoral students and the first author. All studies were double coded, and discrepancies were resolved through discussion until consensus was reached. Interrater reliability for measurement-related variables was high, with mean agreement of 97%, consistent with reliability reported in the original meta-analysis.
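The text reports agreement as a percentage but does not specify the formula. One common reading is item-by-item percent agreement, sketched below under that assumption. The function name and the five example codes are hypothetical and are not data from the review.

```python
def percent_agreement(coder_a: list[str], coder_b: list[str]) -> float:
    """Percentage of items on which two coders assigned the same code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must rate the same non-empty set of items.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)


# Hypothetical linguistic-level codes for five outcome measures.
coder_a = ["lexical", "lexical", "sublexical", "discourse", "lexical"]
coder_b = ["lexical", "lexical", "sublexical", "sentence", "lexical"]

print(percent_agreement(coder_a, coder_b))  # 80.0
```

Percent agreement does not correct for chance agreement; a chance-corrected index would be a different calculation from the one shown here.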
Data Analytic Approach
To explore the landscape of assessment practices in spelling intervention research, we summarize the distribution of outcome measures by type of measure (i.e., researcher-developed, norm-referenced, or both) and linguistic level (i.e., sublexical, lexical, sentence, discourse). We also examine patterns by study design—group versus SCD studies—given their differing traditions of outcome measurement. In addition, we catalog the standardized assessments most frequently used across studies. Finally, we consider the alignment of outcomes with theoretical expectations from the IDL model, with particular attention to whether outcome batteries captured potential transfer from lexical-level spelling to sentence- and discourse-level written expression.
Results
Across 59 studies, a total of 228 outcome measures were coded. Table 1 summarizes the measure characteristics by study, and Table 2 summarizes the distribution by type of measure and linguistic level. The vast majority of outcomes (n = 164; 71.9%) were at the lexical level, such as word-level spelling accuracy, word identification, vocabulary, and pseudoword decoding tasks. Most of these measures were researcher-developed (n = 112; 68.3%), and fewer were norm-referenced standardized tools (n = 52; 31.7%). Outcomes at the sublexical level (e.g., phonological awareness, orthographic knowledge, handwriting, morphological awareness) were less common (n = 38; 16.7%). Within this small set, researcher-developed measures (n = 25; 65.8%) outnumbered standardized measures (n = 13; 34.2%).
Table 1. Spelling Outcome Measure Characteristics.
Note. • = Norm-referenced measure; ◽ = Researcher-created measure; ⦿ = Both types of measures administered.
Table 2. Distribution of Spelling Outcome Measures Across Type and Linguistic Levels.
Note. Percentages within columns represent proportions of measures at each level of language. “% of All Measures” reflects the proportion of the total 228 coded outcome measures across all 59 studies.
Assessment of sentence-level outcomes was particularly limited. Only six measures were identified (2.6% of all outcomes), with five norm-referenced tasks (83.3%) and one researcher-developed measure (16.7%). Similarly, discourse-level outcomes were scarce (n = 20; 8.8%). These included paragraph or passage writing tasks, oral reading fluency, and reading comprehension measures. Most of these measures were researcher-developed (n = 14; 70%), with six standardized (30%).
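The counts reported in the Results can be tallied to reproduce the stated percentages. The sketch below uses only the counts given in the text; the dictionary layout and variable names are illustrative.

```python
# (researcher-developed, norm-referenced) counts per linguistic level,
# as reported in the Results text.
counts = {
    "sublexical": (25, 13),
    "lexical": (112, 52),
    "sentence": (1, 5),
    "discourse": (14, 6),
}

total = sum(rd + nr for rd, nr in counts.values())

for level, (rd, nr) in counts.items():
    level_total = rd + nr
    share_of_all = 100 * level_total / total   # share of all coded measures
    share_rd = 100 * rd / level_total          # researcher-developed share within level
    print(f"{level:<10} n={level_total:>3}  {share_of_all:4.1f}% of all  "
          f"{share_rd:4.1f}% researcher-developed")

print("total", total)  # total 228
```

The four level totals (38, 164, 6, and 20) sum to the 228 coded outcome measures.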
Patterns differed somewhat by study design. Group design studies (n = 39) more frequently included standardized assessments alongside proximal measures, particularly at the lexical level. In contrast, SCD studies (n = 20) relied almost exclusively on proximal outcomes, most often researcher-developed spelling probes administered repeatedly to capture functional relations over time. Although this approach is consistent with the methodological traditions of SCD research, it further reinforces the overall pattern: outcomes remain concentrated at the word level, and few studies—regardless of design—assess sentence- or discourse-level writing.
In addition to the distribution of measures by linguistic level, we examined the types of proximal outcome measures reported. Across studies, proximal measures were most often researcher-developed word-level spelling probes, typically scored as percent correct on dictated word lists, often distinguishing taught from untaught words. Sublexical tasks, such as phoneme–grapheme mapping or orthographic choice, were less common, and only a handful of studies incorporated sentence dictation or sentence-writing tasks. A small subset of studies included discourse-level proximal outcomes, such as expository writing samples, though these were rare. Overall, proximal measures were dominated by lexical-level spelling probes, reinforcing the pattern that most outcome measurement remains narrowly focused on lexical-level accuracy. In contrast, standardized assessments were more varied by instrument name but similarly concentrated at the lexical level, with only limited coverage of sentence- and discourse-level writing.
Standardized Assessments Used in Spelling Intervention Research
Across the 59 studies, 16 unique norm-referenced assessments were identified, yielding 76 outcome measures. As shown in Table 2, the use of standardized assessments was heavily concentrated at the lexical level (n = 52; 68.4%), with far fewer measures at the sublexical (n = 13; 17.1%), sentence (n = 5; 6.6%), or discourse levels (n = 6; 7.9%).
As represented in Table 3, standardized assessments were distributed across a range of subtests rather than concentrated in a single measure. The most commonly used standardized spelling assessments were the Wide Range Achievement Test–Spelling subtest (WRAT Spelling), the Wechsler Individual Achievement Test (WIAT) Spelling, and the Test of Written Spelling (TWS). Word reading was also commonly assessed using standardized, norm-referenced measures. The most frequent were the Woodcock Reading Mastery Tests (WRMT) Word Identification and Word Attack subtests, and the Test of Word Reading Efficiency (TOWRE). Sublexical skills were most often measured using the Comprehensive Test of Phonological Processing (CTOPP) and Dynamic Indicators of Basic Early Literacy Skills (DIBELS).
Table 3. Most Common Standardized Assessments in Spelling Intervention Research.
Note. WRMT = Woodcock Reading Mastery Tests; WRAT = Wide Range Achievement Test; WJ = Woodcock-Johnson; WIAT = Wechsler Individual Achievement Test; CTOPP = Comprehensive Test of Phonological Processing; TOWRE = Test of Word Reading Efficiency; PAL = Phonological Awareness for Literacy; DIBELS = Dynamic Indicators of Basic Early Literacy Skills.
Far fewer standardized measures targeted broader writing outcomes. Only the WIAT Written Expression subtest and the Woodcock-Johnson (WJ) Writing cluster were identified as discourse-level measures, while sentence-level standardized tasks remained rare. A wide range of other assessments—such as the Gray Oral Reading Test (GORT), the Kaufman Test of Educational Achievement (KTEA), and several vocabulary measures—appeared only once across studies, which highlights the limited and uneven use of standardized tools for assessing higher-level literacy outcomes.
The Use of Proximal and Standardized Outcome Measures
We next examined whether studies included both proximal (researcher-developed) and standardized (norm-referenced) outcome measures. Across the 59 studies, most relied exclusively on proximal assessments (k = 37; 62.7%). A smaller subset (k = 22; 37.3%) incorporated both proximal and standardized outcomes, allowing researchers to capture sensitivity to instruction while also evaluating generalizability. Only a handful of studies (k = 4) relied solely on standardized assessments without including any proximal tasks. Patterns differed by study design. Among group design studies (k = 39), 56.4% (k = 22) included both proximal and standardized measures, 13 relied exclusively on proximal measures, and only 4 relied solely on standardized measures.
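The breakdown for group design studies can be checked directly from the reported counts. This is a small arithmetic sketch using only figures stated in the paragraph above; the labels and variable names are illustrative.

```python
# Group design studies by outcome pairing, as reported in the text.
group_pairing = {
    "both proximal and standardized": 22,
    "proximal only": 13,
    "standardized only": 4,
}

k_group = sum(group_pairing.values())
shares = {label: round(100 * k / k_group, 1) for label, k in group_pairing.items()}

print(k_group)  # 39
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

The three categories account for all 39 group design studies, and the share with both measure types matches the reported 56.4%.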
Discussion
Across 59 studies and 228 outcomes, assessments were clustered overwhelmingly at the lexical level (71.9%), with very few sentence- or discourse-level outcomes. Even within the lexical band, most measures were researcher-developed, whereas standardized tools were used less often. This review highlights a persistent imbalance in how spelling intervention outcomes have been assessed. Across nearly five decades of research, outcome measurement has been overwhelmingly concentrated at the word level, with relatively few studies examining sentence- or discourse-level writing. This narrow focus limits our understanding of whether interventions support the broader goal of writing—and overall literacy—development. From a theoretical perspective, the IDL model predicts that growth in spelling should contribute to higher-order literacy outcomes, including sentence fluency and discourse-level composition. Yet, without assessments that capture these levels, we cannot determine whether interventions achieve transfer beyond explicitly taught sublexical- or lexical-level skills. For students with or at risk for LD—who often struggle with generalization—the lack of higher-order outcomes leaves a critical gap: interventions may show success on tightly aligned proximal tasks but fail to demonstrate improvements in authentic writing performance.
The Landscape of Outcome Measures in Spelling Intervention Research
Our descriptive synthesis showed that roughly seven in ten outcome measures (71.9%) were lexical, with far fewer assessing sublexical-, sentence-, or discourse-level skills. Proximal measures were especially dominated by lexical-level spelling probes, such as dictated word lists scored for percent correct. While these outcomes are sensitive to instructional effects, they provide only a partial picture of spelling development. From the perspective of the IDL model, such measures capture growth at the lexical level but fail to assess whether improvements extend or cascade upward into sentence- and discourse-level writing. Similarly, from Apel and Masterson’s (2011) perspective, global lexical-level accuracy scores obscure the specific phonological, orthographic, and morphological strategies students use when spelling.
This pattern echoes broader findings in literacy intervention research. Reading intervention studies frequently demonstrate gains on proximal decoding tasks without corresponding improvements in comprehension (e.g., Wanzek & Vaughn, 2007; Wanzek et al., 2010), which highlights the difficulty of demonstrating transfer to higher-order outcomes. Research on writing intervention shows a similar imbalance: transcription-focused interventions often improve spelling or handwriting but rarely assess, or find weaker effects on, discourse-level composition quality (Graham et al., 2018). For students with LD, this gap is particularly consequential, as difficulties in spelling have been shown to constrain the length, complexity, and quality of written text (Berninger et al., 2002; Graham, 1997). The dominance of lexical-level outcomes in spelling intervention research therefore limits our ability to determine whether interventions contribute to the higher-order writing skills that are most critical for academic success.
What the Use of Standardized Spelling Assessment Reveals
Our review also showed that standardized assessments, when used, were almost exclusively concentrated at the lexical level, with WRAT Spelling, TWS, and WIAT Spelling emerging as the most common standardized spelling assessments. This reliance reflects a narrow tradition within the field—one that prioritizes global accuracy scores while overlooking more fine-grained dimensions of spelling performance and the underlying mechanisms of learning.
Although these measures are often treated as interchangeable, they differ meaningfully in their word lists, linguistic content, and psychometric properties. For instance, the proportion of polymorphemic words included varies considerably (57% in WRAT Spelling, 36% in TWS, and 63% in WIAT Spelling). Such differences introduce systematic variability in how intervention effects are quantified and compared across studies, complicating efforts to synthesize findings and draw generalizable conclusions. Moreover, because scoring is typically limited to correct/incorrect, or whole-word accuracy, these assessments conflate different error types (e.g., phonological vs. orthographic vs. morphological), making it difficult to isolate which aspects of spelling knowledge were strengthened by the intervention.
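The point about whole-word scoring conflating error types can be illustrated with a toy example. In the sketch below, two hypothetical students receive the same accuracy score on a 10-word probe even though their errors fall into different categories. The error labels are assumed to have been assigned by hand, and none of the values are data from the reviewed studies or from any published assessment.

```python
from collections import Counter

# Hypothetical hand-assigned error codes for two students on the same
# 10-word probe. None means the word was spelled correctly.
student_a = [None] * 6 + ["phonological"] * 4
student_b = [None] * 6 + ["morphological"] * 4


def whole_word_accuracy(codes: list) -> float:
    """Percent of words spelled correctly (correct/incorrect scoring)."""
    return 100 * sum(code is None for code in codes) / len(codes)


def error_profile(codes: list) -> Counter:
    """Count of errors by linguistic category."""
    return Counter(code for code in codes if code is not None)


print(whole_word_accuracy(student_a), whole_word_accuracy(student_b))  # 60.0 60.0
print(error_profile(student_a))  # Counter({'phonological': 4})
print(error_profile(student_b))  # Counter({'morphological': 4})
```

Under correct/incorrect scoring the two students are indistinguishable, whereas the error profiles point to different instructional targets.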
More importantly, none of these measures directly assess the theoretical processes central to spelling development—phonological, orthographic, and morphological knowledge—or the transfer of spelling to authentic, higher-level writing tasks. Thus, while standardized measures are often treated as distal outcomes, they may function more as loosely related proxy indicators rather than true measures of whether interventions improve students’ capacity to apply and transfer spelling knowledge to higher-order tasks.
Pairing Proximal and Standardized Outcomes
A final pattern in this review concerned how often studies combined proximal and standardized measures. Nearly two-thirds of studies relied exclusively on proximal, researcher-developed outcomes, while just over one-third included both, and only a handful used standardized measures alone. This distinction matters for how intervention effects are interpreted. Proximal outcomes, such as researcher-developed word probes, are highly sensitive to instruction and are well-suited for detecting immediate effects of targeted practice. However, when used in isolation, they cannot address whether students can apply spelling knowledge more broadly. Standardized assessments, by contrast, are often positioned as distal outcomes, yet our review shows that they are inconsistently used and limited largely to global accuracy at the word level.
A key consideration in interpreting these patterns is the study design. In SCD, repeated measurement of a proximal dependent variable is not a weakness but a methodological feature: frequent, instruction-aligned probes are central to demonstrating experimental control through changes in level, trend, and replication across phases (Horner et al., 2005). These outcomes maximize sensitivity to intervention effects, but they rarely extend to sentence- or discourse-level transfer.
By contrast, group designs provide a stronger opportunity to balance proximal sensitivity with distal generalizability. Yet, in our sample of 39 group design studies, while more than half (n = 22) included both proximal and standardized outcomes, 13 relied exclusively on proximal measures. This pattern highlights a missed opportunity: group studies, in particular, should be structured to capture both the skills directly taught and the extent to which those skills generalize to standardized and higher-order writing outcomes.
Taken together, these findings suggest that intervention research benefits from a design-aligned approach to measurement. Although SCD studies rely on repeated proximal probes, they could be further strengthened by occasional inclusion of sentence- or discourse-level outcomes to measure transfer. Group design studies, meanwhile, should routinely pair proximal and standardized assessments to ensure that intervention effects are tested for both sensitivity and generalization across multiple linguistic levels.
Implications for Intervention Research and Practice
Overall, the findings from this review underscore that outcome measurement in spelling intervention research has been narrowly defined, with the vast majority of outcomes focused at the word level and relatively few capturing sentence- or discourse-level writing. Standardized assessments, when used, were similarly concentrated on global lexical-level accuracy, and many group design studies relied solely on proximal measures without including corresponding standardized assessments. This creates a fundamental challenge for evaluating intervention effects. Instructionally sensitive outcomes are effective for detecting immediate learning, but their widespread use leaves limited evidence regarding whether spelling gains generalize to authentic writing tasks.
From a theoretical standpoint, this imbalance in measurement limits the field’s ability to test core predictions about literacy development. The IDL model highlights how growth at the lexical level should influence sentence and discourse performance (Kim, 2020), while Apel and Masterson’s (2011) linguistic perspective stresses the importance of assessing the specific strategies students use when spelling. Yet, most intervention studies provide little information about whether improvements in spelling contribute to higher-order writing outcomes or whether interventions alter the phonological, orthographic, and morphological processes that underlie spelling growth.
For students with or at risk of LD, this gap has practical consequences. This group of learners often struggles with transfer of skills learned in isolation, making it critical to determine not only whether an intervention improves spelling of taught words but also to what extent those gains extend to written expression. A comprehensive approach to outcome measurement (i.e., one that captures both proximal sensitivity and distal generalization) is therefore essential. Such measurement is necessary not only for evaluating the effectiveness of interventions in research, but also for guiding instructional decisions in practice. Without it, the field risks both overstating intervention effects and underestimating the role of spelling in writing and overall literacy development.
Limitations
Several limitations of this review warrant consideration. First, our analyses were descriptive and relied on the information reported in primary studies. Proximal measures were often described in general terms (e.g., “researcher-developed spelling probe”), which limited our ability to fully characterize their linguistic focus and psychometric quality. Second, some standardized assessments appeared only once or twice across studies, making it difficult to draw strong conclusions about their frequency of use or comparability. Finally, because this review drew on outcome coding from a prior meta-analysis, it reflects the scope and detail of that database rather than a newly conducted comprehensive search.
Conclusion
Despite decades of intervention research, outcome measurement in spelling intervention studies remains narrowly concentrated at the lexical level, with limited attention to sentence- and discourse-level writing. Standardized assessments most often reflect global word accuracy, and studies too seldom include both proximal and standardized measures. In SCD studies, standardized measures are rarely included, whereas in group design studies the pairing of proximal and standardized measures is inconsistent. This measurement pattern leaves a critical blind spot: we know that interventions can improve spelling and word reading of taught subskills and words, but we know far less about whether these gains generalize to authentic writing tasks. Situating these findings within Apel and Masterson’s (2011) call for theory-guided assessment and Kim’s (2020) IDL model, we conclude that spelling intervention research must more purposefully design assessment batteries that capture both the mechanisms of spelling development and the transfer of spelling gains into higher-order literacy outcomes. Only then can the field fully evaluate whether spelling interventions equip students with the skills necessary for fluent and accurate written expression.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
