Abstract
Replications provide credibility by demonstrating under what conditions experimental findings can be repeated, the premise behind evidence-based practices. Replications in single-case research also investigate generalization of findings across groups. For groups with high variability, such as individuals with autism, assumptions of generalizability should be based on learners who are similar in critical ways. The purpose of this study was to use Coyne et al.'s framework for replication and the next generation guidelines for single-case research to extend understanding of “for whom” and “under what conditions” modified schema-based instruction (an established evidence-based practice for individuals with autism) is effective. In this distal conceptual replication of Root et al., contextual and instructional variables of theoretical and practical importance were intentionally manipulated or maintained and reported to model transparency and support replicability. Four high school students receiving special education under the Individuals With Disabilities Education Act category of autism were taught mathematical and social problem-solving behaviors within the context of percentage-of-change word problems. Researchers used modified schema-based instruction and augmented reality in a one-on-one setting and assessed generalization to purchasing in the food court of a mall biweekly. We frame our discussion around the recommendations for replication research from Coyne et al. and recommendations for single-case research from Ledford et al., concluding with suggestions for future replications that use single-case research designs.
The use of scientifically validated practices is both necessary to improve educational outcomes and legally mandated. The variable strengths and needs of learners with autism spectrum disorder (ASD) mean identification of such practices relies on repeated experimental demonstrations of the effectiveness of focused intervention practices that address a specific skill or goal with learners who are similar in critical ways (e.g., age, diagnosis, cognitive abilities, language level; Odom et al., 2010). If the observed effectiveness of the intervention is real, it should be reproducible when evaluated with similar learners under similar conditions (Simons, 2014). Replications provide credibility by demonstrating the findings can be repeated and/or generalized, answering the question “What works, for whom, under what conditions?”
In this study, we conducted a distal conceptual replication of Root et al. (2022) to empirically test if the intervention would be effective for a different group of students in a different environment (high school vs. transition program). To provide background, we first highlight how a series of replication studies accumulated evidence for modified schema-based instruction (MSBI), a mathematics word-problem-solving intervention. Then we discuss the role of replication in identifying and understanding the boundaries of generalization of evidence-based practices. Finally, we present the purpose and research questions of this replication study.
Building the Evidence Base Through Conceptual Replications
Establishing evidence-based practices depends on an accumulation of evidence from methodologically rigorous experimental studies (Cook et al., 2016; Therrien et al., 2016). Even if not explicitly identified as replications, researchers have used replication logic to demonstrate “for whom” and “under what conditions” a practice is effective (Travers et al., 2016). Although there is variability across professional groups on the degree of evidence needed for a practice to meet the criteria of “evidence based,” replication of effects across multiple methodologically rigorous single-case-design studies is a consistent requirement.
Root, Jimenez, et al. (2020) described how a replication framework was used across 16 research studies published between 2012 and 2020 to develop and refine interventions to teach number sense and word problem solving to students with extensive support needs, including ASD. They detail the initial word-problem-solving studies carried out by Browder and colleagues through the Solutions Project. This 4-year, federally funded project by the Institute of Education Sciences, National Center for Special Education Research (Grant No. R324A130001), focused on using schema-based instruction (SBI) to teach mathematical word problem solving to students with ASD and moderate intellectual disability (ASD/ID).
MSBI to Teach Additive Schemas
Grounded in neuroscience, schema theory is concerned with how the brain structures knowledge. It emphasizes the role of prior experience in acquiring new knowledge and reinforces using tools such as graphic organizers to bridge this new knowledge to older knowledge (Merriam et al., 2007). Schemas are dynamic in that they are influenced by new information and experiences (Anderson et al., 1983). In mathematics, SBI explicitly teaches students to use repeated reasoning to recognize common structures of word problems (i.e., schemas) and partners those structures with schematic diagrams (i.e., graphic organizers). Browder and colleagues (2018) enhanced the instructional methods of traditional SBI (Fuchs et al., 2004; Jitendra & Star, 2011) by supplementing with evidence-based practices for students with more extensive support needs (i.e., moderate to severe intellectual disability, profound autism). Spooner et al.'s (2017) conceptual model for this MSBI outlines the evidence-based practices incorporated to support students to (a) access word-problem-solving tasks, (b) develop conceptual and procedural skills needed to solve the word problem, and (c) generalize.
The initial work of the Solutions Project demonstrated that students with ASD/ID could learn to solve and discriminate between additive schemas. In the initial MSBI study, Saunders (2014) taught three elementary students with ASD/ID to solve and discriminate between two additive problem types (group and change). Root, Browder, et al. (2017) built on the work of Saunders (2014) by using MSBI to teach participants with similar characteristics to solve the third additive schema (comparison). Their findings informed the pilot evaluation of small-group MSBI instruction with eight elementary students with moderate intellectual disability (Browder et al., 2018). This pilot intentionally altered variables related to interventionist (teacher instead of researcher), scope (all three problem types), and instructional format (small group) to build evidence of MSBI's effectiveness. Concurrent conceptual replications were conducted to extend boundaries on generalization, including (a) varying the location of the unknown quantity (Root & Browder, 2019), (b) extending to quantities above 10 (Root et al., 2018), (c) monetary quantities with decimals (Root, Saunders, et al., 2017), (d) peer tutors as interventionists (Ley Davis et al., 2022), and (e) acquisition of literacy skills as nontargeted information through instructive feedback (Brosh et al., 2018).
MSBI to Teach Multiplicative Schemas
Root and colleagues extended the application of MSBI to multiplicative schemas through a series of conceptual replications, including four single-case studies targeting the percentage-of-change schema (see Table 1). In the original study, Root, Cox, et al. (2020) taught middle school students with ASD/ID to solve percentage-of-change problems related to personal finance (i.e., finding final cost with tip or coupon). Two participants were provided with additional self-management supports to maintain engagement. A multiple-probe-across-participants design found a functional relation between MSBI and an increase in mathematical problem-solving behaviors. Two participants were able to generalize to nonfinance percentage-of-change problems.
Table 1. Modified Schema-Based Instruction to Teach Secondary Students With Autism to Solve Percentage-of-Change Word Problems.
Note. A, B, C, and D refer to respective participants in study. F = female; M = male; ASD = autism spectrum disorder; ID = intellectual disability; BC-SMD = between-case standardized mean difference. Table footnotes: What Works Clearinghouse 4.0 criteria for determining level of evidence of causal relationship; skill deficit; performance deficit.
In two subsequent replications, Root et al. (2018) and Root, Cox, et al. (2021) refined measurement, added procedures for supporting reasoning, and increased opportunities for generalization. In both studies, secondary students with ASD/ID were taught to solve percentage-of-change problems within the context of finding the total cost when using a coupon (e.g., 15% off coupon for a $20 car wash) and had opportunities to generalize skills from word problems to real-world stimuli (e.g., receipts and coupons). Each study found a functional relation between MSBI and problem-solving skills. Although these findings contribute to the evidence base supporting the effectiveness of MSBI for secondary students with ASD/ID, participants demonstrated differential patterns of responding and needs for supports. Table 1 outlines the response-guided changes made by researchers based on skill and performance deficits.
Table 2. Study Dimensions Held Constant and Intentionally Varied Between Studies.
Note. ASD = autism spectrum disorder; BC-SMD = between-case standardized mean difference. Table footnote: items intentionally varied.
In their next study targeting percentage-of-change problems, Root and colleagues considered the nonmathematical behaviors necessary to generalize personal finance skills, given that the intended outcome of mathematics instruction is generalization (i.e., authentic application). Browder et al. (2018) argued it is insufficient to teach students what to do without teaching when or why. Completing percentage-of-change tasks in the real world requires mathematical reasoning and social interactions. This motivated Root et al. (2022) to combine MSBI with an evidence-based social intervention (video modeling; Steinbrenner et al., 2020) via augmented reality (AR).
Root et al. (2022) used three types of videos along with MSBI to develop social and mathematical problem-solving behaviors of four 21-year-old students with ASD/ID who were enrolled in a postsecondary transition program. The participants independently used the camera on an iPhone or iPad within an AR application to trigger and display each video. Anchor videos depicted a young adult making a purchase in a community location and leaving a tip. A second video (i.e., social problem solving) continued in the scene with the same young adult reviewing their receipt and modeling how to notify the employee of an error. Using a general-case approach, the social problem-solving videos depicted a range of social responses from employees and appropriate responses from the young adult. After solving the word problem, participants could change their own answers after watching a point-of-view video model of a researcher solving the problem (with narration). There was a functional relation between the intervention and personal finance behaviors, but self-correction was variable. Although three of the four participants demonstrated generalization of their problem-solving skills to the local campus snack bar postintervention, lack of repeated measures limits making a causal inference.
Understanding Boundaries of Generalization via Replication Research
Replication of effects is central to the logic of single-case research designs (Sidman, 1960), the methodology used in 83% of the intervention studies for students with ASD published between 1990 and 2017 (Steinbrenner et al., 2020). A systematic literature review on word problem solving for students with ASD found all studies using MSBI used single-case designs, primarily multiple probe across participants (Root, Ingelin, et al., 2021). The emphasis on replication in single-case research design is reflected by the field's continued agreement that (a) demonstrations of effect replicated at different points in time are necessary for causal inference and (b) effect size estimates should be secondary to visual analysis (Ledford et al., 2023). Further, determination of evidence-based practices relies on systematic replications across participants, location, and research teams to increase external validity of generalizable results.
As researchers continue to use single-case research designs in the quest to identify effective practices for individuals with ASD, Ledford et al. (2023) urge us to attend to the contexts in which an intervention has demonstrated effectiveness (or ineffectiveness) over simply summing the quantity of evidence. They argue the latter approach does not capture the limitations of evidence, ignores boundaries on generalization, and leaves questions about social and external validity. It is not possible to answer “what works, for whom, under what conditions” without explicitly examining the dimensions of an intervention (operationalizing the “what”), how its outcomes are measured (defining “works”), the characteristics of participants (specifying “for whom”), and the multifaceted aspects of the contexts (analyses of “under what conditions”) in which it has been implemented.
Present Study
The purpose of this study was to extend understanding of for whom and under what conditions MSBI is effective for teaching word problem solving to secondary students with ASD. We used Coyne et al.'s (2016) framework for replications and the next-generation guidelines for single-case research (e.g., Ledford et al., 2023) to guide this distal conceptual replication of Root et al. (2022). Table 2 outlines the contextual and instructional variables that were intentionally varied between the two studies. We addressed four research questions: (a) What is the effect of MSBI and video-based instruction delivered via AR on the personal-finance problem-solving skills of four high school students with ASD/ID measured by visual analysis? (b) How often do the students use the point-of-view video model to self-correct? (c) In what ways do the students generalize from word problems to a real-world setting? (d) How do the findings from this study compare with those from the Root et al. (2022) study?
Method
Participants
After receiving approval from the university, school district, and principal, the researchers shared information about the research project with the lead special education teacher at a large local high school, who then shared it with two other special education teachers. All three were White and had between 5 and 15 years of experience teaching special education. To reduce influence of teacher biases about student readiness and/or behavior, we asked for consent forms to go home to all students who met the following broad inclusion criteria: (a) diagnosis of ASD or eligibility for special education services under the Individuals With Disabilities Education Act (IDEA) in the category of autism, (b) reliable communication (verbal or augmentative and alternative communication), (c) enrollment in Grades 9 to 12 at a public high school, and (d) eligibility for the state's alternate assessment. All consented participants were systematically screened to assess their ability to (a) identify money receptively and expressively, (b) identify the word “tip” and the figure “20%” expressively and receptively, (c) use an application-based calculator on a smartphone, (d) complete double-digit addition, (e) subtract and multiply with and without decimals with a calculator, and (f) solve percentage-of-change word problems. Additional facilitating skills were noted during the screening, including (a) willingness to work with the researchers, (b) stamina for attending to the task, and (c) reading skills (verbal or written). Students were invited to participate if they had prerequisite skills (multidigit addition, subtraction, and multiplication using a calculator) but were unable to solve word problems.
Due to district institutional review board requirements, researchers do not know the number of consents sent home, but all four students with signed parental consent forms assented to screening, met inclusion criteria, and assented to participate. Researchers used screening performance along with teacher input to determine any additional necessary supports for intervention (e.g., behavioral supports, instruction on using the technology). Researchers administered the Test of Mathematical Abilities–Third Edition (Brown et al., 2012) to participants prior to baseline; all four participants scored in the 1st percentile for both computation and word problems subsections.
“Devon” was a White male student and 19 years old at the time of the study. Devon was able to read monetary amounts and was the only participant who expressively and receptively shared experience using money in the community. He inconsistently used a calculator to solve addition and multiplication problems with decimals. Devon bit his arm in response to frustration. Research sessions were extended to accommodate frequent breaks (as written in his individualized education program [IEP]) to create lists of preferred topics (movie titles). Devon typically received support to stay engaged and on task (reported across teachers and locations in IEP data). He was agitated, frustrated, and self-injurious (i.e., biting his hand) during baseline when instruction was withheld.
“David” was a Black male who was 16 years old at the time of the study. David enjoyed working with the research team and frequently asked when the researchers would work with him. David's classroom teachers described him as a student who was eager to complete academic tasks but who often took longer to get started than they expected. During screening sessions, David smiled when it was time to work with the team and made eye contact with the interventionist before writing his answer (which we interpreted as waiting for instruction or a prompt). Prior to the intervention, David was able to identify money expressively and receptively and used an application-based calculator to add and multiply with decimals.
“Wes” was a ninth-grade White male who was 16 years old at the time of the study. He was supported in all classes by a one-to-one paraprofessional. Wes receptively identified money ($6.59) but could not expressively state the amount. He was able to use a calculator to add and multiply numbers without decimals and demonstrated fine-motor difficulties when writing.
“Trevon” was a 16-year-old Black male student. Due to district transportation policies, Trevon's family had to drive him to and from school. The family reported his absences from school (observed in Figure 1) were related to this policy and the medical needs of other family members. On the prescreening tool, Trevon identified money receptively but not expressively. He was able to use a calculator to add and multiply two-digit numbers with decimals.

Figure 1. Effectiveness of modified schema-based instruction.
Setting and Interventionists
All participants were enrolled in a large public high school located in the southeastern United States. Participants attended nonacademic courses (i.e., electives) with same-age typically developing peers. Participants changed classes with the school bell schedule and attended academic classes (e.g., math, English) that were aligned to alternate achievement standards and taught by certified special education teachers. All participants attended a mathematics class for 45 min each day. According to the mathematics teacher, instruction during the time of the study was on addition and subtraction of equations with a calculator and identifying and counting money.
Sessions took place in a separate classroom or office space and lasted approximately 30 min two or three times per week. The time of sessions rotated so that participants were not consistently pulled from the same class. Generalization sessions took place at counter-service restaurants in the food court of a nearby mall at the time the mall opened (10:00 a.m.). The special education teacher provided transportation to the mall for all generalization sessions.
Three members of the research team rotated as interventionist, as this was a variable held constant between the original study and this replication (see Table 2). Each was a White female. The lead interventionist (first author) had a PhD in special education with a focus on mathematics instruction for students identified with autism. She was a postdoctoral fellow at the time of the study, had experience as a classroom teacher, and was a parent of a child with an IEP in the same district. The second interventionist (fourth author) was a graduate student completing her final year of a teacher preparation program and had prior experience with individuals with ASD/ID as a Registered Behavior Technician. The third interventionist (third author) was a PhD student in special education who had experience as a classroom special education teacher in a high school setting in a different city. The first and second authors developed the intervention and trained the two graduate students to serve as “teachers” to maintain integrity of the carefully controlled replication. The lead interventionist facilitated a 2-hr training for the second and third interventionists using behavioral skills training.
Design and Measurement
This study followed the Ledford et al. (2023) guidelines to carry out a concurrent multiple-probe-across-participants design that included four potential demonstrations of effect. The intervention combined MSBI and video-based instruction using AR to teach the mathematical computation and social skills needed to complete a transaction that includes leaving a tip (i.e., personal-finance problem-solving skills). Three student outcomes were measured: (a) level of personal-finance problem-solving skills (primary dependent variable), (b) frequency of self-corrections, and (c) demonstration of generalization (see operational definitions later).
All four participants began baseline simultaneously, as shown in Figure 1 by three overlapping initial baseline data points. Devon began intervention first after three baseline probes of stable data due to ethical concerns over the impact of continued baseline sessions. Researchers observed increasing anxiety and self-injurious behavior caused by the baseline conditions, specifically when they did not help him solve the baseline problems after he asked, and subsequently prioritized Devon's request for help over the recommended five baseline data points. The remaining three participants were randomly assigned to tiers using a random generator in Excel because their baseline data were stable and there were no ethical concerns. Participants stopped engaging in intervention once they had a stable data pattern indicating mastery, which was defined as three intervention sessions with at least 10 out of 12 behaviors completed independently and correctly, including the correct final solution.
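The mastery criterion described above can be sketched as a small check. This is an illustration only, not the researchers' procedure: the `Session` record, the function names, and the option to require consecutive sessions are assumptions introduced here, since the text does not state whether the three sessions had to be consecutive.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Session:
    """One intervention session (hypothetical record, not from the study)."""
    independent_correct: int      # behaviors completed independently and correctly (0-12)
    final_solution_correct: bool  # whether the final solution was correct


def meets_criterion(session: Session, min_correct: int = 10) -> bool:
    """At least 10 of 12 behaviors correct, including the correct final solution."""
    return session.independent_correct >= min_correct and session.final_solution_correct


def reached_mastery(sessions: Sequence[Session], needed: int = 3,
                    consecutive: bool = True) -> bool:
    """Return True once `needed` sessions meet the criterion.

    `consecutive=True` is an assumption; the source only says
    "three intervention sessions" within a stable data pattern.
    """
    if not consecutive:
        return sum(meets_criterion(s) for s in sessions) >= needed
    run = 0
    for s in sessions:
        run = run + 1 if meets_criterion(s) else 0
        if run >= needed:
            return True
    return False
```

Under the consecutive reading, a session below criterion resets the count; passing `consecutive=False` gives the looser reading.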
Effect Size
In addition to visual analysis, between-case standardized mean difference (BC-SMD) was calculated using the free scdhlm web application to estimate the magnitude of the effect (Pustejovsky et al., 2023). BC-SMD is referred to as the “design-comparable effect size,” which is recommended by the What Works Clearinghouse (WWC; 2022) and can facilitate the comparison of effects between single-case research designs and group designs. This quantitative metric uses multilevel modeling to calculate an average effect across the participants, allowing for comparison to the replicated study. BC-SMD also allows researchers to account for trend and variability between cases (Shadish et al., 2015). After visual analysis of the data, the first author confirmed the data met the assumptions for BC-SMD outlined by Valentine et al. (2016). Due to the time trends observed in the data, restricted maximum likelihood (REML) estimation was chosen over moment estimation. Data trends were similar to the previous study, and therefore “level” was selected for baseline phase and a “change in linear trend” was selected for the treatment phase. Fixed effects and random effects were included for baseline, and fixed effects for level and trend were included for intervention. To account for possible ceiling effects, we set the follow-up time as nine, predicting six intervention sessions would lead to observed effects.
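For orientation, the simplest form of the BC-SMD can be written out. This sketch assumes a model with no time trends, which is simpler than the model with a change in linear trend specified in this study; the symbols are introduced here for illustration and are not the authors' notation.

```latex
% Simplest multilevel model for case i at measurement occasion j,
% with D_{ij} = 0 during baseline and D_{ij} = 1 during treatment
% (assumption: no time trends in either phase).
Y_{ij} = \beta_0 + \beta_1 D_{ij} + \eta_i + \epsilon_{ij},
\qquad \eta_i \sim N(0, \tau^2), \quad \epsilon_{ij} \sim N(0, \sigma^2)

% Between-case standardized mean difference: the average treatment effect
% scaled by the total (between-case plus within-case) standard deviation.
\delta = \frac{\beta_1}{\sqrt{\tau^2 + \sigma^2}}
```

When trends are modeled, the numerator is instead the model-implied difference between treatment and baseline at a chosen follow-up time, which is why a follow-up time has to be specified.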
Dependent Variables
The primary dependent variable was personal-finance problem-solving behaviors, which included the following three mathematical behaviors and three social problem-solving behaviors: (1) state why the receipt was correct or incorrect (social); (2) react appropriately to the receipt by using the receipt or requesting a new one (social); (3) calculate appropriate tip by multiplying total on receipt by .20 (mathematical); (4) write down tip amount in correct dollar format, including a dollar sign and decimals (mathematical); (5) calculate final cost by adding tip amount to total cost and writing answer in correct dollar format using dollar sign and decimals (mathematical); and (6) indicate what to do with signed receipt (social). This measurement aligned with the original study except for the first behavior, which was altered to address limitations expressed by Root et al. (2022). The initial measurement was whether or not the participant indicated they needed a new receipt, which did not measure if the student understood why they were or were not asking for a new receipt. Participants had the opportunity to demonstrate 12 behaviors across the two problems they solved in each session.
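The arithmetic behind the three mathematical behaviors (Steps 3 to 5) can be illustrated with a short sketch. The function names are hypothetical, and rounding half up to the nearest cent is an assumption; the source specifies only the .20 multiplier and the dollar format with a dollar sign and decimals.

```python
from decimal import Decimal, ROUND_HALF_UP

TIP_RATE = Decimal("0.20")  # Step 3 multiplies the receipt total by .20
CENT = Decimal("0.01")


def to_dollars(amount: Decimal) -> Decimal:
    """Round to two decimal places (rounding mode is an assumption)."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def format_dollars(amount: Decimal) -> str:
    """Write an amount in dollar format, with dollar sign and decimals."""
    return f"${to_dollars(amount)}"


def tip_and_final_cost(receipt_total: str) -> tuple[str, str]:
    """Return (tip, final cost) as formatted strings for a receipt total."""
    total = Decimal(receipt_total)
    tip = to_dollars(total * TIP_RATE)   # Step 3: calculate the tip
    final_cost = to_dollars(total + tip) # Step 5: add tip to total cost
    return format_dollars(tip), format_dollars(final_cost)


print(tip_and_final_cost("20"))  # ('$4.00', '$24.00')
```

`Decimal` is used instead of floating point so that cent amounts stay exact.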
The second dependent variable was self-corrections of Steps 3, 4, 5, and 6 after watching the point-of-view model video (see Procedures for when and how this was assessed). Depending on how many steps they independently completed correctly, students had up to eight opportunities for self-corrections across the two problems.
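The counts of 12 scored behaviors and up to eight self-correction opportunities follow directly from six steps and two problems per session. The sketch below shows one way to tally them. The data layout (a dict of step number to independent-correct flag) is hypothetical, and it assumes an opportunity arises for each of Steps 3 to 6 that was not completed independently and correctly.

```python
STEPS = (1, 2, 3, 4, 5, 6)           # six scored behaviors per problem
SELF_CORRECTABLE_STEPS = (3, 4, 5, 6)  # steps covered by the model video


def session_score(problems: list[dict[int, bool]]) -> int:
    """Count behaviors completed independently and correctly (max 12 for two problems)."""
    return sum(1 for p in problems for step in STEPS if p[step])


def self_correction_opportunities(problems: list[dict[int, bool]]) -> int:
    """Count Steps 3-6 that were not independently correct (max 8 for two problems)."""
    return sum(1 for p in problems for step in SELF_CORRECTABLE_STEPS if not p[step])


# Hypothetical session: first problem fully correct, second with errors on Steps 4 and 5.
problem_1 = {step: True for step in STEPS}
problem_2 = {1: True, 2: True, 3: True, 4: False, 5: False, 6: True}
print(session_score([problem_1, problem_2]))                  # 10
print(self_correction_opportunities([problem_1, problem_2]))  # 2
```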
The third dependent variable was demonstration of generalization. A member of the research team accompanied students to the food court of the mall to observe their behavior when given an opportunity to make a purchase and leave a tip. Observations were recorded qualitatively to capture the participants’ behavior for each generalization probe.
Materials
Materials for the study were based on those used by Root et al. (2022). During baseline and intervention, materials included (a) a picture menu of 12 locations within the participants’ community where they could use the targeted mathematics skill (e.g., coffee shop, car wash); (b) worksheets that contained word problems and receipts aligned with the community locations, a schematic diagram, and AR markers (see Figure 2); (c) task analysis (described further later); (d) an iPad to use three apps (calculator, HP Reveal, and task analysis); and (e) researcher-created videos. Generalization materials included (a) cash, (b) a mock receipt for mall food vendor, and (c) an iPhone calculator app.

Figure 2. Worksheet from barber theme. Note. The worksheet was created by the research team. A version of this worksheet first appeared in Remedial and Special Education (Root et al., 2021).
As in the original study, the research team created a picture menu of 12 locations in the participants’ community where customers may leave a tip (e.g., barbershop, restaurant). Researchers followed recommendations for writing word problems (Spooner et al., 2017) to create two word problems for each community location. The word problems contained three sentences that introduced the community location, described the purchase made, and asked what the total cost of the purchase would be with a tip (e.g., “You went to a florist to buy your mom flowers. You ordered one dozen roses. If you leave a tip, what will your total cost be?”).
As shown in Figure 2, each worksheet displayed (a) one word problem, (b) a correct or incorrect receipt, (c) the schematic diagram, and (d) markers to trigger videos using the HP Reveal app via the iPad's camera. Worksheets were all double sided to avoid faulty stimulus control (i.e., determining whether receipt is correct based on if worksheet has another receipt shown on the back). Each side contained the same word problem, but one side had a correct receipt (e.g., corresponded to word problem) and the other side had an incorrect receipt that did not correspond to the word problem (e.g., receipt had two dozen roses instead of one dozen).
The task analysis was the same as the original study but was displayed on a researcher-created noncommercial app designed for an iPad that had more supports than the original visual schedule app. The app had text-to-speech functionality (e.g., touching a step would read it aloud) and a built-in response prompt. If participants skipped a step when self-monitoring (e.g., tried to check off Step 4 when Step 3 had not been checked off), the app would highlight the skipped step. Participants also had the choice to use an identical task analysis printed on 8.5-by-11-inch paper.
The same three types of videos were used as the original study (anchor, social problem solving, and model). Details on when and how the videos were used can be found in the Procedures section. Anchor videos gave context for applying the mathematical skill in each of the 12 community locations. They showed a young adult completing a purchase with a narrator providing an explanation (e.g., “A coffee shop is a nice place to go to share a drink with a friend. . . . You tip your barista to thank them for their hard work making your drink”). Social problem-solving videos for each community location gave examples of the range of social responses from employees when a receipt was incorrect (e.g., apologetic, defensive) and examples of how a customer should appropriately react. Model videos showed a member of the research team solving the word problem using point-of-view video modeling and think-alouds.
Procedures
Prior to baseline sessions, all participants were given the chance to navigate between the phone-based calculator, HP Reveal, and the task analysis apps. Each participant was able to do this independently and did not require additional training or support.
Baseline
Baseline sessions began with the participant selecting the community location to serve as the theme for the videos and problems. Locations (and therefore problems) were not repeated until the participant had already completed all 12. Next, participants used the HP Reveal app to trigger the anchor video for the community location. Interventionists followed a predetermined random sequence for which side of the worksheet to display first (i.e., correct or incorrect receipt). The problem was read aloud by either the participant or interventionist before the interventionist asked, “Is this receipt correct?” followed by, “How do you know?” and finally, “Do you need a new receipt?” If the participant said yes, then the paper was flipped over (to the correct receipt side), regardless of whether they were correct. Next, the interventionist said, “Show me how to solve this problem.” The researcher provided generic praise (e.g., “You are working so hard”) and pacing prompts (e.g., “What's next?”) but no affirmative or corrective feedback. Data were collected on two problems in the same format. At the end of each problem, the interventionist asked, “What would you do with your receipt after you wrote down the final cost?”
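Two bookkeeping rules in this procedure lend themselves to a brief sketch: locations are not repeated until all 12 have been used, and the worksheet side shown first follows a predetermined random sequence. The class, function, location labels, and seed below are hypothetical; the source does not describe how the researchers tracked locations or generated the sequence.

```python
import random


class LocationMenu:
    """Tracks which community locations a participant can still choose."""

    def __init__(self, locations):
        self._all = list(locations)
        self._remaining = list(locations)

    def available(self):
        """Locations not yet used in the current pass through the menu."""
        return list(self._remaining)

    def choose(self, location):
        """Record the participant's choice; reset once every location has been used."""
        if location not in self._remaining:
            raise ValueError(f"{location!r} has already been used in this pass")
        self._remaining.remove(location)
        if not self._remaining:
            self._remaining = list(self._all)


def side_sequence(n_problems, seed):
    """Fix in advance which receipt side ('correct' or 'incorrect') is shown first."""
    rng = random.Random(seed)  # seeded so the sequence is predetermined and reproducible
    return [rng.choice(("correct", "incorrect")) for _ in range(n_problems)]


# Placeholder labels; the 12 actual locations are not all named in the text.
menu = LocationMenu([f"location_{i}" for i in range(1, 13)])
menu.choose("location_1")
print(len(menu.available()))  # 11
```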
Intervention
Following procedures from Root et al. (2022), the interventionist provided explicit instruction across five lessons to build relevant vocabulary and conceptual and procedural knowledge. Next we describe the content of each lesson as well as if and how it differed from the original study. No data were collected on dependent variables during these lessons because participants did not have an opportunity to independently perform the measured behaviors. This is reflected in Figure 1 by the gaps between the phase lines and first intervention data points for each participant.
In Lesson 1, participants were taught vocabulary terms (e.g., “dollar,” “percent,” “addition,” “subtraction”), the concept of writing monetary values, and what to do with a completed receipt. This first lesson differed from the previous study in that we intentionally introduced the written notation of money in the form of pennies. During the second lesson, participants were taught to convert whole numbers and numbers with one, two, three, or four decimals into monetary amounts (e.g., 4.5728 = $4.57). In the original study, there was no explicit instruction in converting numbers into money, only in rounding thousandths to the hundredths place. In the third lesson, participants were introduced to percentage-of-change problems and the percentage-of-change schematic diagram. The fourth lesson taught social components of leaving a tip, including writing the total cost on the receipt, signing their name, and returning the receipt to the employee. The final lesson for both studies combined the mathematical and social problem-solving skills and explicitly taught how to use the task analysis. After these lessons, participants solved two problems independently (i.e., “instructional sessions”), and data collection resumed.
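The arithmetic targeted in Lessons 2 and 3 can be illustrated with a minimal sketch. The function names, the half-up rounding rule, and the sample bill and tip percentage are assumptions made for illustration; they are not taken from the study materials.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_money(value):
    """Round a decimal number to the hundredths place (nearest cent).

    Half-up rounding is an assumed convention, not one stated in the study.
    """
    return Decimal(value).quantize(CENT, rounding=ROUND_HALF_UP)


def total_with_tip(bill, tip_percent):
    """Percentage-of-change (increase) problem: total = bill + bill * percent."""
    bill_amount = to_money(bill)
    tip = to_money(bill_amount * Decimal(tip_percent) / Decimal(100))
    return tip, bill_amount + tip


# The conversion example given in the text: 4.5728 becomes $4.57.
print(to_money("4.5728"))

# Hypothetical receipt: a $12.40 bill with a 15% tip.
tip, total = total_with_tip("12.40", "15")
print(tip, total)
```

Using `Decimal` rather than binary floating point keeps the cent-level rounding exact, which matters when the written total on a receipt is scored as correct or incorrect.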
Instructional lessons followed the same sequence as baseline (select community location, watch anchor videos, solve problem) except a multiple-opportunity-probe strategy was used for the first two behaviors because these impacted the participants’ opportunity for independent correct responses for subsequent mathematical problem-solving behaviors. First, the interventionist gave error correction if the participant did not have an independent correct response for the first social problem-solving behavior (stating whether receipt was correct or incorrect) and, if needed, for the second social problem-solving behavior (reacting appropriately to the receipt). Accuracy of the first mathematical behavior (calculate tip) is dependent on whether the receipt is correct. If the receipt was incorrect and the participant did not request a new receipt, the interventionist would explain the need for a new receipt and then model how to ask for a new receipt and instruct the participant to repeat the request. No corrective or affirmative feedback was given on the remaining mathematical and social behaviors, which preserved the opportunity for independent responding.
When the participant indicated they had finished solving the problem, the interventionist prompted the participant to use the HP Reveal app to trigger the model video for that problem. Participants were told to use this video to check their work and make any changes they wanted. Once the participant was done with self-corrections, the interventionist asked what they would do with their receipt when they were done. Finally, the interventionist reviewed the problem with the student, provided behavior-specific praise, and helped the participant make any remaining corrections that were not self-corrected while watching the model video. Procedures were repeated for the second word problem for the community location.
Generalization
In Root et al. (2022), generalization was assessed only once after the students met mastery. In this study, generalization probes took place every other week throughout all conditions in the food court of a local mall. Researchers arranged opportunities for participants to apply mathematical and social purchasing skills. Before the class arrived, the interventionist gave a cooperating employee a mock receipt that included lines for the tip, total cost, and a signature. The mock receipts were randomly assigned to be correct (two items: a snack and a drink) or incorrect (more than two items).
Students sat with their classroom teacher at a table in the food court while they waited for their turn to order. Observational learning was avoided by having participants go to the restaurant counter individually. Each participant was given cash and an iPhone with the calculator app open. The interventionist told participants they could use the money to buy one snack and one drink. When they received the receipt, the interventionist observed whether they displayed any of the mathematical or social problem-solving behaviors.
Results
We sought to determine whether participant outcomes in Root et al.'s (2022) study could be replicated. Specifically, we focused on the effectiveness of the intervention for increasing personal-finance behaviors, self-corrections using a point-of-view video model, and generalization of social and mathematical purchasing skills. Figure 1 displays the graph of independently correct personal-finance problem-solving behaviors for each participant across baseline and intervention sessions.
Intervention Effectiveness
Two members of the research team conducted visual analysis of the graphed data in Figure 1, with agreement that there was a functional relation based on occurrence of expected changes observed in three of the four possible opportunities (Ledford et al., 2023). The design-comparable effect size (BC-SMD) further supports a significant and meaningful observed effect.
Visual Analysis
Visual analysis included six data aspects: (a) level, (b) trend, (c) immediacy of effect, (d) variability, (e) overlap, and (f) consistency (Manolov & Onghena, 2022). Three of the four participants (Devon, David, and Trevon) exhibited an immediate (from the final baseline to first intervention) level change (i.e., mean) in measured behaviors after implementation of the intervention. All four participants had zero trend baseline performance, and the split middle technique (Lane & Gast, 2014) was used to compare baseline and intervention trend. The observed trend for one participant (Wes) was above baseline but below expected levels. Wes was also the only participant whose baseline and intervention data overlapped.
Effect Size
The estimated magnitude of the effect was calculated with BC-SMD (comparable to Cohen's d). The intervention was significant (t value = 3.38, p < .01). With 95% confidence, the effect was greater than zero (confidence interval = [0.66, 2.99]), and the observed magnitude can be interpreted as moderate to large (BC-SMD = 1.82, SE = 0.54). These values should be interpreted with caution, as the observed baseline data for all participants were zero and therefore contain no within- or between-person variation. The selected model assumes normal distribution across the study and therefore may inflate the observed effect.
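As a rough consistency check on the reported statistics, the sketch below recomputes the test statistic from the rounded estimate and standard error. The variable names are illustrative, and the check assumes the confidence interval is approximately symmetric about the estimate; the degrees of freedom are not reported in this section, so the critical value is only inferred.

```python
# Values reported in the text (already rounded to two decimals).
bc_smd = 1.82
se = 0.54
ci_low, ci_high = 0.66, 2.99

# Ratio of the estimate to its standard error. With rounded inputs this is
# about 3.37, close to the reported t value of 3.38.
t_ratio = bc_smd / se

# If the interval is symmetric, its midpoint should sit near the estimate.
ci_midpoint = (ci_low + ci_high) / 2

# Critical value implied by the interval half-width (an inference, not a
# reported quantity). A value above 1.96 is what a t distribution with few
# degrees of freedom would produce.
implied_critical = ((ci_high - ci_low) / 2) / se

print(round(t_ratio, 2), round(ci_midpoint, 3), round(implied_critical, 2))
```

The small gap between the recomputed ratio and the reported t value is what one would expect if the published statistic was calculated from unrounded values.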
Devon
Devon's intervention data were above baseline levels (0) and had an upward trend across Sessions 1 to 4. After his performance plateaued in Sessions 5 and 6, we determined errors resulted from skill deficits and that he needed a more explicit task analysis that broke steps down further. There was an immediate increase in accuracy (jump from 6 to 11) when he was given the more explicit task analysis (shown in Figure 1 by a star). The eighth session followed a 2-week school break, which may have been the reason for the dip (from 11 to 3). After five additional sessions of ascending but variable performance, we determined the variability to be a result of a performance deficit. To increase motivation, Devon was taught a self-management routine that included self-monitoring accuracy (rather than just completion), goal setting, and self-reinforcement for meeting goals (5 min of preferred activity). This is indicated by a hexagon in Figure 1. He reached mastery criteria (two sessions above 80% correct) on the 16th session.
David
David's data show an immediate jump and a clear level change (from 0 to 9) from baseline to intervention. Data remained stable through the 2-week school break (range 9–12), providing some evidence of maintenance. David met mastery criteria after six sessions. No changes were needed to address skill or performance deficits.
Wes
There was 100% overlap between data in baseline and the first two intervention sessions. On the basis of Wes’s responding during model and guided-practice problems, we suspected a performance deficit related to motivation and began the same self-management routine that Devon was following, marked by a hexagon in Figure 1. Although there was a slight level increase once he began following the self-management routine (0 to 1–2), there was a countertherapeutic trend between the third and sixth intervention sessions (2 to 0). The research team determined Wes was not receiving adequate mastery-oriented feedback and would benefit from response prompting at the step level rather than after completion of the whole task. System of least prompts was chosen because it has demonstrated effectiveness in prior MSBI studies (Root, Ingelin, et al., 2021). Data were collected using the same procedure as these prior studies; only independent correct behaviors are graphed. A second phase line in Figure 1 indicates this change. Wes's performance remained variable (average 2, range 0–5) with a minimal level change following the use of response prompting in the final five intervention sessions.
Trevon
The intervention had an immediate effect for Trevon (jump in level from 0 to 4). However, the trend remained stable (3–4) for the first three sessions. The interventionists observed that he did not persist during independent practice the way he did in guided practice, but it was unclear as to whether this was a skill deficit (i.e., unsure of how to solve the problem) or performance deficit (i.e., not motivated to go through the complete behavior chain). Researchers agreed that using a response-prompting procedure (system of least prompts) was a solution that would provide more intensive instruction to address skill deficit as well as increased frequency of behavior-specific praise to address performance deficit. Trevon's performance immediately increased in level (4 to 9) and trend until he met mastery criteria after the sixth intervention session.
Self-Corrections
Participants had the opportunity to self-correct Behaviors 3 to 6 only if they were incorrect following independent practice. Therefore, participants had up to eight opportunities for self-correction each session. Figure 3 is a graphical display of self-corrections. During the first several sessions, Devon inconsistently used the point-of-view video model to self-correct his responses. He accurately self-corrected on 10 out of 63 opportunities. David had fewer opportunities for self-correction. Over his six intervention sessions, he self-corrected the final mathematical behavior (writing monetary value of total cost on receipt) on 3/4 opportunities. In contrast, Wes had the most opportunities for self-correction across his first five sessions but was never successful. This contributed to the team's decision to use a response-prompting procedure (system of least prompts). Trevon did not self-correct for any of the 22 opportunities in the first three sessions before response prompting began.
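The counts above can be restated as rates with a short sketch. The helper name and data structure are illustrative assumptions; only the counts come from the text, and Wes is omitted because his number of opportunities is not reported.

```python
# Behaviors 3 to 6 were eligible for self-correction on each of two problems,
# which gives the ceiling of eight opportunities per session.
ELIGIBLE_BEHAVIORS = range(3, 7)
PROBLEMS_PER_SESSION = 2
max_opportunities = len(ELIGIBLE_BEHAVIORS) * PROBLEMS_PER_SESSION


def self_correction_rate(corrected, opportunities):
    """Percentage of opportunities that ended in an accurate self-correction."""
    if opportunities == 0:
        return None  # no errors to correct, so no rate is defined
    return 100 * corrected / opportunities


# (accurate self-corrections, opportunities) as reported in the text.
reported = {"Devon": (10, 63), "David": (3, 4), "Trevon": (0, 22)}

for name, (corrected, opportunities) in reported.items():
    print(name, round(self_correction_rate(corrected, opportunities), 1))
```

Opportunities arise only when a step is incorrect after independent practice, so a participant with few opportunities (such as David) was mostly responding correctly on the first attempt.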

Self-corrections. The solid black portion of each bar indicates how many steps the student independently completed. Steps the student self-corrected after watching the video model are shown in horizontal stripes. Solid gray portions of the bars indicate the number of steps that remained incorrect. Wes and Trevon have fewer sessions represented, as a system of least prompts was introduced after the fifth and third sessions (respectively).
Generalization
Initially, participants needed support to carry out every aspect of purchasing a snack and a drink (e.g., making a choice, telling the employee, paying). Every participant took their food and drink and turned to go back to their table before the employee had an opportunity to give them their change or a receipt. As trips continued, participants became more independent or initiated aspects of making a purchase (e.g., telling cashier their order, taking change).
All participants increased their independence and demonstrated some generalization over the course of the study, though we cannot make a causal inference that it was a result of the intervention and not practice effects. Everyone began to independently carry out social purchasing skills (e.g., ordering food, paying, waiting for change and a receipt) that, although not measured by the dependent variable, are necessary for independence. During the final generalization trip, David waited for a receipt and told the interventionist that the receipt was correct. During that same trip, Trevon used the mock receipt to calculate a tip.
Interobserver Agreement and Fidelity
A member of the research team took data for interobserver agreement (IOA) and procedural fidelity (PF) during baseline and intervention sessions either via live observations or by watching video recordings of sessions. Point-by-point agreement (Ledford & Gast, 2018) was used to monitor adherence of data collection to the coding manual and to check for observer bias and drift. When IOA or PF fell below 90%, disagreement discussions were held between the interventionists to ensure alignment to procedures and coding.
Both IOA and PF met WWC (2022) quality standards (above 80%) across all participants and conditions. In baseline, 66% of Devon's sessions were observed for both IOA and PF, with 100% agreement and an average of 94% fidelity (range 88%–100%). In intervention, 31% of Devon's sessions were observed for IOA and PF. The average IOA was calculated at 97.6% (range 88%–100%), and PF was 100% across all sessions. David had 67% of his baseline and intervention sessions observed, and IOA and PF were both 100% on all observations. In baseline, Wes had 50% observed for IOA and PF, with all sessions reaching 100% agreement and 100% fidelity. Wes had 38% of intervention probes observed by a secondary coder, with 100% agreement for all sessions and 96.67% fidelity (range 90%–100%). IOA and PF were coded for 71% of Trevon's baseline sessions with 100% agreement and fidelity. During intervention, Trevon had 60% of sessions observed with 100% agreement and fidelity.
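Point-by-point agreement is conventionally computed as the number of agreements divided by the total number of scored items, multiplied by 100. The sketch below shows that calculation under stated assumptions: the function name and the sample scoring are hypothetical, and the 12-item record simply mirrors the 12 behaviors scored per session.

```python
def point_by_point_agreement(primary, secondary):
    """Percentage of items on which two observers recorded the same score."""
    if len(primary) != len(secondary):
        raise ValueError("Both observers must score the same number of items.")
    if not primary:
        raise ValueError("At least one scored item is required.")
    agreements = sum(a == b for a, b in zip(primary, secondary))
    return 100 * agreements / len(primary)


# Hypothetical session: 1 = independent correct, 0 = incorrect, 12 behaviors.
primary_observer = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
second_observer = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0]

ioa = point_by_point_agreement(primary_observer, second_observer)
print(round(ioa, 2))

# The text describes holding a disagreement discussion below 90%.
needs_discussion = ioa < 90
print(needs_discussion)
```

With 12 scored behaviors, a single disagreement yields roughly 92% agreement and a second one drops the session to about 83%, so the 90% discussion threshold is reached at two disagreements.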
Social Validity
Researchers interviewed the four participants about their experience during the research using a printed multiple-choice survey that had visual supports. For example, students were asked, “What do you think you learned?” Other questions asked students to reflect on when they might use what they learned, for example, “Give an example of time that you would leave a tip for someone.” We also asked the students what made it easier or more difficult for them to solve the problems and if they would recommend that a friend participate in a similar research study.
Everyone said they would tell their friend to participate in the study. Three of the four participants also stated they learned something new during the study, including (a) how to multiply (Devon, David) and (b) percentage of change (Devon, Trevon, Wes). When asked for an example of when they might use what they learned about percentage of change, both David and Devon selected “in stores,” Trevon selected “to tip a person,” and Wes selected both “in stores” and “to tip a person.” Three of the participants felt it was important to give a waiter or waitress a tip (Trevon selected no). The checklist (David and Devon), the calculator (Devon, Trevon, and Wes), the graphic organizer (Devon), and practice opportunities (Trevon) were highlighted as things that helped them solve the problems. Rounding (Devon and Wes) and reading the word problems (Devon and Trevon) were things that made solving the problems difficult. All participants reported informally that they enjoyed going to the mall and buying food. The teacher indicated the trips to the mall were positive and beneficial experiences.
Discussion
Replication is well recognized as critical to accumulating empirical evidence and establishing evidence-based practices (Therrien et al., 2016). Coyne et al. (2016) point out that “special education researchers clearly value programs of research that produce cumulative and converging evidence about practices and interventions” (p. 244). Yet there is a relative scarcity of special education publications that are explicitly identified as replications (Makel et al., 2016). In this conceptual replication, we evaluated effects of an established evidence-based strategy for teaching word problem solving to students with ASD (Root et al., 2022) to further understand for whom and under what conditions MSBI is effective.
First, we discuss how recommendations from Coyne et al. (2016) and Ledford et al. (2023) were incorporated into our design, implementation, analysis, and reporting decisions. This is followed by a discussion of how our methodological decisions resulted in social, internal, and external validity limitations using recommendations provided by Ledford et al. (2023). Finally, we end by providing recommendations for researchers using single-case designs to conduct, interpret, and report on replications and implications for practice.
Recommendation: Specify Replication Components
We varied aspects of three out of six dimensions identified by Coyne et al. (2016), as shown in Table 2. These decisions were made a priori; our purpose was to evaluate the effectiveness of MSBI for secondary students who were still in high school as opposed to already in a transition program. Therefore, there were intentional differences in the school type and participants in terms of their age but not disability (ASD/ID). We maintained researchers as interventionists in this replication, as a teacher implementation would have been a consequential difference in implementation conditions. Although we cannot definitively expect replication of these results in a teacher-implemented intervention based on these findings, recent published studies have shown teacher implementation is feasible (e.g., Browder et al., 2018; Root et al., 2022; Root, Jimenez, et al., 2020).
Given the limitations in measurement of outcomes and generalization reported by Root et al. (2022), we incorporated their suggestions by altering these two dimensions. We revised participant expectations for the first behavior from simply providing a yes/no answer about accuracy of receipt to stating why a receipt was incorrect or correct. Second, we increased the frequency of generalization opportunities from once to biweekly and conducted them within a novel community location instead of a familiar school-based enterprise on their campus.
Recommendation: Design and Conduct Distal and Conceptual Replications
Conceptual replications can be considered “closely aligned” or “distal” (Coyne et al., 2016) based on the amount of overlap between the initial study and replication study. The current study should be considered a distal conceptual replication because we systematically investigated the effects of the intervention under intentionally different conditions. This was done by manipulating contextual and instructional variables of both theoretical and practical importance. The participant, setting, and outcome variables we intended to manipulate were determined a priori based on the limitations identified by Root et al. (2022), as shown in Table 2. In contrast, a closely aligned conceptual replication would have taken place under very similar conditions, such as with young adults in a transition program, for the purpose of ensuring initial effects were not due to error, bias, or chance (Coyne et al., 2016; Sidman et al., 1960).
Recommendation: Use Multiple Approaches to Interpret Findings
Both studies used visual analysis as the primary method for analyzing effects, with an effect size estimate (BC-SMD) used to estimate the magnitude of the differences between baseline and intervention (Ledford et al., 2023). Importantly, the change in expected response discussed earlier for the first social problem-solving behavior did not alter the dependent variable's index of opportunities to demonstrate 12 behaviors across two problems in each session. This allows for direct comparison of graphed displays of data across the two studies as well as effect size estimates.
Coyne et al. (2016) caution against relying solely on comparison of magnitude or direction of effects. The contextual factors surrounding single-case data, such as an individual's behavior characteristics or the social significance of outcomes in relation to dosage of intervention, impact the interpretation of single-case data by “pos[ing] a soft edge to approaching the design and analysis of a case” (Ninci, 2023, p. 20). Ledford et al. (2023) recommend that single-case researchers report their a priori expectations for changes between conditions and detail the extent to which results correspond, to account for contextual factors.
We expected all participants to have low but stable baselines based on both the findings of Root et al. (2022) and the consistent pattern of baseline performance across prior studies teaching secondary students with ASD/ID to solve percentage-of-change word problems using MSBI (see Table 1). This a priori expectation was met. We also expected all participants to have a change in level with ascending trend by the third intervention session. We did anticipate some variability and were prepared to respond to error patterns by analyzing discrete behavior-level data to determine whether it was a skill or performance deficit. Only three of the four participants had an immediate change in level with ascending trend after beginning intervention.
Three of the four participants in the current study demonstrated a clear and socially significant level change in personal-finance behaviors at three different points in time, indicating a functional (causal) relation between the intervention and targeted skills (Ledford et al., 2023).
Limitations From Methodological Decisions
We used Ledford et al.'s (2023) recommendations for improving internal, external, and social validity to examine limitations that are results of our methodological decisions. Regarding internal validity, we are unable to draw a conclusion about the effect of MSBI on generalization because we did not collect enough data to establish level, trend, and variability. This decision also hampers external validity. Additional concurrent data in the generalization context would have provided “the most compelling evidence of the presence or absence of generalized behavior change” (Ledford et al., 2023, p. 387). Relatedly, our methods for determining social validity did not guard against potential social desirability bias. Using naive researchers as primary data collectors could have decreased risk of bias in social validity as well as external and internal validity related to each dependent variable (Yoder & Crandall, 2019).
We plan to incorporate several of Ledford et al.'s (2023) suggestions in future studies, specifically, (a) using naive researchers, (b) collecting normative comparison data of peers who did not receive the intervention, (c) having unfamiliar researchers conduct participant interviews at multiple time points through the study to increase trustworthiness, and (d) preregistering. We did explicitly plan the study using a written protocol of our a priori decisions, maintained detailed meeting notes to document response-guided decisions, and transparently described the process of determining these decisions in this manuscript (Cook et al., 2021). However, preregistering the study would have not only increased transparency but also nudged the field's progress toward open science by providing a replicable and complete description of the research with maximum transparency (Appelbaum et al., 2018; Nosek et al., 2018).
Recommendations for Replications Using Single-Case Research Designs
Replications can be used to better understand the utility of evidence-based practices by examining them in terms of what works, for whom, and under what conditions. Researchers seeking these answers must attend to the dimensions of an intervention, the selection of and method for measuring outcomes, who did (and did not) receive the intervention, and the multifaceted aspects of the contexts in which it was shown to be both effective and ineffective (Ledford et al., 2023). Consideration of these distinct aspects of a single-case research design should impact how we determine the strength of a study through assessment of validity (internal, external, social), the stage of research, its purpose, and the degree of transparency in reporting.
To facilitate this nuanced approach, Ledford and colleagues (2023) identified seven types of evidence that can be accumulated for an identified practice. As applied to the evidence base for MSBI, the results of the current and original studies are relevant to four of these: (a) context-bound behavior change, (b) generalized behavior change, (c) generality across contexts, and (d) generality across participants. The remaining three types of evidence (cost-effectiveness, feasibility in typical contexts, and maintained effects) should be examined by researchers who are interested in further developing the evidence base for MSBI.
Recommendations for the Field
Participants included in this study are reflective of the student and teacher populations that were observed in this specific school (i.e., 50% of the student body was White, and 80% of the teachers were also White). Historically, autism-related research has disproportionately included White male participants (Steinbrenner et al., 2022). Replication research can provide additional information to guide practice and policy decisions by investigating specific contextual factors, such as students of color with disabilities who do and do not have access to teachers with whom they share racial identities. Although efforts were made to address potential bias in the selection process, future research should prioritize strategic recruitment to address the well-documented disparities in access to evidence-based practices for individuals with ASD (West et al., 2016).
Although these findings have limitations, they provide additional evidence that MSBI is an effective strategy to improve problem-solving skills for many students with autism. To promote equitable outcomes for youth with ASD/ID, teachers should implement MSBI as outlined in previous manuscripts (e.g., Spooner et al., 2017) while being responsive to the interests and needs of their students. Progress-monitoring data at the step level can provide objective, formative feedback, along with other sources of data, such as student communication and teacher observations.
Footnotes
Funding
Research reported in this publication was supported by the Organization for Autism Research, Florida State University Council on Research and Creativity and the Office of Special Education Programs under award number H325D190024. The content is solely the responsibility of the authors and does not necessarily represent the official views of the U.S. Department of Education.
