The Effectiveness of Volunteer Tutoring Programs for Elementary and Middle School Students: A Meta-Analysis

Abstract

This meta-analysis assesses the effectiveness of volunteer tutoring programs for improving the academic skills of students enrolled in public schools Grades K–8 in the United States and further investigates for whom and under what conditions tutoring can be effective. The authors found 21 studies (with 28 different study cohorts in those studies) reporting on randomized field trials to guide them in assessing the effectiveness of volunteer tutoring programs. Overall, the authors found volunteer tutoring has a positive effect on student achievement. With respect to particular subskills, students who work with volunteer tutors are likely to earn higher scores on assessments related to letters and words, oral fluency, and writing as compared to their peers who are not tutored.

Keywords

meta-analysis, tutoring, school volunteers, middle school programs, program evaluation

In many cultures, the oldest form of teaching was provided to children by in-home tutors or private instructors (Shanahan, 1998). Tutoring remains a popular form of instruction worldwide, and the effectiveness of tutoring as a pedagogical method has been documented extensively in various strands of the educational literature (e.g., Cohen, Kulik, & Kulik, 1982; Fashola, 2001; Wasik & Slavin, 1993). During the 1970s, U.S. schools began relying more on peer tutoring (also known as student-to-student or cross-age tutoring) as a way to efficiently use scarce resources in a period of teacher shortages (Rekrut, 1994).

The 1980s and 1990s also witnessed an increased interest in tutoring programs staffed by adult volunteers for a variety of reasons: (a) increased public concern with the quality of education after the National Commission on Excellence in Education’s release of A Nation at Risk in 1983, (b) rising interest in community service, and (c) the encouraging results from effective yet costly programs that employ professional tutors. By 1987, the National Research Council estimated that there were more than 1 million volunteer tutors who donated an average of 4 hours per week in the nation’s public schools. The survey found that three fourths of public elementary schools in the United States reported the involvement of volunteers, with schools having an average of 24 volunteers (Michael, 1990).

More recently, the growth of university partnerships has accelerated because of the America Reads Challenge, a nationwide tutoring initiative launched in 1997 by President Clinton. In 1997, nearly 800 universities and colleges throughout the nation had already pledged to commit work-study slots for college students to serve as tutors for elementary school children (White House, Office of the Press Secretary, 1997). By 1999, nearly 1,200 colleges and universities committed to placing work-study students as tutors in public schools. President Clinton’s 1999 budget proposal also included $140 million to establish programs matching university-based mentors with students in schools that had high dropout rates and high concentrations of poor students. As a result of the America Reads Challenge, state leaders became increasingly interested in providing tutoring programs for elementary school children, and numerous local tutoring initiatives are now receiving increased support (U.S. Department of Education, 1996, 1997).

Despite the increased interest in and support for tutoring programs in the past few decades, the expansion of programs that use nonprofessional adult volunteer tutors has yet to be matched by a supporting research base. The lack of evidence for adult volunteer tutors is due, at least in part, to the fact that most tutoring research in the United States in the 1970s and 1980s was focused on the impacts of peer or cross-age tutoring (Shanahan, 1998). In 1982, Cohen, Kulik, and Kulik published a well-known and often-cited review of peer tutoring research in the American Educational Research Journal. The authors noted that hundreds of reports on tutoring had been written by teachers and researchers, some based on scientifically sound, experimental design evaluations and others more informal and subjective. The authors also found four major reviews of research on tutoring, including Devin-Sheehan, Feldman, and Allen (1976); Ellson (1976); Fitz-Gibbon (1977); and Rosenshine and Furst (1969). According to the authors, these reviews of the research used “relatively informal narrative and box score techniques” (Cohen, Kulik, & Kulik, 1982, p. 234), and each concluded that peer tutoring can help improve the academic success of young students.

During the late 1980s and 1990s, researchers began focusing more on specialized interventions aimed at improving the academic achievement of the lowest-achieving children, most notably, Reading Recovery and Success for All (Shanahan, 1998; Wasik & Slavin, 1993). Both programs include one-on-one tutoring by professional tutors and are perceived as effective by many in the research community (Wasik, 1998). However, because the expense of employing professional tutors limits the number of children who can be served by these interventions, several programs have been created in recent years that use adult volunteers or paraprofessionals as tutors. Some of these programs are responses to the accountability measures established by the No Child Left Behind Act of 2001, which encourage education administrators to implement new programs aimed at increasing student performance. As administrators encourage the use of tutoring programs in their schools, they may be doing so without solid evidence of what types of impacts specific tutoring programs can realistically produce. To begin understanding the effects of tutoring programs, we examined the existing literature for reviews. Previous work has identified four major reviews of research on the impact of volunteer tutoring programs on student outcomes: Topping and Hill (1995), Wasik (1998), Shanahan (1998), and Elbaum, Vaughn, Hughes, and Moody (2000).

In 1995, Topping and Hill provided interesting background on the research related to volunteer tutoring in a chapter contributed to the book Students as Tutors and Mentors (Goodlad, 1995). The chapter presents a review of the evaluation research around the world related to the effectiveness of college students as tutors for schoolchildren. Wasik (1998) reviewed studies of 17 programs that used volunteer tutors to help improve students’ reading abilities. Although the evidence suggested that volunteers could help many children improve their reading skills, the results varied considerably across programs. Furthermore, only 2 of the 17 programs reviewed compared students’ achievement with that of a control group, which makes the results less certain. Wasik’s findings are consistent with those of the National Research Council’s 1998 report, “Preventing Reading Difficulties in Young Children,” which concluded that there is no evidence to confirm that volunteers can deal effectively with children who have serious reading problems. Nevertheless, it is likely that there are other evaluations not included in either of these nonsystematic reviews, which could have altered the findings.

In contrast, Shanahan’s (1998) review of the research on volunteer tutoring found that despite many limitations, these programs can be effective in improving student achievement. However, Shanahan offers little detail about the methodology used in his review, and as with the two previous reviews, the results of several important studies may well be excluded because of the unknown methodology.

Most recently, Elbaum et al. (2000) reported on a meta-analysis in the Journal of Educational Psychology focused on the effectiveness of one-to-one tutoring programs for improving reading ability in elementary students at risk for reading failure. The authors reviewed 29 studies involving 42 samples of students between 1975 and 1998 and found that trained tutors can help students improve in reading skills. This review is helpful because it provides much information regarding the effects of tutoring by instructor type, intervention focus, and duration. According to their findings, students tutored by college students made the largest gains, interventions focused on reading comprehension produced the largest gains, and more intensive programs have more powerful effects. Additionally, Elbaum et al. concluded that “college students and trained, reliable community volunteers were able to provide significant help to struggling readers” (p. 616), which is contradictory to the findings from the National Research Council.

The limited and often-conflicting evidence on the effectiveness of volunteer tutoring programs provides the basis for our study. First, previous reviews of research are outdated, in that the most recent review concluded the literature search in 1998. We believe an updated review is necessary to reevaluate the extant literature on volunteer tutoring, especially in light of the National Research Council’s (1998) conclusions. Second, the most rigorous previous review, conducted by Elbaum et al. (2000), was not focused on volunteer tutoring (28 of the 42 samples used teachers as tutors) and used reading as the only outcome measurement; we are curious about the evidence of volunteer tutoring effects on mathematics as well as reading.

Our study focuses on responding to some of the inconsistencies and remaining questions regarding volunteer tutoring. Specifically, we ask, Is there good evidence to encourage policy makers and school leaders to continue to pursue volunteer tutoring as a possible strategy for improving the academic skills of young students? How do differences in participants (e.g., age, gender) affect the effectiveness of the programs? And how do differences in tutoring programs (e.g., program structure, program focus, types of tutors) affect the results? For our review, volunteer tutoring is defined as academically focused instruction delivered by nonprofessionally trained adults, which does include college students but not teachers.

The objective of this systematic review was to gather, summarize, and integrate the empirical research on volunteer tutoring programs to help policy makers, educators, parents, and other stakeholders understand whether this type of intervention might be an effective tool for improving academic skills for elementary students.

Method

Types of Studies

Because of the large number of studies examining the role of tutors in schools and the previous reviews, only randomized field trials were included in the review. Quasiexperimental studies that employ treatment and control groups matched on pretests of key outcome variables were not included in this review. Pretest–posttest studies, or those in which a treatment group is compared to another treatment group, were not included. The strict study quality rules were employed to expedite, simplify, and strengthen the review. Specifically, including quasiexperimental and pretest–posttest studies could make the review too cumbersome to manage. Also, including such studies would require creating many more decision rules regarding the type of quasiexperimental studies that are acceptable. That is, we would need to explain how we assured baseline equivalence between comparison and treatment groups, which might also require contact with study authors. Finally, using only randomized field trials, we contend, strengthens the review because of the rigor and generalizability of such studies. Although no single one of these points may be enough to justify using only randomized trials, we find the combination of these arguments persuasive.

Beyond a consideration of study design rigor, studies published before 1985 were not included. Only English-language studies of programs conducted in the United States were considered because of the limited resources for this review. Furthermore, we excluded studies of programs that were especially designed to address the needs of students with limited English proficiency (LEP) because such specialized programs are likely to be fundamentally aimed at teaching language skills alongside an academic focus.

Types of Participants

Only studies of programs involving adult, nonprofessional tutors were included. Although these tutors were almost always referred to as “volunteers” in the literature, those programs that pay a small stipend to tutors (such as undergraduate tutors who are tutoring as part of a federal or state work-study program) were also included. Several of the programs in this area are those that train parents to tutor their own children; such programs are included in this review and are coded for separate examination in the subgroup analysis. In terms of the tutees, only studies of programs that serve students in Grades K–8 (elementary and middle school) were considered because, due to limited resources, this is the population typically served by volunteer tutoring programs (Elbaum et al., 2000). The focus on early grades may result from the idea that students who fail to obtain basic reading skills in early grades often remain behind their peers throughout their education, and these students are at risk for school failure (Farkas, 1995; Karweit & Wasik, 1992). Ritter (2000) also noted the focus of tutoring programs in the early grades. Additionally, elementary programs are likely to be fundamentally different than those provided to high school students, and given the importance and prevalence of early programs, we chose to focus our review on this population.

Types of Interventions

The interventions featured regular tutoring sessions with an academic focus for at least 1 month in duration. The duration restriction was included because of a belief that programs lasting only a few days were qualitatively different than longer programs with sustained exposure. With regard to intervention focus, we did include interventions with other components in addition to an academic focus (such as behavior modification); however, the evaluation had to focus on at least one academic outcome measure.

Types of Outcome Measures

The original intent of the review was to consider all outcome measures related to student achievement, including distal outcomes (ones the programs are actually intended to influence, such as school grades or standardized achievement measures) and proximal outcomes (intermediate measures that might be influenced by tutoring and then might lead to improved outcomes in the future, such as student attendance rates). However, the review yielded very few studies that analyzed school grades or attendance rates; rather, most studies focused on various standardized assessments of math and reading skills. As a result, our review focused on these outcomes.

Many of the included studies, particularly those with a reading focus, examined several outcomes, and it was the task of the reviewers to categorize those outcomes within our classifications of outcome measures. We organized the outcomes into six broad categories based on different concepts described within the retained articles. Listed below are the six classifications and the outcome measures that fit within each class.

Reading global. In this category, we included results from overall batteries on standardized reading achievement tests. The achievement tests in this classification include the Gates MacGinitie Reading Test, the Wide Range Achievement Test Reading section, the Comprehensive Test of Basic Skills Reading section, and the Reading Battery on the Stanford Achievement Test.

Reading letters and words. Many of the reading-focused studies examined multiple outcomes related to reading subskills that we define here as being in the “letters and words” category. The types of measures that are included in this class are those that focus on decoding of words and knowledge of words. The underlying logic of this classification scheme is that there are certain skills that are required before students can be expected to read well, and these skills are related to being able to read words and understand what they mean.

The outcome measures that we included within this class are generally subtests within the general reading standardized assessments (examples of standardized assessments with reading subtests include the Gates MacGinitie Reading Test, the Wide Range Achievement Test Reading section, the Comprehensive Test of Basic Skills, the Woodcock-Johnson Psycho-Educational Battery, and the Test of Word Reading Efficiency). Examples of decoding outcome measures include Word Identification, Word Attack, Letter Identification, Dynamic Indicators of Basic Early Literacy Skills (DIBELS) fluency assessments and Word Recognition and Vocabulary tests. Also included in this category are subtests focused on such topics as consonant sounds, short vowels, digraphs and combinations, sight words, and nonword decoding. Some of these measures are not standardized; for example, the Morris, Shaw, and Perney (1990) study uses a word recognition outcome that is not related to any standardized assessment.

Although there are numerous ways that these categories could be divided, the goal was to develop a reasonable number of categories that included similar types of outcome measures. In this category, the focus is on the particular subskills necessary for young students to become fluent readers.

Reading comprehension. In this category, we included results from comprehension subtests of standardized reading achievement tests. The comprehension subtests used in studies in this review are from the Gates MacGinitie Reading Test, the Wide Range Achievement Test, the Comprehensive Test of Basic Skills, the Stanford Achievement Test, and the Woodcock-Johnson Psycho-Educational Battery.

Reading oral fluency. Many of the reading-focused studies employed outcome measures examining the ability of students to quickly and accurately read passages out loud. Such outcomes typically required that students read a passage and rated the students on the basis of the number of words read correctly. The outcome measures in this class include the following: curriculum-based oral fluency, basal passages, observational survey of reading level, Analytic Reading Inventory (fluency), and others.

Writing. Six studies in this review employed outcome measures categorized as writing measures. The outcome measures in this class include the following: spelling, observational survey of writing, observational survey of dictation, the Spelling subtest of the Wide Range Achievement Test, the number of words written and spelled correctly in a writing sample, and a curriculum-based spelling measure.

Mathematics global. Five studies in this review employed outcome measures categorized as mathematics measures. The measures of math achievement in this class include the following: a researcher-developed multiplication test, the Math subtest of the Stanford Achievement Test, and the Orleans-Hanna Algebra Prognosis Test.

Search Strategy for Identification of Studies

For this review, titles of studies on volunteer tutoring programs were identified by several methods. First, we searched the C2-SPECTR and EbscoHost Research Database using the following databases: Academic Search Premier; Primary Search; Professional Development Collection; Middle Search Plus; Psychology and Behavioral Sciences Collection; PsycINFO; Sociological Collection; ERIC (Education Resources Information Center); and Proquest Digital Dissertations. The initial search included the following keywords in various combinations and truncations: volunteer or mentor or tutor* or tutorial programs and elementary or primary education or middle school students or junior high school students or early intervention and control or random or experiment or evaluation or program not peer.
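As a rough illustration of how the keyword list above might be combined into a single Boolean query, the sketch below groups the terms into three OR-blocks joined by AND, with a final NOT clause. The grouping, the quoting of multiword phrases, and the names TUTOR_TERMS, POPULATION_TERMS, DESIGN_TERMS, and build_query are assumptions made for illustration; the source states only that the keywords were used "in various combinations and truncations" and does not give the exact query syntax.

```python
# Hypothetical grouping of the reported keywords; the actual combinations
# used in each database are not given in the source.
TUTOR_TERMS = ["volunteer", "mentor", "tutor*", "tutorial programs"]
POPULATION_TERMS = [
    "elementary",
    "primary education",
    "middle school students",
    "junior high school students",
    "early intervention",
]
DESIGN_TERMS = ["control", "random", "experiment", "evaluation", "program"]


def quote(term):
    """Wrap multiword phrases in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def or_group(terms):
    """Join a list of terms into one parenthesized OR-block."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def build_query():
    """Combine the three OR-blocks with AND and exclude peer tutoring."""
    groups = [TUTOR_TERMS, POPULATION_TERMS, DESIGN_TERMS]
    return " AND ".join(or_group(g) for g in groups) + " NOT peer"


print(build_query())
```

Making the grouping explicit shows which terms are treated as alternatives and which are required together, something the flat keyword list leaves open.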

To find any articles that have yet to be updated in the electronic databases, we conducted online reviews of the table of contents of several major journals that are most relevant to our study, including Education Next, Education Policy Analysis Archives, Educational Evaluation and Policy Analysis, Reading Research Quarterly, and Review of Educational Research for the years 2003 through 2005. To ensure policy relevance, we included only studies conducted in the United States with native English speakers. That is, interventions employed in the educational systems of other countries would likely operate in policy environments different from that of the United States.

The resulting list of articles was augmented by other research studies referenced in the four widely cited reviews on volunteer tutoring discussed previously: Topping and Hill (1995), Wasik (1998), Shanahan (1998), and Elbaum et al. (2000). We also consulted with several sources to refine the search process, including an information specialist or librarian, a reading specialist, and the “Campbell Collaboration’s Information Retrieval Policy Brief” (Rothstein, Turner, & Lavenberg, 2004). Studies were retrieved primarily from the University of Arkansas Library System, Interlibrary Loan, University Microfilms, and the databases listed above. All study titles and inclusion decisions were documented and managed using Excel software to maintain accuracy and consistency among the reviews. When possible, Portable Document Format files of all articles were saved in a central network folder; hard copies of these and print-only articles were also kept on file.

The list of study titles generated in this process was then narrowed through a review of the studies’ abstracts by at least two reviewers. No outside reviewers were used in this study; the authors served as reviewers throughout the meta-analysis. After the abstracts were retrieved and reviewed, both reviewers reviewed the full text of all studies chosen. Studies that passed the initial full-text review were passed into the full-coding stage. After these studies were fully coded (each coding involved at least two reviewers), the final set of studies that met all inclusion criteria was then analyzed, and the results were synthesized. If the two reviewers arrived at different conclusions during the coding process, the coders reconciled whether to keep the article. The reconciliation process consisted of a meeting in which each coder explained the rationale for retaining or rejecting the article until agreement was reached. If agreement was not reached, the coders would default to retain the article. In the final stage, the lead reviewer settled any disagreements.

Selection of Trials

Initially, 1,437 studies were identified with the search terms in the search databases; some of these studies were duplicates from multiple searches. Of the total studies identified from the database searches, hand searches, and review of previous reviews, we retained 969 unique study abstracts to be reviewed. The abstracts were collected and two coders independently read each abstract to determine whether the study met the inclusion criteria. The intercoder reliability rating was 95.1%; that is, both coders reached the same conclusion to reject or retain the article for 95.1% of the abstracts. All studies on which the coders reached different conclusions were discussed between the two coders to reach a consensus. We eliminated abstracts based on the following guide:

The article reported a meta-analysis.

The study did not employ an experimental design (e.g., case study, narrative, brief analysis or report of fewer than two pages in length, or qualitative study).

The study was conducted prior to 1985.

The tutees resided outside the United States or were non-English speaking (i.e., English as a Second Language, LEP).

The tutors were peer or professional tutors (i.e., not volunteer adult tutors).

The students fell outside the parameters of Grades K–8.

The focus of the intervention did not have at least one academic outcome.

The intervention duration was less than 4 weeks or 1 month.

The student population was specialized (e.g., deaf, blind).

For a study to be eliminated, the abstract had to include information clearly indicating that the study met one of the exclusion criteria described above. If the article did not provide enough information for us to eliminate the article, then the article passed to the next round for further analysis.
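The abstract-screening rule described above can be sketched as a small decision function: an abstract is eliminated only when it clearly indicates that an exclusion criterion is met, and otherwise it passes to the full-text review. The criterion labels and the function name screen_abstract are hypothetical shorthand for the guide listed above, not identifiers from the original coding materials.

```python
# Hypothetical shorthand labels for the nine exclusion criteria in the guide.
EXCLUSION_CRITERIA = (
    "meta_analysis",
    "non_experimental_design",
    "conducted_before_1985",
    "outside_us_or_non_english_speaking",
    "peer_or_professional_tutors",
    "outside_grades_k8",
    "no_academic_outcome",
    "duration_under_one_month",
    "specialized_population",
)


def screen_abstract(flags):
    """Return the screening decision for one abstract.

    flags maps a criterion label to True (the abstract clearly shows the
    criterion is met), False (clearly not met), or None (abstract is silent).
    Missing or unclear information never causes elimination.
    """
    for criterion in EXCLUSION_CRITERIA:
        if flags.get(criterion) is True:
            return "eliminate"
    return "pass to full-text review"


print(screen_abstract({"conducted_before_1985": True}))
print(screen_abstract({"outside_grades_k8": None}))
```

The rule is deliberately conservative: uncertainty at the abstract stage defers the decision to the full-text review rather than discarding the study.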

A majority of the articles excluded during the first round were excluded for the following reasons: The studies did not employ an experimental design, the tutors were not volunteers, the tutees were not English speaking or resided outside the United States, or the tutees fell outside the parameters of Grades K–8. The initial screening stage narrowed the pool of studies considerably, from 969 studies to 233 studies.
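The intercoder reliability figure reported for the abstract screening can be read as simple percent agreement, that is, the share of abstracts on which both coders made the same retain-or-reject decision. The sketch below assumes that reading; the source does not state the formula, and the function name and the example decisions are hypothetical.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage of items on which two coders made the same decision."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must rate the same number of abstracts.")
    if not coder_a:
        raise ValueError("At least one rated abstract is required.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)


# Hypothetical decisions for four abstracts (not data from the review).
coder_a = ["retain", "reject", "reject", "retain"]
coder_b = ["retain", "reject", "retain", "retain"]
print(percent_agreement(coder_a, coder_b))  # 75.0
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected index such as Cohen's kappa would be an alternative, but nothing in the text indicates that one was computed.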

For the second step in the review, the full text of the remaining 233 articles was reviewed by two of the authors. The initial evaluation of the full text was focused on the Introduction and Method sections of the article. The purpose of this step was to achieve a fuller understanding of the article without having to fully code every retained article. First, we examined the introduction of the article to gain a fuller understanding of the intent of the authors of each study. Second, we examined the article to make sure it reported findings of a study rather than simply described a program. Third, we reviewed the Method section of each article to ensure the students were randomly assigned to a treatment and control group. Throughout this full-text review process, we eliminated articles on the basis of the general exclusion criteria described above. This stage also resulted in a substantial narrowing of the study pool, reducing the number of articles from 233 to 56.

Finally, we fully coded each of the 56 retained articles. In the end, 21 articles were included in our meta-analysis, although we had 28 study “cohorts” from those articles because 6 of the articles had multiple cohorts requiring separate coding. For example, the Cobb (2001) article reported results for each of three grades separately and did not provide any pooled data; thus we included the results of this study as three separate Cobb (2001) study cohorts.

A total of 35 articles were excluded during the final full-coding phase. Of the 35 excluded articles, 30 were eliminated for one or more of the following reasons: The intervention was not of sufficient duration, the study design was not a true randomized field trial, the program intervention was implemented outside the United States, there were no relevant academic outcomes, tutoring was not face-to-face, or the study used professional rather than volunteer tutors. The other 5 articles were eliminated because of quality concerns with the statistics computed in those studies; that is, they reported insufficient statistics for us to include in our meta-analysis.
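The counts reported across the screening stages can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in this section (969 unique abstracts, 233 full-text reviews, 56 fully coded articles, 21 included articles, and the 30 and 5 articles excluded at full coding); the variable and function names are illustrative.

```python
# Stage counts as reported in the text.
STAGES = [
    ("unique abstracts screened", 969),
    ("full-text review", 233),
    ("full coding", 56),
    ("included in meta-analysis", 21),
]

# Breakdown of the articles excluded during full coding, as reported.
EXCLUDED_AT_FULL_CODING = {"inclusion criteria not met": 30, "insufficient statistics": 5}


def attrition(stages):
    """Number of articles dropped between each pair of consecutive stages."""
    return [
        (earlier_name, later_name, earlier_n - later_n)
        for (earlier_name, earlier_n), (later_name, later_n) in zip(stages, stages[1:])
    ]


for earlier, later, dropped in attrition(STAGES):
    print(f"{earlier} -> {later}: {dropped} dropped")
```

The last difference (56 minus 21) equals the 35 articles excluded at full coding, which in turn matches the reported split of 30 and 5.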

With respect to study design, the research community commonly accepts that randomized designs are the strongest designs on which to base causal inferences. Initial exploration of the available literature on tutoring uncovered numerous randomized field trials. Choices in the review were based on the premise that we should use the most reliable evidence available. Had we found only a handful of randomized field trials, we would have then chosen to include high-quality quasiexperimental designs. However, because we identified nearly 30 study cohorts within more than 20 studies or reports based on randomized field trials, we made the decision to exclude quasiexperimental designs from the meta-analysis.

Assessment of Methodological Quality

The quality of each study (and its reporting) was assessed according to several characteristics, including (a) the transparency of the study, that is, the clarity with which the investigators reported the random assignment procedures; (b) the integrity of the random assignment design and whether investigators addressed violations of the design; (c) the existence of high levels of attrition of either tutees or tutors from the initially randomized sample; and (d) the existence of substantial problems with respect to treatment fidelity. The quality of each study was assessed by each of us during coding review sessions.

Of the 21 studies included in the meta-analysis, none had clear problems with student assignment to treatment and control conditions, and none had any evidence of problematic attrition. Questions with fidelity were raised in a few of the included studies, and these concerns are noted in the appendix, which summarizes the details of each of the included studies.

A final study quality problem was related to reporting of study statistics. Six of the excluded studies were excluded, at least in part, because of the failure to report adequate descriptive statistics from which we could compute standardized mean difference effect sizes. Two of these studies in particular (Compton, 1992; Meier & Invernizzi, 2001) did report inferential statistics that would have allowed for the computation of the necessary descriptive statistics; however, the values presented were not consistent. Compton’s (1992) dissertation had a total sample size of 483 but reported standard deviations that did not match either the reported standard errors or the reported t value, and that would have yielded an effect size of nearly two standard deviations. The Meier and Invernizzi (2001) article reported means but no standard deviations, and the F values reported had degrees of freedom inconsistent with sample sizes in the study.

Data Management and Extraction

Of the 233 articles retained for the initial full-text review, we collected 109 articles from online sources; 65 articles were available in the university library on microfiche, microfilm, or in bound periodicals; 58 articles (including dissertations) were requested by interlibrary loan; and one dissertation was uncollected when the library determined it had exhausted all possible sources to locate a circulating copy. After collecting an electronic or paper copy of the 233 articles, we conducted the initial review, which reduced our number of retained articles from 233 to 56. We extracted the information from each article using a coding form we created that was based on our previous systematic review work. All study coding and data management were done using Excel software.

Data Synthesis

We used Hedges’ unbiased estimate g of the standardized mean difference effect size statistic (the difference between the treatment and control group means on an outcome variable divided by the pooled standard deviation for the posttest measure) for each outcome measure. When means and standard deviations were not reported, we estimated the effect sizes using the procedures recommended by Lipsey and Wilson (2001). When both pretest and posttest measures were available, we subtracted pretest group differences, which in most cases were minimal because of the requirement of random assignment. If a study reported only adjusted posttest means or posttest means, we computed the treatment-control difference. In either case, we used the pooled standard deviation of the posttest scores as our denominator in computing the effect sizes d. To get unbiased estimates of the population effect size, we multiplied d by the small-sample correction factor approximated by Hedges and Olkin (1985).
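The effect size computation described above can be sketched as follows. The sketch assumes the commonly cited form of the Hedges and Olkin small-sample approximation, 1 − 3/(4N − 9) with N the combined sample size, applied as a multiplier to d; the function and parameter names are illustrative, and the example values are hypothetical rather than drawn from any included study.

```python
import math


def pooled_sd(sd_t, n_t, sd_c, n_c):
    """Pooled posttest standard deviation of the treatment and control groups."""
    return math.sqrt(
        ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    )


def hedges_g(post_t, post_c, sd_t, sd_c, n_t, n_c, pre_t=None, pre_c=None):
    """Standardized mean difference with the small-sample correction.

    If pretest means are supplied, the pretest group difference is subtracted
    from the posttest difference; the denominator is always the pooled
    posttest standard deviation.
    """
    diff = post_t - post_c
    if pre_t is not None and pre_c is not None:
        diff -= pre_t - pre_c
    d = diff / pooled_sd(sd_t, n_t, sd_c, n_c)
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)  # assumed approximation
    return d * correction


# Hypothetical example: 20 students per group, posttest SD of 10 in each.
print(hedges_g(105, 100, 10, 10, 20, 20))
```

Because the correction factor is slightly below 1, g is always a little smaller in magnitude than d, and the difference shrinks as the combined sample size grows.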

Of the 28 study cohorts analyzed here, 12 were analyzed using posttest scores adjusted for pretest differences (equivalent to gain scores); 3 were analyzed using posttest scores statistically adjusted for pretest differences (using ANCOVA); 12 were analyzed on the basis of posttest scores only, as no pretest scores were provided; and 1 was analyzed based on a mix of posttest-only scores and adjusted posttest scores. The one common denominator was, in fact, the denominator of the effect size calculation. In each case, the denominator of the effect size statistic was the pooled posttest standard deviation.

Some of the studies employed a variety of outcome measures to assess program effectiveness. Because math outcomes are qualitatively different from verbal outcomes, we did not combine math and reading measures into a single effect size for each study, nor did we calculate an overall effect across all outcomes for all available studies. However, we did calculate the overall effect on reading measures. To accomplish this, we computed an overall reading effect size for each of the 25 study cohorts for which some type of reading outcome was assessed. Next, to determine whether an intervention had a greater effect in any one area, we conducted separate meta-analyses of key outcome areas, including standardized overall reading, standardized letters and word skills, standardized reading comprehension, measures of oral reading fluency and writing, and standardized math performance. If a study measured a key outcome in several ways, we averaged the effect sizes of the measures to ensure that each study contributed only one data point to the analysis for each key outcome and that no study was unduly “weighted” in the meta-analysis.
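
The averaging rule in the preceding paragraph, under which each study contributes one data point per key outcome, can be sketched as below. The record layout, the function name, and the effect sizes are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import defaultdict
from statistics import mean

def one_effect_per_study(records):
    """Average effect sizes within each (study, domain) pair.

    records is an iterable of (study, domain, effect_size) tuples; the
    result maps each (study, domain) pair to a single averaged effect size.
    """
    grouped = defaultdict(list)
    for study, domain, g in records:
        grouped[(study, domain)].append(g)
    return {key: mean(values) for key, values in grouped.items()}

# Hypothetical records: Study A measured letters and words in two ways.
records = [
    ("Study A", "letters and words", 0.20),
    ("Study A", "letters and words", 0.40),
    ("Study A", "oral fluency", 0.10),
    ("Study B", "letters and words", 0.50),
]
per_study = one_effect_per_study(records)
```

Study A ends up with one letters-and-words value (the mean of its two measures) instead of two separate entries in that domain's analysis.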

Homogeneity Analysis

The homogeneity test determines whether variation in the effect sizes can be attributed to sampling error alone or reflects other factors. The decision to use a fixed-effects model or a random-effects model is based on this test. The analyses of the overall effects and of the six key outcomes revealed Q statistics that were not large enough to allow us to reject the null hypothesis of homogeneity. That is, the variability across effect sizes did not exceed what would be expected on the basis of sampling error (Lipsey & Wilson, 2001). Therefore, we employed a fixed-effects model for data synthesis in our study.
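
As a rough illustration of the quantities involved, the sketch below computes an inverse-variance fixed-effects mean, its 95% confidence interval, and the Q statistic. The function name and inputs are hypothetical; in practice Q is compared with a chi-square distribution on k - 1 degrees of freedom, and that lookup is omitted here because it requires a statistics package.

```python
import math

def fixed_effect_summary(effects, variances):
    """Inverse-variance fixed-effects pooling with the Q homogeneity statistic."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    pooled = sum(w * g for w, g in zip(weights, effects)) / total_w
    se = math.sqrt(1.0 / total_w)
    # Q sums the weighted squared deviations of each study from the pooled mean.
    q = sum(w * (g - pooled) ** 2 for w, g in zip(weights, effects))
    return {
        "pooled": pooled,
        "se": se,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "q": q,
        "df": len(effects) - 1,
    }

# Hypothetical effect sizes and variances for two small studies.
summary = fixed_effect_summary([0.2, 0.4], [0.04, 0.04])
```

Under the null hypothesis of homogeneity the expected value of Q equals its degrees of freedom, so a Q near or below k - 1 gives no reason to prefer a random-effects model.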

Sensitivity Analysis

Using the Comprehensive Meta-Analysis software (Borenstein, Hedges, Higgins, & Rothstein, 2005), we tested the extent to which our main results were sensitive to any one study’s inclusion in the meta-analysis. The “one study removed” analysis presents the average standardized mean difference of all remaining studies after each study, in turn, is removed from the analysis. In the end, the sensitivity analysis revealed that one study, with a very large sample, had a disproportionate impact on the meta-analytic outcomes. Consequently, all results based on the reading outcomes reported here exclude the Ritter (2000) study from the sample (according to the sensitivity analysis, the Ritter, 2000, study did not have a disproportionate impact on the mathematics global outcome; thus, the Ritter study was retained in the sample for the meta-analysis of the mathematics outcomes).
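
The logic of a “one study removed” analysis can be sketched as follows. This is an illustrative reimplementation under a fixed-effects model, not the internals of the software named above; the labels and numbers are hypothetical and were chosen so that one large study with a near-zero effect dominates the pooled mean.

```python
def pooled_fixed(effects, variances):
    """Inverse-variance weighted mean effect size."""
    weights = [1.0 / v for v in variances]
    return sum(w * g for w, g in zip(weights, effects)) / sum(weights)

def one_study_removed(labels, effects, variances):
    """Pooled mean of the remaining studies after dropping each study in turn."""
    results = {}
    for i, label in enumerate(labels):
        rest_g = effects[:i] + effects[i + 1:]
        rest_v = variances[:i] + variances[i + 1:]
        results[label] = pooled_fixed(rest_g, rest_v)
    return results

# Hypothetical data: two small studies and one very large one.
labels = ["Small A", "Small B", "Large C"]
effects = [0.4, 0.3, 0.0]
variances = [0.1, 0.1, 0.01]

overall = pooled_fixed(effects, variances)
leave_one_out = one_study_removed(labels, effects, variances)
```

In this made-up example the pooled mean moves far more when the large study is dropped than when either small study is dropped, which is the pattern a sensitivity analysis is designed to surface.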

Subgroup Analysis

Subgroup analyses were conducted to compute differential mean effect sizes based on various program characteristics, including (a) types of tutors (parent, college student, community member), (b) age of tutees (Grade 1 or above), (c) highly structured versus unstructured programs, and (d) publication source. That is, we examined whether published studies were more likely to show positive program effects. These subgroup analyses were conducted only on the four reading outcomes reviewed here, as the writing and math outcome domains only had six and five studies, respectively.
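
The text does not state which test was used to compare subgroups. One common fixed-effects approach, shown here purely as an assumption, pools the studies within each subgroup and then computes a between-groups Q statistic, which is referred to a chi-square distribution with (number of subgroups - 1) degrees of freedom. All names and numbers below are hypothetical.

```python
def subgroup_q_between(groups):
    """Fixed-effects subgroup means and the between-groups Q statistic.

    groups maps a subgroup name to a list of (effect_size, variance) pairs.
    Returns (summaries, q_between, df), where summaries maps each subgroup
    name to (pooled mean, sum of inverse-variance weights).
    """
    summaries = {}
    for name, studies in groups.items():
        weights = [1.0 / v for _, v in studies]
        w_sum = sum(weights)
        mean_g = sum(w * g for w, (g, _) in zip(weights, studies)) / w_sum
        summaries[name] = (mean_g, w_sum)
    grand_w = sum(w for _, w in summaries.values())
    grand_mean = sum(m * w for m, w in summaries.values()) / grand_w
    q_between = sum(w * (m - grand_mean) ** 2 for m, w in summaries.values())
    return summaries, q_between, len(groups) - 1

# Hypothetical high-structure and low-structure subgroups.
groups = {
    "high structure": [(0.6, 0.05), (0.5, 0.05)],
    "low structure": [(0.1, 0.05), (0.2, 0.05)],
}
summaries, q_between, df = subgroup_q_between(groups)
```

With so few studies per subgroup the between-groups test has little power, so a nonsignificant result should not be read as evidence that the subgroups are equally effective.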

Publication Bias

We assessed publication bias with the “trim-and-fill” procedure, in which the funnel plot is inspected for asymmetry. Additionally, we examined the possibility of publication bias by using publication type as a moderator variable.

Results

Description of Included Studies

In the end, the search yielded 21 unique articles, reports, or dissertations, which included 28 unique “study cohorts” or “studies.” The evidence base described here relies on a sample of 1,676 study participants, 873 of whom were in the tutoring treatment groups and 803 of whom were in the control groups.

Outcome measures. The volunteer tutoring programs reviewed here employed a variety of outcome measures to assess program effectiveness. After considering the measures in each of the included studies, we chose to include seven categories of outcome measures in the meta-analysis. The first category is composite reading. This category is based on the 25 studies (total sample of 1,462 students) that assessed some type of reading measure. However, the sensitivity analysis revealed an outlier study, which we removed. Thus, the sample for analysis included 24 studies and a total sample of 1,077 students. After examining composite program effects, we then turned to analyses of the following six specific academic domains:

Reading global: Evidence based on 14 studies with a total sample size of 819.

Reading letters and words: Evidence based on 15 studies with a total sample size of 798.

Reading comprehension: Evidence based on 8 studies with a total sample size of 546.

Reading oral fluency: Evidence based on 12 studies with a total sample size of 635.

Writing: Evidence based on 6 studies with a total sample size of 228.

Mathematics global: Evidence based on 5 studies with a total sample size of 643.

There is overlap among the studies and samples described above. That is, many of the same studies with reading global outcomes, for example, also assessed outcomes in the reading oral fluency domain.

Types of tutors. Volunteer tutoring programs generally draw on a variety of sources for tutors. However, there are a few distinctive program types that we separate for analysis here. Some programs train parents as tutors to help their own children. These programs are different from those that train college-age tutors to work with younger students; often these college-age students are in the America Reads program or are preservice teachers. Finally, the remainder of the programs reviewed here used community volunteers from a variety of ages, ranging from older high school students to senior citizens. Programs that used a combination of these tutors are placed in the community volunteer category. In our sample of 28 study cohorts, 5 were from programs using primarily parents (study sample = 338), 12 were from programs using primarily college-age tutors (study sample = 899), and the remaining 11 were from programs using community volunteers across a variety of ages (study sample = 439).

Age and grade level of tutees. One might believe that volunteer tutoring programs work better or worse for older or younger students. Consequently, we divided up our sample of tutoring programs into those that served the youngest students (Grade 1) and those that served older students (Grades 2 and above). In our sample of 28 study cohorts, 14 were focused on students in first grade (study sample = 770) and the remaining 14 were focused on older students (study sample = 906).

Program focus on reading. Most of the programs included a specific focus on reading skills; such programs might reasonably be expected to have stronger effects on reading scores than programs without such a focus. Consequently, we divided up our sample of tutoring programs into those that focused on reading and those that had a more general academic focus. In our sample of 28 study cohorts, 23 were focused on reading (study sample = 1,033) and 5 were not (study sample = 643). We originally planned to use program focus as a moderator variable. However, because only 1 included study with reading outcomes did not have a reading focus (McKinney, 1995), we did not conduct subgroup analyses.

Program structure. Studies were classified as high or low structure depending on the amount of direction and instruction given to the tutors. If the program gave tutors specific lessons and materials to cover, the program was classified as high structure. If the tutees had freedom in selecting the reading materials, but the programs specified how much time in the tutoring session should be spent on each reading activity, the program was also categorized as high structure. Other programs, including some that deliberately were nondirective and provided minimal training to tutors or programs where tutors and tutees simply read together, were classified as low structure.

For example, the Howard Street Tutoring Program, as described by Morris et al. (1990), was classified as structured: “The 3:00–4:00 p.m. tutoring period is carefully planned and work filled, with very few disruptions. . . . A typical 1-hour tutoring lesson takes the following form. . . .” (pp. 136–137). In another case, Vadasy, Jenkins, Antil, Wayne, and O’Connor (1997a) described an early version of the Sound Partners program: “The intervention was a set of 100 after-school lessons, each 30 min long . . . to be used by tutors to teach phonological and early reading skills to first-grade students” (p. 31).

Alternatively, the Start Making a Reader Today (SMART) program was decidedly unstructured: “Tutors are provided with a broad framework to use during sessions, rather than specific techniques” (Baker, Gersten, & Keating, 2000, p. 497). The paired-reading approach and repeated-reading approaches, where students select their own materials to read, were also classified as nonstructured (e.g., Miller, 1994; Weiss, 1989). In our sample of 28 study cohorts, 15 were from programs classified as highly structured (study sample = 919), and 13 were not (study sample = 757).

Source of publication. In the field of systematic reviews, there is a real concern with “publication bias” or “file-drawer bias.” These terms refer to the concept that studies showing null effects are less likely to be submitted for publication and less likely to be accepted for publication, all else equal, if submitted. Thus, one might expect that studies published in journals would be more likely to show positive program effects as compared to those disseminated as unpublished reports, conference papers, or student dissertations. Consequently, we distinguished in our sample the studies of tutoring programs that were published in journals as a test of this “bias.” In our sample of 28 study cohorts, 15 were from studies in refereed journals (study sample = 772); the remaining 13 study cohorts were primarily from doctoral dissertations (study sample = 904).

Overall Effect Sizes Across Studies for All Outcomes

We began by examining the overall effects of volunteer tutoring on student reading outcome measures. Twenty-five studies assessed reading measures of one type or another. In 8 of these 25 studies, the overall reading score was based on a single reading measure; we computed an average effect size based on multiple reading outcomes for the remaining 17 studies that employed multiple reading outcome measures. Our analysis indicates that volunteer tutoring interventions of the type reviewed here have a significant positive effect on the verbal skills of participating students. Using a fixed-effects model, we found that the average effect of volunteer tutoring programs on reading outcomes for elementary students is 0.23.

We then conducted sensitivity analyses to examine whether the average effect of volunteer tutoring on reading was disproportionately influenced by the result of any single study. We found that the Ritter (2000) study had a sample that comprised approximately 25% of the total sample and was more than twice the size of the next-largest study sample. The one-study-removed feature allowed us to conduct the analysis without Ritter, and we found that the study had a substantial negative impact on the average effect size. Once this study was removed, the average effect size increased to 0.30. Moreover, the program evaluated in Ritter differed from the other programs in that it did not have a strong academic focus. Given the disproportionate influence of the Ritter study on the average effect size and the distinctive nature of the program evaluated in Ritter, we decided to exclude the study from the domain-specific and subgroup analyses that follow.

Figure 1 presents the forest plot and average reading effect size of 0.30 for the set of 24 studies that excludes the Ritter (2000) study. Using a 95% confidence interval for this effect size, we find that the average effect of volunteer tutoring on reading outcomes ranges from 0.18 to 0.42. The test for heterogeneity produced a Q value that was not statistically significant (Q = 17.29, p = .80); thus, all subsequent results are reported using a fixed-effects model.
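
As a rough consistency check on these figures, the standard error implied by the reported confidence interval can be recovered by hand. The values below are back-calculated from the rounded numbers in the text and are therefore approximate.

```latex
\[
SE \approx \frac{0.42 - 0.18}{2 \times 1.96} = \frac{0.24}{3.92} \approx 0.061,
\qquad
z \approx \frac{0.30}{0.061} \approx 4.9
\]
```

A z value of this size is well beyond the conventional 1.96 threshold, in line with the significant overall effect reported. Likewise, with 24 studies the homogeneity test has 23 degrees of freedom, and the observed Q of 17.29 falls below that expected value, in line with the nonsignificant result.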

Next, we examined the effect of volunteer tutoring programs on the following specific academic domains (as described in the Method section above): reading global, reading letters and words, reading comprehension, reading oral fluency, writing, and mathematics global. Table 1 shows the number of studies, number of students, effect size, and confidence intervals for each of the domains.

Reading global. Fourteen studies assessed outcome measures within the reading global domain; these studies included 195 tutored students in the analysis. Eight of the studies have positive effect sizes, and six have negative effect sizes. However, only two of the most positive effect sizes are statistically different from zero. Overall, the average effect size for this outcome domain is +0.26, an effect that is statistically significant.

Reading letters and words. Fifteen studies assessed outcome measures within the letters and words outcome domain; these studies included 403 tutored students in the analysis. All but two of the studies have positive effect sizes. Although only three of these positive effect sizes are statistically different from zero on their own, when all the results are pooled, the effect size is +0.41, an effect that is statistically significant.

Reading comprehension. Eight studies assessed outcome measures within the reading comprehension domain; these studies included 293 tutored students in the analysis. Five of the studies have positive effect sizes, and three have negative effect sizes. Two of these positive effect sizes are statistically different from zero, and the overall effect size is +0.18, an effect that is not statistically significant.

Reading oral fluency. Twelve studies assessed outcome measures within the reading oral fluency domain; these studies included 336 tutored students in the analysis. Ten of the 12 studies have positive effect sizes. Although only two of these positive effect sizes are statistically different from zero on their own, the pooled effect size for this outcome is +0.30, an effect that is statistically significant.

Writing. Only six studies assessed outcome measures that the reviewers classified within the writing domain; these studies included 111 tutored students in the analysis. All six of these studies have positive effect sizes. Although only one of these positive effect sizes is statistically different from zero on its own, the pooled effect size for this outcome is +0.45, an effect that is statistically significant.

Mathematics global. Five studies assessed outcome measures that we classified within the mathematics domain; because these studies were dissertations with large sample sizes, these five studies included a total of 292 tutored students in the analysis. Three of these studies have positive effect sizes, and two have negative effect sizes. Two of the positive effect sizes are statistically different from zero. When these five studies are pooled together, the overall effect size for this outcome domain is +0.27, an effect that is not statistically significant.

Analysis of Impacts for Subgroups of Studies on Reading Outcomes

Next, we examined the possibility of differential effects of different types of volunteer tutoring programs on the reading outcomes. We focus here only on these “subgroup” effects in which there are at least three studies in each subgroup. Subgroups examined were previously described and include types of tutors, grade level of tutees, program structure, and publication type.

None of the outcomes had a significant difference in effect size by tutor type. That is, programs using parent, college-age, or community tutors did not differ significantly in their effectiveness. Similarly, programs that included Grade 1 were not significantly different from programs for higher grades in their effectiveness. The only significant subgroup difference we found was that highly structured programs had a significant advantage over programs with low structure on the global reading outcome, with an effect size of 0.59 for structured programs and 0.14 for unstructured programs. The other reading outcomes did not differ significantly by amount of program structure. We should note that there were only three studies classified as highly structured that used global reading outcomes, and all three studies had the same lead author (Vadasy et al., 1997a, 1997b; Vadasy, Jenkins, & Pool, 2000).

To assess the possibility of publication bias, we inspected the funnel plot and conducted the trim-and-fill procedure, which trims the studies that make the funnel plot asymmetric and imputes the small studies that may be missing. The 24 studies conform neatly and symmetrically to the shape of the funnel plot, suggesting that there is not a large effect of publication bias on our results. Accordingly, the trim-and-fill procedure found no studies to trim. Thus, the observed overall effect size is likely based on an unbiased set of studies.

As an additional test of the possibility of publication bias or file-drawer bias, we distinguished studies published from dissertations or other unpublished works. The overall trend revealed in this meta-analysis does not indicate publication bias. That is, in each of the reading domains examined here, the effect sizes from the studies published in journals were not significantly larger than those presented in dissertations and unpublished studies. Table 2 shows each domain’s respective effect size, confidence interval, and the number of studies by publication type.

Discussion and Conclusions

The objective of this systematic review was to gather, summarize, and integrate the empirical research on the effectiveness of volunteer tutoring programs. The current research base is strong—relative to that of other educational interventions—in that the review uncovered more than 20 randomized field trials. However, this good news is tempered by the fact that most of these studies employ small samples. Nineteen of the 28 study cohorts in this meta-analysis included 25 or fewer students in the tutoring group; only 3 of the study cohorts had full study samples (treatment plus control) of more than 100 students. In the end, the 28 study cohorts in the 21 articles reporting on randomized field trials included a full study sample of 1,676 students (873 tutored students and 803 control students). In cases where there are many studies, most with small samples with very little power to detect program effects, the strength of meta-analysis is that the results of the small studies are pooled and the statistical power is enhanced.

There were numerous academic outcomes assessed in the 21 articles that employed experimental designs to investigate the effectiveness of volunteer tutoring programs. First, we analyzed the overall effect of volunteer tutoring programs on all reading outcomes and found a positive and statistically significant effect of 0.30 standard deviations. We then grouped the specific outcomes into six domains. Two of these domains were very broad and employed standardized assessments of general skills in reading and math: reading global and math global. The other four outcomes were focused on specific subskills related to reading and language: letters and words, reading comprehension, reading oral fluency, and writing.

The central goal of this analysis was to examine whether a volunteer tutoring intervention represents a potentially effective strategy for improving academic skills for young students. The answer, according to the existing set of randomized field trials, is a qualified yes. Participation in a volunteer tutoring program results in improved overall reading measures of approximately one third of a standard deviation. With respect to particular subskills, students who work with volunteer tutors are likely to earn higher scores on assessments related to letters and words, oral fluency, and writing as compared to their peers who were not tutored. The effect sizes connected to these outcome domains were relatively consistent, ranging from 0.26 to 0.45.

A secondary goal was to assess whether certain programs are particularly effective. On this point, the review reveals that the programs are unique and the individual studies are based on small samples. Furthermore, the programs are small enough so as not to be replicable. Thus, the relevant question may not be “Which of these programs are most effective?” Rather, the more important question is “What are the characteristics of effective programs?”

To address this question, the reviewers computed differential mean effect sizes based on various program characteristics, including (a) types of tutors, (b) age of tutees, (c) level of structure in the programs, and (d) publication source. These subgroup analyses were conducted only on overall reading and on the four specific reading domains.

However, for the most part, effects were not significantly different across these subgroups. Nonetheless, we can derive some lessons from the characteristics of the programs analyzed. The majority of the studies reviewed here evaluated reading-focused programs delivered to primary-age students. Programs did not have to be highly structured to have positive effects, nor did they have to use a particular type of person as a tutor.

The results of this analysis lead to three clear conclusions. First, very little is known about the effectiveness of volunteer tutoring interventions at improving math outcomes. This is disappointing, given the important role that early numeracy skills play in later math achievement for students in elementary and middle school. Given this lack of information, it would be useful for educators to develop and implement volunteer tutoring programs focused on early math skills while researchers work collaboratively to evaluate the effectiveness of these programs. Both practitioners and researchers would benefit from knowing more about whether volunteer tutoring might be beneficial for students struggling with early math skills.

Second, the research base for volunteer tutoring, although based mostly on studies with small samples, is useful precisely because there are so many studies that employ experimental designs. This illustrates the power and utility of meta-analysis. Although many of the individual studies, standing alone, do not show significant program effects, the overall effect is relatively large and statistically significant in five of the seven outcome domains examined here. As a result, the evidence base in the field benefits from small, randomized field trials in which data are reported thoroughly and carefully.

Third, policy makers and educators should view this work as an important piece of evidence when deciding whether to employ volunteer tutoring as a strategy to improve academic skills for young students. As educators across the country work to meet adequate yearly progress goals in state accountability systems, and as they seek affordable ways to offer additional services to students at risk of not meeting annual academic goals, it would be worthwhile to consider structured, reading-focused volunteer tutoring programs as strategies to improve reading and language skills.

APPENDIX

Summary of Characteristics of Included Studies

	Study; type; program description	Tutees	Tutors	Time and duration	Reading focus?	Highly structured?	Outcome measures	Concerns/comments
1	Allor & McCathren (2004.1); journal; early literacy tutoring program (urban schools in the South)	Grade 1, N = 86 (T = 61, C = 25)	Education majors (unpaid) or America Reads members (stipend)	Three or four 15- to 20-min sessions per week for 6 months	Yes	Yes	Rd-letters and words: WJ-R (2), TOWRE (2), DIBELS (1)a Rd-comprehension: WJ-Rb Rd-oral fluencyc Pre-/posttest gains	Two cohort years considered separately; this is Cohort 1.
2	Allor & McCathren (2004.2); journal; early literacy tutoring program (urban schools in the South)	Grade 1, N = 157 (T = 76, C = 81)	Education majors (unpaid) or America Reads members (stipend)	Three or four 15- to 20-min sessions per week for 6 months	Yes	Yes	Rd-letters and words: WJ-R (2), TOWRE (2), DIBELS (2)d Rd-comprehension: WJ-R Rd-oral fluency Pre-/posttest gains	Two cohort years considered separately; this is Cohort 2.
3	Baker, Gersten, & Keating (2000); journal; SMART: Start Making a Reader Today (Oregon)	Grade 1 then Grade 2, N = 84 (T = 43, C = 41)	Community (unpaid)	Two 30-min sessions per week for 2 years	Yes	No	Rd-letters and words: WRMT Word Identification Rd-comprehension: WRMT (2)e Rd-oral fluency (2)f	Only the 2nd-year findings are included.
4	Cobb (2001.1); journal; play and phonological awareness activities by preservice teachers (midwestern city)	Grade 1, N = 18 (T = 9, C = 9)	Preservice teachers (unpaid)	Two 45-min sessions for 10 weeks	Yes	No	Adjusted posttest Rd-global: GRAT Rd-letters and words: GRAT Efficiency subscale (4)g Posttest t values	Three grade levels had different outcomes and are reported separately; this is Grade 1. No SDs reported, so effect sizes transformed from reported t values to Cohen’s d.
5	Cobb (2001.2); journal; play and phonological awareness activities by preservice teachers (midwestern city)	Grade 2, N = 20 (T = 12, C = 8)	Preservice teachers (unpaid)	Two 45-min sessions for 10 weeks	Yes	No	Rd-global: GRAT Rd-letters and words: GVOC Rd-comprehension: GCOMP Posttest t values	Three grade levels had different outcomes and are reported separately; this is Grade 2. No SDs reported, so effect sizes transformed from reported t values to Cohen’s d.
6	Cobb (2001.3); journal; play and phonological awareness activities by preservice teachers (midwestern city)	Grade 3, N = 18 (T = 9, C = 9)	Preservice teachers (unpaid)	Two 45-min sessions for 10 weeks	Yes	No	Rd-global: GRAT Rd-letters and words: GVOC Rd-comprehension: GCOMP Posttest t values	Three grade levels had different outcomes and are reported separately; this is Grade 3. No SDs reported, so effect sizes transformed from reported t values to Cohen’s d.
7	Cook (2001.1); dissertation; minimally trained tutors using America Reads materials (suburb of Phoenix, AZ)	Grade 1, N = 26 (T = 12, C = 14)	University students (some unpaid, some work study)	Two 45-min sessions per week for 7 months	Yes	No	Rd-global: Grade 1 WRAT Posttest	Problems with tutor attrition and undependable tutors. Reports separated by grade so that data were analyzed as three separate cohorts; this is Grade 1.
8	Cook (2001.2); dissertation; minimally trained tutors using America Reads materials (suburb of Phoenix, AZ)	Grade 2, N = 17 (T = 7, C = 10)	University students (some unpaid, some work study)	Two 45-min sessions per week for 7 months	Yes	No	Rd-global: Grade 2 WRAT Posttest	Problems with tutor attrition and undependable tutors. Reports separated by grade so that data were analyzed as three separate cohorts; this is Grade 2.
9	Cook (2001.3); dissertation; minimally trained tutors using America Reads materials (suburb of Phoenix, AZ)	Grade 3, N = 17 (T = 11, C = 6)	University students (some unpaid, some work study)	Two 45-min sessions per week for 7 months	Yes	No	Rd-global: Grade 3 WRAT Posttest	Problems with tutor attrition and undependable tutors. Reports separated by grade so that data were analyzed as three separate cohorts; this is Grade 3.
10	Erion (1994); dissertation; parent tutoring with flash cards and reading (rural northwestern Pennsylvania)	Grade 2, N = 24 (T = 12, C = 12)	Parents (unpaid)	Five 15-min sessions per week for 6 weeks	Yes	No	Rd-oral fluency Pre-/posttest gains
11	Mahoney (1986); dissertation; parent tutoring in mathematics (Lakewood, OH)	Grade 3, N = 150 (T = 75, C = 75)	Parents (unpaid)	Five 30-min sessions per week for 4 weeks	No	Yes	Mathematics 50-item multiplication test Posttest	Random assignment at the classroom level; test was not standardized.
12	Mayfield (2000); dissertation; Edmark Reading Program: structured tutoring (rural northern Louisiana)	Grade 1, N = 62 (T = 31, C = 31)	America Reads members	Five 15-min sessions per week for one semester	Yes	Yes	Rd-letters and words: WRMT (3)h Rd-comprehension: WRMT Pre-/posttest gains (has adjusted posttest also)	One decoding measure (word attack) reported wrong values for SD T, so effect size was computed with SD C rather than SD pooled. The authentic measure was too closely linked to the curriculum and was not included.
13	McKinney (1995); dissertation; Leap Frog Program: After-school tutorial program at a church (rural northeastern Mississippi)	Grades 1 and 2, N = 44 (T = 20, C = 24)	University students (unpaid)	Four 1-hr sessions per week for 22 weeks	No	No	Rd-global: Stanford 8 Mathematics: Stanford 8 Pre-/posttest gains	Outcomes were reported as percentiles.
14	Mehran & White (1988); journal; Reading Made Easy parent tutoring (small western city)	Grade 1, N = 76 (T = 38, C = 38)	Parents (unpaid)	Three 15-min sessions per week for school year	Yes	Yes	Rd-global: WJPEB, CTBS Rd-letters and words: WJPEB (2), CTBS (2), Harrison (4)i Rd-comprehension: WJPEB, CTBS Posttest	Treatment fidelity: Most parents reported less than four sessions a month. Pretest given for CTBS only, but not with the same subscores as posttest, so gains were not computed.
15	Miller (1994); dissertation; Paired Reading parent tutoring (midwestern district)	Grades 2–4, N = 52 (T = 26, C = 26)	Parents (unpaid)	Four 10- to 15-min sessions per week for 10 weeks	Yes	No	Rd-global: GORT-D Pre-/posttest gain scores	Treatment fidelity: Parents were to tape tutoring sessions, but few did.
16	Morris, Shaw, & Perney (1990.1); journal; Howard Street Tutoring Model: after-school program (Illinois)	Grades 2–3, N = 34 (T = 17, C = 17)	Community volunteers, various ages (unpaid)	Four 30-min sessions per week for entire school year	Yes	Yes	Rd-letters and words (4)j Rd-oral fluency: basal passages Gain scores only	The 1986–1987 and 1987–1988 cohorts are separate studies in the meta-analysis. There are some students from each grade in each cohort. Measures may not be standardized.
17	Morris et al. (1990.2); journal; Howard Street Tutoring Model: after-school program (Illinois)	Grades 2–3, N = 26 (T = 13, C = 13)	Community volunteers, various ages (unpaid)	Four 30-min sessions per week for entire school year	Yes	Yes	Rd-letters and words (5)k Rd-oral fluency: basal passages Gain scores only	The 1986–1987 and 1987–1988 cohorts are separate studies in the meta-analysis. There are some students from each grade in each cohort. Measures may not be standardized.
18	Nielson (1991); dissertation; parent and adult volunteer tutoring in reading (rural elementary school in Delta, UT)	Grade 3, N = 43 (T = 29, C = 14)	Parents, adult volunteers (unpaid)	Sessions per week not stated; program lasted 9 months	Yes	Yes	Rd-comprehension: Stanford 8 Posttest	Follow-up findings were not included. Parent and volunteer adults were combined into one treatment group for analysis.
19	Parham (1994.1); dissertation; before-school tutoring in prealgebra concepts (suburban school) with trained tutors	Grade 7, N = 32 (T = 16, C = 16)	Community volunteers (unpaid)	One 60-min session per week for 5 weeks	No	Yes	Mathematics: OHAPT Posttest	Only adult tutor findings were included. No mention of attrition, despite the 7 A.M. start time. A single control group was divided in half between the two Parham treatment groups.
20	Parham (1994.2); dissertation; before-school tutoring in prealgebra concepts (suburban school) with untrained tutors	Grade 7, N = 32 (T = 16, C = 16)	Community volunteers (unpaid)	One 60-min session per week for 5 weeks	No	No	Mathematics: OHAPT Posttest	Only adult tutor findings were included. No mention of attrition, despite the 7 A.M. start time. A single control group was divided in half between the two Parham treatment groups.
21	Powell-Smith, Shinn, Stoner, & Good (2000); journal; parents using children's literature or basal readers with guided practice and feedback and monitoring of treatment fidelity	Grade 2, N = 36 (T = 24, C = 12)	Parents (unpaid)	Four 20-min sessions per week for 5 weeks	Yes	Yes	Rd-oral fluency: CBM, TORF (2)[l]; Pre-/posttest gain scores	Two versions of tutoring were combined, one with basal readers and one with children's literature. Some treatment fidelity concerns.
22	Pullen, Lane, & Monaghan (2004); journal; repeated reading, coaching in decoding, and reading new books (north central Florida)	Grade 1, N = 49 (T = 25, C = 24)	University students (unpaid)	Forty 15-min sessions in 12 weeks	Yes	Yes	Rd-letters and words: Jump Start (3) and WDRB (2)[m]; Pre-/posttest gains; posttest
23	Rimm-Kaufman, Kagan, & Byers (1999); journal; comprehensive reading model emphasizing reading for meaning (Cambridge, MA)	Grade 1, N = 42 (T = 21, C = 21)	Community volunteers (unpaid)	Three 45-min sessions per week for 8 months	Yes	Yes	Rd-letters and words: observational survey (3)[n]; Rd-writing: observational survey (2)[o]; Rd-oral fluency: observational survey; Pre-/posttest gains
24	Ritter (2000); dissertation; West Philadelphia Tutoring Project: university-based partnership (Philadelphia, PA)	Grades 2–5, N = 385 (T = 196, C = 189)	College students (unpaid)	1 hr per week for entire school year	No	No	Rd-global: SAT-9; Mathematics: SAT-9; Posttest	Subgroup sample sizes for outcomes (after attrition) were not reported but were estimated from the overall T:C sample size ratio.
25	Vadasy, Jenkins, Antil, Wayne, & O'Connor (1997a); journal; Sound Partners: 100 scripted lessons on phonological awareness, word ID, text reading, writing (large urban district)	Grade 1, N = 40 (T = 20, C = 20)	Community volunteers (stipend)	Four 30-min sessions per week for 27 weeks	Yes	Yes	Rd-global: WRAT-R; Rd-letters and words (4): WJ-R, Dolch, Yopp-Singer[p]; Rd-writing: classroom writing[q]; Rd-oral fluency: Analytic Reading Inventory; Posttest	Some problem with tutor consistency.
26	Vadasy et al. (1997b); journal; 2nd year of Vadasy et al. (1997a) (large urban district)	Grade 1, N = 40 (T = 20, C = 20)	Community volunteers (stipend)	Four 30-min sessions per week for 27 weeks	Yes	Yes	Rd-global: WRAT-R; Rd-letters and words (7): WJ-R, Dolch, Bryant, Yopp-Singer[r]; Rd-writing: classroom writing; Rd-oral fluency: Analytic Reading Inventory; Adjusted posttest means	Some problem with tutor consistency. Did not include Lesson Word List outcome; it was too specific to the treatment.
27	Vadasy, Jenkins, & Pool (2000); journal; Sound Partners: 100 scripted lessons on phonological awareness, word ID, text reading, writing (large urban district)	Grade 1, N = 46 (T = 23, C = 23)	Community volunteers (stipend)	Four 30-min sessions per week for school year	Yes	Yes	Rd-global: WRAT-R; Rd-letters and words: WJ-R, Dolch, Bryant, Yopp-Singer; Rd-writing: classroom writing; Rd-oral fluency: Analytic Reading Inventory (2)[s]; Adjusted posttest means	Only immediate posttest results are used; 2nd grade follow-up results are not included.
28	Weiss (1989); unpublished report; Paired Reading (suburban district)	Grades 3–6, N = 20 (T = 11, C = 9)	Community (unpaid)	Four 20- to 30-min sessions per week for 11 weeks	Yes	No	Rd-global: BASIS; Rd-oral fluency: curriculum-based measure; Pre-/posttest gains

Note. If multiple cohorts were produced from the same study, the distinction is placed following the year. For example, Allor and McCathren (2004) is divided into Allor and McCathren (2004.1) and Allor and McCathren (2004.2). T = participants in the treatment condition; C = participants in the control condition; Rd = Reading; WJ-R = Woodcock Johnson–Revised; TOWRE = Test of Word Reading Efficiency; DIBELS = Dynamic Indicators of Basic Early Literacy Skills; WRMT = Woodcock Reading Mastery Test–Revised; GRAT = Gates-MacGinitie Reading Achievement Test; GVOC = GRAT Vocabulary section; GCOMP = GRAT Reading Comprehension section; WRAT = Wide Range Achievement Test; WJPEB = Woodcock-Johnson Psycho-Educational Battery; CTBS = Comprehensive Test of Basic Skills; GORT-D = Gray Oral Reading Test–Diagnostic; OHAPT = Orleans-Hanna Algebra Prognosis Test; TORF = Test of Reading Fluency; WDRB = Woodcock Diagnostic Reading Battery; BASIS = Basic Achievement Skills Individual Screener.

a. Two sections of the WJ-R (Word Identification and Word Attack), two sections of TOWRE (real word and nonword), and one section of DIBELS (Phoneme Segmentation Fluency).

b. Passage Comprehension section.

c. Included stories adapted from Monitoring Basic Skills Progress.

d. The additional measure used in the second cohort study was Nonsense Word Fluency.

e. Word Comprehension and Passage Comprehension sections.

f. Number of words Grade 1 and Grade 2 students read correctly in 1 min.

g. GRAT initial consonants and clusters, final consonants and clusters, vowels, and use of sentence context.

h. Word Identification, Letter Identification, and Word Attack.

i. Two subscales from the WJPEB (Letter Word Identification and Word Attack), two subscales from the CTBS (Word Analysis and Vocabulary), and four subscales from the Harrison Criterion Referenced Test (Producing Sounds, Consonant Sounds, Short Vowels, and Digraphs and Combinations).

j. Basal word recognition, untimed word recognition, and two spelling tests.

k. A timed flash-word recognition test.

l. A curriculum-based measure and the TORF.

m. Three Jump Start tests (Phonological Awareness, Sight Words, and Nonword Decoding) and two WDRB tests (Letter Word Identification and Word Attack).

n. Based on three observational surveys for letters, words, and concepts about print.

o. Based on two observational surveys for writing and dictation.

p. WJ-R Word Attack and Word Identification subtests, the Dolch Word Recognition Test, and the Yopp-Singer Segmentation Test.

q. Children were asked to write for 5 min in response to a prompt. The responses were then marked according to whether the words were written and spelled correctly.

r. Additional measures used in this study are the Bryant Pseudoword and Pseudoword List subtests and the Lesson Word List exam.

s. The Analytic Reading Inventory includes a subtest for the primary level and Grade 1 level.
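
Two of the table comments describe simple sample-size bookkeeping: the single Parham (1994) control group was divided in half between the two treatment comparisons, and the Ritter (2000) subgroup sizes were estimated from the overall T:C ratio. The following is a minimal Python sketch of that arithmetic, not the authors' actual procedure. The function names and the subgroup size of 300 are hypothetical and chosen only for illustration; the 32, 196, and 189 figures come from the table.

```python
from fractions import Fraction


def split_shared_control(n_control: int, n_arms: int) -> list[int]:
    """Divide one control group's n as evenly as possible across n_arms comparisons."""
    base, extra = divmod(n_control, n_arms)
    # Any remainder is handed out one participant at a time to the first arms.
    return [base + (1 if i < extra else 0) for i in range(n_arms)]


def allocate_by_ratio(n_subgroup: int, n_treat: int, n_ctrl: int) -> tuple[int, int]:
    """Estimate (T, C) counts for a subgroup from the overall T:C ratio."""
    share = Fraction(n_treat, n_treat + n_ctrl)  # exact treatment share
    t = round(n_subgroup * share)
    return t, n_subgroup - t


# Parham (1994): a control group of 32 shared by two treatment arms.
print(split_shared_control(32, 2))       # [16, 16]

# Ritter (2000): overall T = 196, C = 189; the full sample reproduces itself.
print(allocate_by_ratio(385, 196, 189))  # (196, 189)

# Hypothetical post-attrition subgroup of 300 students (assumed value).
print(allocate_by_ratio(300, 196, 189))
```

Splitting a shared control group keeps the same control students from being counted twice when both comparisons enter the meta-analysis. The ratio-based allocation assumes attrition was proportionally equal in the two conditions, which the source does not confirm.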


Acknowledgements

Part of this work was funded through a generous grant from the Campbell Collaboration, and a portion of the paper can be found in the Campbell Collaboration and Register of Interventions and Policy Evaluation (C2-RIPE).

1. Reading global is based on measures of overall reading ability, whereas the composite reading comparison is based on the effect of all the reading measures combined.

References
