Abstract
Evaluating academic support is frequently complicated by student selection bias. To provide an actionable framework for student retention, this study compares the causal impact of two uniformly trained peer tutoring methods offered concurrently within identical courses (n = 6,731). To isolate the true effect of the interventions, data were analyzed using an Average Treatment Effect in the Overlap Population (ATO) propensity score-weighted framework, a statistical technique that balances preexisting student characteristics to approximate a randomized trial. Analyzing dosages at one-, three-, and six-session thresholds revealed stark differences in outcomes. Small group tutoring failed to improve grades, with isolated sessions indicating a negative “desperation visit” effect. Conversely, Supplemental Instruction-adjacent tutoring significantly improved grades after just three sessions (p < .001), and at six sessions the effect remained highly robust to unmeasured confounding (E-value = 34.86). Furthermore, sustained participation in both methods neutralized severe academic deficits. Ultimately, this study provides vital quantitative evidence that structural design dictates tutoring efficacy, guiding institutional resource allocation.
Introduction
The primary purpose of this study is to rigorously evaluate the causal impact of two peer tutoring methods on student course grades to inform and optimize academic support programming. Recent industry reports indicate that universities in the United States spend an average of $2,933 per student annually on academic support services (Mowreader, 2025). In Spring 2025, there were approximately 18.4 million students enrolled in higher education in the United States (National Student Clearinghouse, 2025). This amounts to approximately $54 billion in academic support expenses for U.S. institutions of higher education per year. This massive financial investment is largely driven by the understanding that early academic success in high-risk courses is a primary predictor of student persistence and institutional retention (Balzer Carr & London, 2019; Grillo & Leist, 2013; Koch, 2017; Tinto, 2012). With this sort of investment, we have to ask whether universities are receiving a fair return on investment (ROI) for their efforts. In tutoring, for example, are all methods of providing learning assistance created equal? Does more tutoring automatically mean better grades?
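The national spending estimate above follows from multiplying the two cited figures. A minimal check of that arithmetic, using only the per-student cost and enrollment count stated in the text (the variable names are illustrative):

```python
# Figures as cited in the text (Mowreader, 2025; National Student Clearinghouse, 2025).
spend_per_student = 2_933          # dollars per student per year
enrolled_students = 18_400_000     # Spring 2025 enrollment

# Implied national spending on academic support services.
total_spend = spend_per_student * enrolled_students
print(f"Approximately ${total_spend / 1e9:.1f} billion per year")
```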
Academic support services is a broad field within education encompassing a variety of services, including tutoring, success coaching, advising, career counseling, learning labs, and more (Boylan, 2002). Peer tutoring, in particular, encompasses one-on-one tutoring, small group tutoring, and nine types of peer collaborative learning methods (Arendale, 2022). Peer tutoring is often defined as one student instructing another in a subject in which the “tutor” is more experienced, knowledgeable, or competent (Damon & Phelps, 1989). Unlike traditional one-on-one academic support, which often relies on a direct student-to-tutor transmission of information, the models evaluated in this study, both standard Group Tutoring and Embedded (Supplemental Instruction [SI]-adjacent) Tutoring, are fundamentally rooted in cooperative learning principles. In these group-based frameworks, the primary mechanism of academic recovery is peer collaboration. Students work collectively to construct knowledge and solve problems, while the tutor steps away from a lecturing role to act as a trained facilitator guiding the cooperative process (Arendale, 2022). Additionally, tutors may or may not receive training, even though training significantly improves the efficacy of tutoring services (Boylan et al., 1995).
For this article, “Group Tutoring” will refer to a postsecondary peer-assistance academic program overseen by an institution of higher learning that places students in small groups of 8 or fewer and is led by a peer tutor trained through a College Reading and Learning Association (CRLA)-accredited training program. “Embedded Tutoring” will refer to an SI-adjacent program differentiated from Group Tutoring by the following attributes: (a) Tutors have completed the course and are actively taking it again with the students they are tutoring; and (b) rather than an answer-seeking model, sessions are defined by cooperative learning and the administration of tutor-generated peer collaborative activities during session time. For this study, both tutoring types are provided by the same office, with the same supervisory framework, and the same baseline CRLA Level 1 training program.
Determining a Tutoring Program
When selecting student support services, professionals rely on a robust, decades-long body of qualitative best practices and foundational research, such as the ICLCA's Best Practice Guide for Academic Support Program Design and Improvement (Bailey & Neuburger, 2023). However, translating these broad guidelines into specific institutional frameworks requires highly nuanced decisions. Subsequent choices regarding the continuation, scale-up, and funding of these programs are frequently governed by tight resource constraints. In practice, local programmatic decisions are often guided by evaluations comparing the academic outcomes of participating students with those of nonparticipants (Rutschow et al., 2011). While valuable, these standard comparisons frequently struggle to control for the inherent selection bias of opt-in support services, and very little literature quantitatively compares the causal efficacy of different structural tutoring methods against one another. To make optimal, evidence-based structural choices and maximize ROI, administrators require rigorous causal methodologies that bridge the gap between general tutoring theory and localized resource allocation. Research must isolate the true causal impact of specific tutoring models to determine which interventions represent the best use of institutional time, attention, and financial resources.
Theoretical Framework
To maximize institutional ROI, administrators must evaluate learning improvement programs using established theoretical structures. While Keimig's (1983) hierarchy of learning improvement programs provides a foundational administrative framework for categorizing academic support, ranging from remedial courses to comprehensive learning systems, there remains a critical lack of empirical guidance on how to choose among these structures. To address this evaluative gap and provide direct quantitative evidence, this study is grounded in a dual theoretical framework that examines both the quantity and the contextual quality of academic support: Astin's Theory of Student Involvement (1984) and Vygotsky's Sociocultural Theory, specifically the Zone of Proximal Development (ZPD) (1978).
Astin (1984) posits that student learning is directly proportional to the physical and psychological energy devoted to the academic experience. Within this study, tutoring appointments serve as a quantifiable metric of this academic involvement. Specifically, “dosage” is operationalized as the cumulative number of discrete tutoring sessions a student attends throughout the semester, with each session representing a standardized time investment (e.g., 60 min). By evaluating these distinct session-based dosages—specifically focusing on one-, three-, and six-session thresholds—this research empirically tests the minimum viable level of involvement required to yield significant academic gains, while also identifying the structural ceilings on that involvement.
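The dosage definition above can be illustrated with a short sketch. The appointment log, student identifiers, and function names below are hypothetical, and the sketch assumes that a student "meets" a threshold when the cumulative session count is at least that threshold; the text does not spell out this rule.

```python
from collections import Counter

# Hypothetical appointment log: one (student_id, session_date) pair per attended session.
appointments = [
    ("s01", "2024-09-10"), ("s01", "2024-09-17"), ("s01", "2024-10-01"),
    ("s02", "2024-12-09"),
    ("s03", "2024-09-05"), ("s03", "2024-09-12"), ("s03", "2024-09-19"),
    ("s03", "2024-09-26"), ("s03", "2024-10-03"), ("s03", "2024-10-10"),
]

# Thresholds named in the text.
DOSAGE_THRESHOLDS = (1, 3, 6)


def cumulative_dosage(log):
    """Count discrete sessions attended per student over the semester."""
    return Counter(student_id for student_id, _ in log)


def thresholds_met(sessions, thresholds=DOSAGE_THRESHOLDS):
    """Return the thresholds reached, assuming 'at least k sessions' (an assumption)."""
    return [k for k in thresholds if sessions >= k]


dosage = cumulative_dosage(appointments)
for student_id in sorted(dosage):
    print(student_id, dosage[student_id], thresholds_met(dosage[student_id]))
```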
However, measuring the mere quantity of involvement is insufficient without examining the environment in which it occurs. Vygotsky's ZPD emphasizes that cognitive development is optimized when students are guided by a “more knowledgeable other.” This framework provides a critical lens for contrasting standard Group Tutoring, which relies on help-seeking outside the primary learning environment, with Embedded Tutoring, which structurally situates the peer educator directly within the classroom. This study hypothesizes that by aligning the “more knowledgeable other” with the instructional content in real time, Embedded Tutoring optimizes the ZPD, theoretically resulting in more robust academic outcomes. By statistically comparing these two tutoring methods administered within the same courses, this article aims to determine which structural model most effectively facilitates cognitive scaffolding and student success.
Methodology
The purpose of this quantitative study is to evaluate the causal effect of different tutoring methods on final course grades to determine which structural model most effectively facilitates cognitive scaffolding. This research employs a retrospective observational cohort study design, using propensity score weighting and doubly robust estimation to account for baseline confounding.
Setting and Participants
Data for this study were drawn from a public R1 research institution in the Northeastern United States. The dataset comprises a cohort of n = 6,731 undergraduate students enrolled in six high-risk “gateway” courses, critical junctures in student completion pathways (Howell & Walkington, 2022), over four consecutive academic semesters, from Fall 2023 through Spring 2025. To ensure the generalizability of the findings, these courses spanned a diverse range of academic disciplines, including natural sciences (biology), social and behavioral sciences (psychology), health professions (nursing), and quantitative reasoning (statistics). Furthermore, to control for variation in individual teaching styles and grading structures, the sample includes course sections taught by five distinct faculty members (Table 1). Data were collected retrospectively from institutional databases, capturing student demographic information, prior academic performance, and records of tutoring center appointments. Prior to statistical analysis, all records were merged and deidentified to ensure participant confidentiality. The study was approved by the institution's Institutional Review Board under exempt review for the secondary analysis of existing data.
Measures
The primary dependent variable for this analysis is the final Course Grade. The primary independent variable (treatment) is the Tutoring Category, defined by the type of tutoring received: No Tutoring, Group Tutoring, Embedded Tutoring, or Both Tutoring Types. In propensity score analysis, it is critical to select covariates that theoretically influence both the probability of receiving the treatment and the final outcome (Brookhart et al., 2006). Therefore, the selection of covariates for this study was guided by foundational literature on college student achievement, which demonstrates that precollege academic readiness and baseline demographics significantly impact a student's help-seeking behaviors and their ultimate course success (Bowman et al., 2023; Pascarella & Terenzini, 2005). To control for confounding, several demographic and academic covariates were measured prior to the treatment, including Gender, Race, Age, and Pell-eligibility status. Baseline academic achievement was measured using the prior semester's cumulative college GPA; for first-semester students lacking an established college GPA, cumulative high school GPA was utilized as a proxy measure. To account for potential metric differences between these two scales, a binary indicator variable denoting the source of the GPA (college vs. high school) was included as an additional covariate.
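As an illustration of the measures described above, the following sketch derives the four-level treatment variable and the baseline GPA with its source indicator. The record layout and field names are hypothetical and are not drawn from the institutional dataset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StudentRecord:
    # Hypothetical fields; the study's actual variable names are not reported.
    college_gpa: Optional[float]      # None for first-semester students
    high_school_gpa: Optional[float]
    group_sessions: int
    embedded_sessions: int


def baseline_gpa(record: StudentRecord) -> Tuple[Optional[float], int]:
    """Return (gpa, source_is_high_school), using high school GPA only as a proxy."""
    if record.college_gpa is not None:
        return record.college_gpa, 0
    return record.high_school_gpa, 1


def tutoring_category(record: StudentRecord) -> str:
    """Assign one of the four treatment levels named in the text."""
    used_group = record.group_sessions > 0
    used_embedded = record.embedded_sessions > 0
    if used_group and used_embedded:
        return "Both Tutoring Types"
    if used_group:
        return "Group Tutoring"
    if used_embedded:
        return "Embedded Tutoring"
    return "No Tutoring"


first_semester = StudentRecord(None, 3.7, group_sessions=2, embedded_sessions=0)
continuing = StudentRecord(3.1, 3.4, group_sessions=0, embedded_sessions=4)
print(baseline_gpa(first_semester), tutoring_category(first_semester))
print(baseline_gpa(continuing), tutoring_category(continuing))
```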
The Embedded Tutoring model evaluated in this study was explicitly designed utilizing the foundational principles of SI. It is important to note that, unlike the SI model, which is copyrighted and strictly governed by the University of Missouri-Kansas City (The International Center for Supplemental Instruction, n.d.), the term “Embedded Tutoring” lacks a centralized governing body, and programmatic guidelines vary widely across institutions and literature (Tedesco et al., 2026). Given this variance, the Embedded Tutoring model evaluated in this study is operationalized specifically as a program that retains the structural elements of SI—namely, attaching the peer educator directly to the course lectures—while incorporating customized CRLA training. Unlike models in which peer educators actively roam the classroom to assist with in-class assignments, the model evaluated in this study uses the peer educator strictly as a synchronous observer during the lecture and subsequently hosts independent, collaborative review sessions outside of class hours. For analysis, Embedded Tutoring will be considered adjacent to SI.
Statistical Analysis
Because students self-selected into tutoring methods, traditional unadjusted comparisons would be subject to selection bias (Austin, 2011; Shadish et al., 2002). Within the specific context of academic support evaluation, traditional single-equation regression models routinely fail to account for the fact that tutoring attendance and final grades are jointly determined endogenous variables, leading to highly biased estimates of program effectiveness (Bowles & Jones, 2004). To address this, the primary statistical analysis relies on propensity score weighting to estimate the Average Treatment Effect in the Overlap Population (ATO). Unlike traditional Inverse Probability Weighting, which can be skewed by extreme propensity scores, ATO weighting restricts inference specifically to the overlap region (Stürmer et al., 2021). This effectively targets students in clinical equipoise—individuals whose baseline characteristics give them a plausible probability of receiving either treatment (Li et al., 2018; Rizk, 2025). This methodological choice safely isolates the students who are genuinely on the margin of utilizing academic support.
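The contrast between inverse probability weights and overlap weights can be sketched for the simplified two-group case (treated versus untreated). This is only an illustration: the study compares four tutoring categories, the propensity scores below are invented, and the function names are hypothetical. In the two-group case, overlap weights assign each treated unit a weight of 1 − e(x) and each untreated unit a weight of e(x), where e(x) is the estimated probability of treatment.

```python
def inverse_probability_weights(treated, propensity):
    """Two-group IPW: 1/e for treated units, 1/(1 - e) for untreated units."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e)
            for t, e in zip(treated, propensity)]


def overlap_weights(treated, propensity):
    """Two-group overlap (ATO) weights: 1 - e for treated, e for untreated."""
    return [1.0 - e if t == 1 else e
            for t, e in zip(treated, propensity)]


# Invented example: the first two students have extreme propensity scores,
# the last two sit in the middle of the overlap region.
treated = [1, 0, 1, 0]
propensity = [0.02, 0.02, 0.50, 0.50]

ipw = inverse_probability_weights(treated, propensity)
ato = overlap_weights(treated, propensity)

for t, e, w_ipw, w_ato in zip(treated, propensity, ipw, ato):
    print(f"treated={t} e={e:.2f} IPW={w_ipw:.2f} overlap={w_ato:.2f}")
```

The treated student with e(x) = 0.02 receives a very large inverse probability weight, whereas every overlap weight is bounded between 0 and 1, which is the property the paragraph above relies on.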
To determine the maximum reliable treatment dosage for propensity score weighting, an initial analysis of covariate balance was conducted via standardized mean differences (SMDs) and visual inspection of Love plots (Greifer, 2024). Initial analysis revealed a violation of the common support assumption at higher dosages; consequently, six visits were empirically identified as the maximum reliable threshold for achieving strict balance, defined as an SMD < 0.10 (Austin, 2011).
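A minimal sketch of the balance check described above is given here for a single continuous covariate and two groups. It assumes the SMD is the difference in (weighted) means divided by the pooled unweighted standard deviation; the exact standardization used in the study's software settings is not reported, and the GPA values are invented.

```python
from math import sqrt
from statistics import variance

BALANCE_THRESHOLD = 0.10  # strict balance criterion cited in the text


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def standardized_mean_difference(x_treated, x_control,
                                 w_treated=None, w_control=None):
    """Difference in weighted means over the pooled unweighted SD (an assumption)."""
    if w_treated is None:
        w_treated = [1.0] * len(x_treated)
    if w_control is None:
        w_control = [1.0] * len(x_control)
    diff = weighted_mean(x_treated, w_treated) - weighted_mean(x_control, w_control)
    pooled_sd = sqrt((variance(x_treated) + variance(x_control)) / 2)
    return diff / pooled_sd


# Invented baseline GPAs for two groups.
gpa_treated = [3.6, 3.4, 3.8, 3.2]
gpa_control = [3.0, 2.8, 3.4, 3.2]

smd_unweighted = standardized_mean_difference(gpa_treated, gpa_control)
# Illustrative weights that happen to equalize the two weighted means.
smd_weighted = standardized_mean_difference(
    gpa_treated, gpa_control,
    w_treated=[0.0, 1.0, 0.0, 1.0],
    w_control=[0.0, 0.0, 1.0, 1.0],
)

print(abs(smd_unweighted) < BALANCE_THRESHOLD)
print(abs(smd_weighted) < BALANCE_THRESHOLD)
```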
To safeguard against potential residual confounding after weighting, a doubly robust estimation strategy was deployed. By including the baseline demographic and academic covariates directly in the weighted ANCOVA outcome models, this approach mathematically adjusts for minor residual imbalances, providing unbiased estimates of the treatment effect (Bang & Robins, 2005; Funk et al., 2011). Finally, to quantify the robustness of any significant treatment effects against potential unmeasured bias (such as innate student motivation), sensitivity analyses were conducted using E-values (VanderWeele & Ding, 2017).
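The doubly robust idea, weighting the sample and also adjusting for covariates in the outcome model, can be sketched as a weighted least squares regression of grade on a treatment indicator plus a covariate. The study itself fitted weighted ANCOVA models in SPSS; the pure-Python solver, data, and weights below are hypothetical and serve only to show the structure of such a model.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def weighted_least_squares(rows, y, w):
    """Return coefficients minimizing sum_i w_i * (y_i - rows_i . beta)^2."""
    k = len(rows[0])
    xtwx = [[sum(wi * r[i] * r[j] for r, wi in zip(rows, w)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(wi * r[i] * yi for r, yi, wi in zip(rows, y, w)) for i in range(k)]
    return solve_linear_system(xtwx, xtwy)


# Invented data: grade = 70 + 3 * treated + 5 * gpa, with no noise.
treated = [1, 1, 0, 0, 1, 0]
gpa = [3.0, 2.5, 3.5, 2.0, 3.8, 2.9]
grades = [70 + 3 * t + 5 * g for t, g in zip(treated, gpa)]
weights = [0.4, 0.6, 0.5, 0.3, 0.2, 0.7]  # stand-ins for propensity score weights

design = [[1.0, float(t), g] for t, g in zip(treated, gpa)]
coefficients = weighted_least_squares(design, grades, weights)
print(f"adjusted treatment effect: {coefficients[1]:.2f} points")
```

Because the invented outcome is an exact linear function of the predictors, the treatment coefficient recovers the built-in 3-point effect; with real data, the coefficient would instead be an estimate with sampling error.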
Data preprocessing, propensity score estimation via ATO weighting, and covariate balance diagnostics were performed in R (Version 4.5.2) using the WeightIt and cobalt packages, with AI-assisted syntax available upon request. Subsequent doubly robust ANCOVA outcome modeling, sensitivity analyses, and pairwise comparisons were executed using IBM SPSS Statistics (Version 29).
Results
Assessment of Covariate Balance
Prior to propensity score weighting, the baseline demographic and academic characteristics of the 6,731 students were assessed across the four tutoring methods, as shown in Table 2. A review of these unweighted characteristics via Chi-square tests of independence revealed significant baseline imbalances, indicating severe selection bias. Specifically, self-selection into tutoring categories varied significantly by student sex (p < .001) and GPA source (p < .001), while race and Pell-eligibility status did not significantly drive treatment selection. Most prominently, students who self-selected into Embedded Tutoring entered the course with a higher cumulative GPA than their peers. This unadjusted disparity explicitly highlights behavioral selection bias and necessitates the doubly robust propensity score weighting approach employed in this study.
Table 1. Baseline Sample Distribution by Discipline and Instructor.
Table 2. Unweighted Baseline Characteristics of the Study Sample by Tutoring Method.
Note: M = mean; SD = standard deviation; GPA = Grade Point Average.
Data represent the raw, unweighted baseline characteristics of the retrospective cohort prior to applying Average Treatment Effect in the Overlap Population propensity score weights. “GPA Source: High School” indicates the percentage of first-semester students for whom a cumulative high school GPA was utilized as a proxy measure for baseline academic achievement.
Following the application of ATO weights at the six-visit threshold, covariate balance was reassessed. As demonstrated in Figure 1, the weighting procedure achieved strict balance across the vast majority of covariates, reducing the SMDs to below the 0.10 threshold. While minor residual imbalance was observed for Age and the White racial category (with SMDs marginally exceeding the 0.10 threshold), the risk of residual confounding was mitigated through the doubly robust estimation strategy. Because these specific covariates were included directly in the final weighted ANCOVA outcome model, any remaining minimal imbalance was adjusted for in the causal estimates.

Figure 1. Love plot for ATO-weighted variables at the six-visit threshold.
Main Effect of Tutoring Methods
To ascertain the overall impact of the academic support structures, a doubly robust ANCOVA was conducted to evaluate the effect of tutoring methods on final course grades. The model revealed a statistically significant main effect of tutoring category, F(3, 6721) = 183.78, p < .001. The estimated marginal means and standard errors for each of the four tutoring methods, adjusted for all baseline demographic and academic covariates, are presented in Table 3.
Table 3. Descriptive Statistics for Tutoring Methods at the Six-Visit Threshold.
Note. SE = standard error.
Pairwise Comparisons and Sensitivity Analysis
Subsequent pairwise comparisons, detailed in Table 4, were conducted to isolate the precise causal impact of each specific method compared to the baseline (No Tutoring). At the six-visit threshold, students who utilized Embedded Tutoring scored an estimated 3.84 points higher on their final course grades (measured consistently across all sections on a standard 100-point percentage scale) compared to peers who received no tutoring (p < .001). To quantify the robustness of this specific finding, an E-value was calculated. Sensitivity analysis indicated that this effect is highly robust to unmeasured confounding: to completely explain away the observed grade increase, an unmeasured covariate (such as innate student motivation) would need to be associated with both the likelihood of a student attending six Embedded Tutoring sessions and their final course grade by a risk ratio of at least 34.86 (E-value = 34.86). A weaker confounder could not negate the finding. Conversely, students utilizing Group Tutoring at this threshold scored an estimated 2.42 points lower on their final course grades than the baseline group, a statistically significant finding that demonstrated lower relative robustness to unmeasured confounding (E-value = 11.72).
Table 4. Comparison Statistics for Tutoring Methods Versus No Tutoring at the Six-Visit Threshold.
Note: SE = standard error; CI = confidence interval; ATO = Average Treatment Effect in the Overlap Population.
The Adjusted Mean Difference represents the estimated grade difference compared to the “No Tutoring” baseline, holding all demographic and academic covariates constant via ATO weighting and doubly robust estimation. Positive values indicate a higher course grade than the baseline. E-values represent the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need to have with both the treatment and the outcome to fully explain away the observed effect. E-values are not calculated for nonsignificant findings where the 95% confidence interval crosses zero.
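For readers unfamiliar with E-values, the sketch below applies the general formula from VanderWeele and Ding (2017), E = RR + sqrt(RR × (RR − 1)), together with their approximate conversion of a standardized mean difference d to a risk ratio, RR ≈ exp(0.91 × d). It is not intended to reproduce the values in Table 4, because the standardization applied to the grade differences in this study is not reported; the function names and inputs are illustrative.

```python
from math import exp, sqrt


def e_value_from_risk_ratio(rr):
    """E-value for a point estimate on the risk ratio scale."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr  # protective effects are inverted before applying the formula
    return rr + sqrt(rr * (rr - 1.0))


def approximate_risk_ratio(standardized_difference):
    """Approximate risk ratio implied by a standardized mean difference."""
    return exp(0.91 * standardized_difference)


# Illustrative inputs only.
print(round(e_value_from_risk_ratio(2.0), 2))
print(round(e_value_from_risk_ratio(approximate_risk_ratio(0.5)), 2))
```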
An analysis of the “Both Tutoring Types” category provides compelling insight into programmatic synergy (see Figure 2). At the one-visit threshold, students utilizing both methods exhibited significantly lower grades compared to the baseline (p < .001), mirroring the severe academic distress characteristic of the isolated Group Tutoring cohort. However, unlike the Group-only cohort, which remained significantly negative across all dosages, students who sustained participation in both methods saw their academic deficit neutralized. By the six-visit threshold, the course grades for students utilizing Both Tutoring Types were statistically indistinguishable from the baseline (p = .729). This trajectory suggests that the addition of synchronous, SI-adjacent Embedded Tutoring effectively arrested the academic decline associated with the standard group demographic, pulling an acutely at-risk population back to the institutional baseline.

Figure 2. Estimated marginal means of course grades across tutoring categories and dosage thresholds.
Analysis of lower dosage thresholds demonstrated a cumulative effect aligned with Astin's theory of involvement; while one-visit and three-visit thresholds yielded statistically significant positive effects for Embedded Tutoring, the magnitude of the effect and the corresponding E-values escalated as students approached the six-visit maximum.
Discussion
While theoretical frameworks such as Keimig's (1983) hierarchy offer administrators clear models for structuring learning improvement programs, there remains a critical lack of direct empirical evidence to justify selecting one peer-led model over another. Administrators are frequently tasked with allocating substantial resources to academic support centers without definitive proof of which specific programmatic structures actually yield the greatest cognitive and academic benefits. Therefore, the purpose of this study was to fill this evaluative gap by utilizing a highly rigorous, doubly robust propensity score methodology to isolate the causal impact of two distinct peer tutoring methods on students’ final course grades. By mathematically controlling for selection bias and baseline academic disparities, the outcome models revealed a stark divergence in the efficacy of these two interventions. At the empirically derived six-visit threshold, students who used the Embedded Tutoring model scored an estimated 3.84 points higher on their final course grades than peers who received no tutoring. Sensitivity analysis demonstrated that this specific positive effect is highly robust to unmeasured confounding, yielding an E-value of 34.86. Conversely, standard Group Tutoring at this same dosage yielded a statistically significant negative impact, with students scoring an estimated 2.42 points lower than the untutored baseline group. These contrasting causal outcomes challenge the prevailing assumption that mere access to peer-led academic support is universally beneficial. Instead, the data strongly indicate that the efficacy of a tutoring intervention is heavily contingent not only on the quantity of a student's involvement but, fundamentally, on the contextual quality and structural design of that involvement.
The Involvement Threshold and Dosage Effects
The results of this study strongly validate Astin's Theory of Student Involvement, demonstrating that academic gains are contingent not merely upon access to a tutoring method, but upon a student's sustained, active participation. Analysis of the varying dosage thresholds revealed a clear cumulative effect; while attending between one and three sessions yielded positive trends, the magnitude and statistical robustness of the impact escalated significantly as students approached the six-visit threshold. This indicates that the cognitive scaffolding provided by peer educators requires repeated, continuous exposure to effectively alter a final course grade. These findings align with prior literature indicating that the impact of academic support is mediated by the amount or level of tutoring a student receives (Grillo & Leist, 2013; Rheinheimer, 2000). Specifically, the data support the existence of a “floor effect,” wherein minimal participation—often manifesting as a late-term “desperation visit” immediately preceding a final exam—provides insufficient support (Rheinheimer, 2000). Such isolated visits lack the cumulative physical and psychological energy required by Astin's framework to produce meaningful academic change. As Lidren and Meier (1991) previously demonstrated, peer tutoring systems that employ maximal, intensive structures consistently yield superior academic performances compared to minimal, infrequent interventions. This cumulative necessity is further validated by McGee (2005), who demonstrated that students categorized as having “low engagement” with peer-led academic support failed to achieve the significant grade increases observed in “high engagement” cohorts. A single drop-in session cannot replace the sustained engagement required to master complex gateway coursework.
Crucially, however, the empirically derived six-visit ceiling identified in the propensity score models represents a methodological limit of the current dataset, rather than a universal maximum for student achievement. While foundational theory emphasizes the necessity of sustained effort, contemporary research indicates that the relationship between college students’ experiences and academic outcomes often exhibits curvilinear patterns, in which the academic benefits of involvement eventually plateau or yield diminishing returns (Bowman & Trolian, 2017). Due to extreme data sparsity and a violation of the common support assumption at higher dosages, this study could not mathematically evaluate the efficacy of seven, eight, or more visits. Therefore, while six visits represent the crest of reliable causal inference for this specific cohort, it remains possible that an even higher dosage could yield further gains, especially as it relates to Group Tutoring. Future research utilizing larger, multi-institutional datasets is required to fully map this curvilinear relationship and determine the true point of diminishing returns for peer tutoring of all method types.
Optimizing the Zone of Proximal Development
Having established the requisite dosage for academic impact, the stark divergence in efficacy between the two tutoring methods demands a cognitive and structural explanation. The institutional structure of this study provides a uniquely rigorous comparative environment by effectively controlling for baseline tutor quality. Both Group Tutors and Embedded Tutors possess the same historical experience (having successfully completed the course with the same professor), receive the same baseline CRLA Level 1 certification, and operate under identical supervisory structures. Therefore, the significant divergence in student outcomes must be attributed to the two variables that distinguish the embedded model: synchronous classroom presence and supplementary facilitation training.
This disparity can be understood more deeply through Vygotsky's (1978) ZPD. For cognitive scaffolding to be effective, the peer educator must operate as a highly optimized “more knowledgeable other.” While standard Group Tutors possess highly relevant historical experience, this knowledge operates externally to the current, daily pedagogical reality of the course. Conversely, because Embedded Tutoring retains the foundational structural elements of SI, the peer educator is physically situated within the primary learning environment and attends lectures synchronously with the students. This real-time contextual advantage allows the Embedded Tutor to map their scaffolding exactly to the professor's current pacing, emphasis, and localized vocabulary.
This synchronous advantage is catalyzed by the second isolated variable: specialized pedagogical training. While CRLA Level 1 certification provides an excellent foundation in basic peer tutoring techniques, Embedded Tutors at the study institution receive additional, supplementary training specifically focused on SI-adjacent facilitation. As Wilcox and Jacobs (2008) noted in their retrospective on the origins of SI, a fundamental maxim of effective peer education is that “whoever does most of the talking does most of the learning” (p. viii). Traditional group tutoring often defaults to a reactive, question-and-answer format driven by immediate student deficits, wherein the tutor inadvertently dominates the session by “reteaching” the material. In contrast, SI-facilitation training equips educators to proactively design collaborative learning activities that shift the cognitive load back onto the students. As McGuire (2006) explicitly outlines, this structural shift moves the tutor away from merely delivering content and toward actively teaching metacognitive learning strategies. Furthermore, Wilcox and Jacobs (2008) argue that maintaining the focus of these collaborative sessions requires a trained facilitator who intricately “knows both the course and the instructor” (p. ix).
Consequently, the Embedded Tutor is uniquely equipped to act as the ideal “more knowledgeable other.” Because they are in the room when the professor delivers the content that week and possess advanced facilitation training, they do not rely solely on historical memory to answer questions reactively. Instead, they can proactively model the exact note-taking strategies, metacognitive behaviors, and critical thinking skills required to succeed on the professor's upcoming assessments. This precise, synchronous, and highly trained scaffolding is what allows Embedded Tutoring to so successfully optimize the ZPD and drive significant gains in final course grades.
The Help-Seeking Divide and Selection Bias
While the cognitive advantages of synchronous facilitation explain the overwhelming success of the Embedded Tutoring method, the statistically significant negative impact (−2.42 points) of standard Group Tutoring requires a behavioral explanation. To understand why receiving traditional tutoring was associated with lower grades than receiving no tutoring at all, it is necessary to examine the psychology of help-seeking and the inherent selection bias in voluntary academic support. Educational research has consistently demonstrated that voluntary academic support programs suffer from severe selection bias; unadjusted regression estimates routinely misrepresent program efficacy because a student's likelihood of attending tutoring is inextricably linked to their underlying academic vulnerabilities (Bowman et al., 2023). As Bowles and Jones (2004) established in their foundational analysis of academic support, traditional models often reflect a skewed demographic because students with below-average academic ability or those experiencing acute academic distress are the populations most likely to seek external help.
This phenomenon is deeply rooted in the psychology of achievement goals and the perceived “costs” of seeking assistance. Recent literature examining self-regulated learning strategies indicates that a student's willingness to engage in instrumental help-seeking (seeking assistance to master the learning process rather than just securing an immediate answer) is heavily dictated by their motivational profile, including factors such as academic self-efficacy and locus of control (Bean & Eaton, 2001; Drago et al., 2018). Students overly concerned with performance goals often perceive high social and psychological “costs” associated with asking for help, viewing it as an admission of incompetence rather than a tool for mastery (Gonida et al., 2019). This behavioral reality contextualizes the negative impact of the standard Group Tutoring method. Standard Group Tutoring is an isolated, external service that requires a student to independently recognize a knowledge deficit, overcome the social stigma of seeking remedial help, and physically travel to a separate location. Consequently, the population utilizing Group Tutoring is heavily skewed toward students lacking college-level self-regulation who are already in a state of severe, immediate academic crisis. This is empirically evident in the unweighted baseline characteristics of this study: first-semester freshmen (as indicated by a high school GPA source) comprised a staggering 54.5% of the standard Group Tutoring cohort, compared to only 24.8% of the Embedded cohort. This suggests that students without established academic experience frequently delay seeking external help until an acute failure occurs, driving the “desperation visit” phenomenon.
While the doubly robust ATO modeling successfully controlled for this baseline experience, it is highly probable that the −2.42 score reflects a latent, unmeasured variable of mid-semester academic distress that drove these specific students to seek standard tutoring.
While prior comparative studies have attempted to evaluate different structural models, they frequently encounter these exact methodological limitations regarding selection bias and ecological validity. For instance, Mendes et al. (2017) evaluated a shift from Supplemental Instruction to Weekly Tutoring Groups, finding that the structured group model drove more consistent attendance. However, their model relied on an informal contractual agreement where students who missed more than two sessions were subject to losing their spot. While such punitive policies may inflate participation metrics, they do not reflect natural, voluntary help-seeking behaviors. Furthermore, Mendes et al. (2017) explicitly acknowledged that their evaluation lacked a true experimental format and that volunteering students may have inherently possessed a confounding “achievement-ambition” factor. The present study resolves these limitations by evaluating a strictly opt-in, penalty-free environment to provide an ecologically valid measure of natural student engagement. To account for the exact unmeasured motivation factors that hindered Mendes et al. (2017), this study incorporates an E-value (34.86), mathematically demonstrating that an unmeasured confounder would need an implausibly strong association with both tutoring participation and course grades to negate the Embedded Tutoring findings. This statistical robustness is consistent with prior empirical research. In an extensive evaluation of help-seeking behaviors, McGee (2005) utilized the Motivated Strategies for Learning Questionnaire to demonstrate that highly engaged academic support students achieved significantly higher final course grades even when baseline differences in student motivation were strictly controlled for.
Consequently, although motivation remained an unmeasured covariate in this retrospective dataset, McGee's findings support the conclusion that the 3.84-point increase is driven by the structural intervention itself, not merely by the preexisting drive of the students who attended.
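To make the E-value reasoning above concrete, the following is a minimal sketch of the widely used closed-form E-value for an effect expressed as a risk ratio. It is written in Python for illustration and is not the R code used in this study. The function names `e_value` and `rr_from_e_value` are hypothetical, and the sketch assumes the effect has already been converted to the risk-ratio scale; how the grade-point difference in this study was converted is not stated here and is not assumed.

```python
import math


def e_value(rr: float) -> float:
    """Closed-form E-value for a risk ratio: RR + sqrt(RR * (RR - 1)).

    Risk ratios below 1 are inverted first, so protective and harmful
    effects of the same strength yield the same E-value.
    """
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))


def rr_from_e_value(e: float) -> float:
    """Invert the formula above: RR = E^2 / (2E - 1), valid for E >= 1."""
    if e < 1:
        raise ValueError("an E-value is at least 1")
    return e * e / (2.0 * e - 1.0)


if __name__ == "__main__":
    reported = 34.86  # E-value reported for Embedded Tutoring
    implied = rr_from_e_value(reported)
    print(f"risk ratio implied by the formula: {implied:.2f}")
    print(f"round-trip E-value: {e_value(implied):.2f}")
```

Read this way, an E-value of 34.86 means that an unmeasured confounder would need to be associated with both tutoring participation and the grade outcome by a risk ratio of roughly 34.86 each, beyond the measured covariates, to fully explain away the estimate. Under this formula alone, that corresponds to an effect of about 17.7 on the risk-ratio scale.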
Ultimately, the Embedded Tutoring method structurally dismantles this barrier to help-seeking. Because the peer educator is a visible, integrated component of the daily classroom environment, the intervention normalizes help-seeking behavior for the entire cohort. This normalization is reflected in its ability to reach a broader, less traditional tutoring demographic; the significant baseline differences in student sex indicate that the Embedded model successfully captured a higher proportion of male students (15.9%) than standard Group Tutoring (9.3%), a demographic that traditionally underutilizes voluntary academic support centers. Furthermore, traditional academic support programs have historically struggled with a stigmatized demographic; as McGee (2005) demonstrated, students engaging in academic support entered with significantly lower mean SAT scores than nonparticipants, indicating a population acting from a deficit mindset. However, the Embedded Tutoring model evaluated in this study successfully disrupted this historical trend. Prior to propensity score adjustment, the unweighted baseline characteristics revealed that students who self-selected into Embedded Tutoring actually entered the course with a higher cumulative GPA than their peers. By removing the external effort required to seek standard tutoring, the Embedded method successfully normalized academic support, capturing a broader, highly balanced population and preventing the concentration of academic distress that limits traditional group models.
Administrative Implications and the Keimig Hierarchy
The stark divergence in causal outcomes between the two tutoring methods necessitates a critical, data-driven reevaluation of how institutions structure and market peer-led academic support. While theoretical discussions of cognitive scaffolding and selection bias are academically vital, university administrators require practical frameworks to optimize resource allocation. The findings of this study provide definitive, empirical justification for evolving traditional academic support services, utilizing Keimig's (1983) Hierarchy of Learning Improvement Programs as an evaluative lens.
As Arendale (2022) notes in his application of Keimig's hierarchy, traditional models like the Group Tutoring evaluated in this study operate at Tier II: Learning Assistance for Individual Students. This tier fundamentally relies on isolated, disparate services, with the onus entirely on the student to independently diagnose a knowledge deficit and seek external remediation. As the negative causal impact (−2.42 points) and the analysis of help-seeking behaviors demonstrate, this isolated structure often fails to reach students proactively, inadvertently capturing a heavily skewed population already in a state of late-semester academic crisis.
Conversely, the Embedded Tutoring method operates at a significantly higher level, aligning with Keimig's Tier III: Course-Related Learning Services. Because the SI-adjacent peer educator is physically present during lectures, academic support is no longer a separate, stigmatized service. Instead, it is structurally woven into the course itself, successfully normalizing help-seeking behavior for the entire cohort and capturing a broader, more balanced population before academic distress sets in. While this method remains one step below Keimig's theoretical ideal of a Tier IV: Comprehensive Learning System (which requires structural changes to institutional curricula), Tier III interventions represent the highest standard of programmatic efficacy readily available to learning center administrators. While Tier IV interventions, like Peer-Led Team Learning (Arendale, 2022), are possible, they are much more challenging and costly to implement.
Crucially, the administrative mandate derived from these findings is not to abandon or reduce standard Group Tutoring, which remains a cornerstone of institutional support. Rather, the mandate is to initiate a programmatic evolution that bridges the gap between Tier II and Tier III services. Administrators must aggressively apply the successful cognitive and behavioral elements of the SI model to modernize standard tutoring operations.
First, to combat selection bias and the “desperation visit” phenomenon, institutions must overhaul how standard tutoring is marketed. Acknowledging the empirically derived six-visit threshold necessary for significant academic impact, learning centers must deploy targeted, early-semester outreach campaigns that normalize help-seeking and explicitly incentivize sustained engagement from week one. Second, administrators must bridge the cognitive gap by upgrading standard tutor training. The advanced, SI-adjacent facilitation training that empowered these Embedded Tutors to teach metacognitive strategies (McGuire, 2006) should be integrated into baseline CRLA certification for all peer educators. By equipping all tutors with proactive facilitation techniques rather than relying on reactive reteaching, institutions can successfully elevate traditional tutoring, maximizing student academic outcomes across all support methods.
Finally, the neutralizing effect observed within the “Both Tutoring Types” cohort provides a clear mandate for cross-referral pathways. When learning center staff identify students utilizing standard Group Tutoring who are exhibiting signs of severe academic distress, administrators should establish protocols to actively recruit and transition these students into concurrent Embedded Tutoring sessions. The data suggest that pairing the isolated, remedial support of Group Tutoring with the proactive, metacognitive scaffolding of the Embedded model can successfully arrest a student's academic decline and return them to the institutional baseline, effectively transforming an acute drop-out risk into a retained student.
Limitations and Future Research
While the methodological design of this study utilized advanced propensity score techniques to establish causal inference, several limitations must be acknowledged. First, because this research employs a retrospective, observational cohort design rather than a Randomized Controlled Trial, the findings remain quasi-experimental. The ATO weighting and subsequent doubly robust estimation successfully achieved strict balance across all measured baseline covariates; however, observational data cannot entirely rule out the influence of unmeasured confounding variables. While the calculation of E-values, and their alignment with prior empirical research, demonstrated that the findings are highly robust against latent constructs such as student motivation, absolute causality is inherently limited by the variables captured within the institutional database. Second, this study was conducted at a single public R1 research institution, which may limit the generalizability of the findings to different institutional types, such as community colleges or private liberal arts universities.
A third limitation involves the unmeasured heterogeneity of faculty behavior regarding tutoring referrals. To establish a baseline of consistency, all participating faculty agreed to the embedded model and committed to encouraging student participation. Furthermore, uniform promotional practices were implemented across all course sections, including in-class tutor introductions, syllabus integrations, and standardized announcements via the university's learning management system. Crucially, no participating faculty offered extra credit, grading incentives, or mandatory attendance requirements tied to the tutoring programs; participation remained strictly opt-in across all courses. However, the precise frequency and intensity of individualized faculty encouragement, such as informal referrals during private office hours or personalized email outreach, could not be systematically observed. While this variability in faculty-driven help-seeking encouragement represents a limitation, the resulting E-value (34.86) indicates that such unmeasured confounders would need to exert an exceptionally strong influence on both the treatment assignment and the outcome to invalidate the observed causal effects.
A final limitation is that the data were analyzed as a single, combined cohort rather than disaggregated by specific academic disciplines. While the dataset encompasses a diverse range of subjects, evaluating the impact of sustained tutoring, specifically at the three- and six-session thresholds, requires a robust sample size to ensure statistical accuracy. Breaking the data down by individual courses or instructors resulted in groups that were simply too small to reliably measure the true causal effect of the tutoring models. Consequently, while the aggregated findings demonstrate that the structural model is effective across high-risk courses generally, future research utilizing larger, multi-institutional datasets is required to explore how these outcomes might vary between specific disciplines.
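For readers less familiar with the overlap (ATO) weighting referred to in the first limitation, the following is a minimal sketch of the weighting step only. It is written in Python for illustration and is not the R code used in this study. It assumes propensity scores have already been estimated by some model and omits the outcome-model adjustment that makes the full analysis doubly robust. All function names and the toy numbers are hypothetical.

```python
from typing import Sequence


def overlap_weights(treated: Sequence[int], propensity: Sequence[float]) -> list[float]:
    """Overlap (ATO) weights: 1 - e(x) for treated units, e(x) for controls."""
    weights = []
    for t, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 - e if t == 1 else e)
    return weights


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of values; weights need not sum to 1."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def ato_estimate(outcome: Sequence[float], treated: Sequence[int],
                 propensity: Sequence[float]) -> float:
    """Difference in overlap-weighted mean outcomes (treated minus control)."""
    w = overlap_weights(treated, propensity)
    y1 = [y for y, t in zip(outcome, treated) if t == 1]
    w1 = [wi for wi, t in zip(w, treated) if t == 1]
    y0 = [y for y, t in zip(outcome, treated) if t == 0]
    w0 = [wi for wi, t in zip(w, treated) if t == 0]
    return weighted_mean(y1, w1) - weighted_mean(y0, w0)


if __name__ == "__main__":
    # Hypothetical toy data: course grades, tutoring indicator, propensity scores.
    grades = [80.0, 70.0, 70.0, 60.0]
    tutored = [1, 1, 0, 0]
    scores = [0.75, 0.25, 0.75, 0.25]
    print(ato_estimate(grades, tutored, scores))
```

Because each unit is weighted by its probability of receiving the opposite condition, students who were almost certain to attend or almost certain not to attend contribute little. This concentrates the comparison on students who could plausibly have ended up in either group.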
These limitations provide clear avenues for future research. First, subsequent studies should utilize multi-institutional datasets to further investigate the dosage effects of academic support. Because data sparsity limited this study's analysis to a six-visit ceiling, larger cohorts are required to fully map the curvilinear relationship between student involvement and course grades and to identify the exact threshold of diminishing returns for peer tutoring. Additionally, future research should explicitly isolate the variable of tutor training. While this study demonstrated the superiority of a model combining CRLA certification with SI facilitation strategies, more granular research is needed to determine which metacognitive training modules yield the greatest gains in student achievement.
Conclusion
As higher education institutions continue to face increasing pressure to optimize resources and improve student retention, the evaluation of academic support services must move beyond simple utilization metrics and toward rigorous causal analysis. While traditional, small-group tutoring remains a prevalent model in higher education, its ubiquity is frequently driven by pragmatic administrative constraints, namely affordability and generalized accessibility, rather than optimal academic efficacy. Furthermore, transitioning to highly effective, course-embedded models presents significant cultural and structural challenges, most notably the necessity of securing robust faculty buy-in, which may not be immediately feasible in all institutional environments. However, while acknowledging these very real operational and political constraints, the findings of this study demonstrate that simply providing access to nonintegrated, traditional tutoring is often insufficient to neutralize severe academic deficits. Instead, the data reveal that the efficacy of peer-led intervention is determined by the structural context of delivery and the educator's pedagogical training. Driving true academic recovery in high-risk courses therefore requires institutions to navigate these structural barriers and prioritize the deeper faculty partnerships necessary for course-embedded support. By transitioning from isolated, Tier II group tutoring models to Tier III course-embedded structures, institutions can successfully dismantle help-seeking stigma, reach a broader range of students, and optimize the Zone of Proximal Development. Ultimately, maximizing student achievement and driving long-term persistence requires institutions to proactively weave highly trained, synchronous cognitive scaffolding directly into the fabric of the learning environment.
Footnotes
Acknowledgments
The author acknowledges the use of Google Gemini Pro 3.1 to create and correct RStudio Code for ATO Propensity Score Weighting and to generate Love plots. Gemini did not help interpret or read the output beyond troubleshooting error messages and confirming weight balances. AI was not used in the development of the manuscript or interpretation of data. All errors in this manuscript are human-made. The author extends special thanks to Dr. Tracy Hodges at the Center for Research Excellence and Innovation at Sam Houston State University for her guidance and mentorship throughout this process.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
