Instructional Interventions Affecting Critical Thinking Skills and Dispositions: A Stage 1 Meta-Analysis

Abstract

Critical thinking (CT), or the ability to engage in purposeful, self-regulatory judgment, is widely recognized as an important, even essential, skill. This article describes an ongoing meta-analysis that summarizes the available empirical evidence on the impact of instruction on the development and enhancement of critical thinking skills and dispositions. We found 117 studies based on 20,698 participants, which yielded 161 effects with an average effect size (g+) of 0.341 and a standard deviation of 0.610. The distribution was highly heterogeneous (Q_T = 1,767.86, p < .001). There was, however, little variation due to research design, so we neither separated studies according to their methodological quality nor used any statistical adjustment for the corresponding effect sizes. Type of CT intervention and pedagogical grounding were substantially related to fluctuations in CT effect sizes, together accounting for 32% of the variance. These findings make it clear that improvement in students’ CT skills and dispositions cannot be a matter of implicit expectation. As important as the development of CT skills is considered to be, educators must take steps to make CT objectives explicit in courses and also to include them in both preservice and in-service training and faculty development.

Keywords

critical thinking, meta-analysis, achievement

Critical thinking (CT), or the ability to engage in purposeful, self-regulatory judgment, is widely recognized as an essential skill for the knowledge age. Most educators would agree that learning to think critically is among the most desirable goals of formal schooling. This includes not only thinking about important problems within disciplinary areas, such as history, science, and mathematics, but also thinking about the social, political, and ethical challenges of everyday life in a multifaceted and increasingly complex world. Bailin and Siegel (2003) argued that “critical thinking is often regarded as a fundamental aim and an overriding ideal of education” (p. 188), and Scheffler (1973) contended that “critical thinking is of the first importance in the conception and organization of educational activities” (p. 1).

As well as being better students, a short-lived advantage, critical thinkers have a better future as functional and contributing adults. Let two examples suffice. Parents who can think through the challenges of child rearing and who act accordingly have a better chance of propagating a societal future of harmony and well-being. At a broader societal level, a democracy composed of citizens who can think for themselves on the basis of evidence and concomitant analysis, rather than emotion, prejudice, or dogma, is a plus—in fact, it sustains, builds, and perpetuates the democracy.

Educators are not alone in their concern about the urgency of teaching and learning CT. In the United States, “a national survey of employers, policymakers, and educators found consensus that the dispositional as well as the skills dimension of critical thinking should be considered an essential outcome of a college education” (Tsui, 2002, pp. 740–741). In Canada, the cross-country consultation on the Canadian federal government’s highly influential “Innovation, Knowledge, and Skills” policy recommended that schools, colleges, and universities “should promote critical thinking . . . at all levels of education” (Government of Canada, 2002, n.p.). The Conference Board of Canada expressed the need for Canadians to improve their CT skills to strengthen Canada’s innovation profile and competitive advantage in the knowledge-based global economy (Bloom & Watt, 2003).

What Is CT?

CT is by no means a new concept, but part of the problem facing practitioners and researchers alike is that it is a complex and controversial notion that is difficult to define and, consequently, to study. Furthermore, the tools of implementation (instructional interventions) are difficult to operationalize. Many definitions of CT have been proposed (Ennis, 1962, 1987; Facione, 1990a; Kurfiss, 1988; Lipman, 1991; Paul & Binker, 1990; Scriven & Paul, 1996; Siegel, 1988). One high-profile definition was developed by an American Philosophical Association Delphi panel of 46 experts, including leading CT scholars such as Ennis, Facione, and Paul:

We understand critical thinking to be purposeful, self-regulatory judgment which results in interpretation, analysis, evaluation, and inference, as well as explanation of the evidential, conceptual, methodological, criteriological, or contextual considerations upon which that judgment is based. . . . The ideal critical thinker is habitually inquisitive, well-informed, trustful of reason, open-minded, flexible, fair-minded in evaluation, honest in facing personal biases, prudent in making judgments, willing to reconsider . . . and persistent in seeking results which are as precise as the subject and the circumstances of inquiry permit. (Facione, 1990a, p. 3)

The Delphi Committee identified six skills (interpretation, analysis, evaluation, inference, explanation, and self-regulation), 16 subskills, and 19 dispositions (including inquisitiveness, open-mindedness, understanding others, and so on) that they associated with CT. These skills and dispositions provide a complex normative framework for understanding and assessing the qualities of human cognition.

Within CT theory and research, different conceptual traditions have emerged. According to psychological views, CT requires gaining mastery of a series of discrete skills or mental operations and dispositions that can be generalized across a variety of contexts. These skills include concepts such as interpreting, predicting, analyzing, and evaluating. The obvious and primary appeal of the skills discourse on CT involves its transfer between contexts. D. N. Perkins and Salomon (1988) reflect the skills approach by claiming, “Students often fail to apply knowledge and skills learned in one context to other situations. With well-designed instruction we can increase the likelihood that they will” (p. 22).

Bailin and Siegel (2003) reflect the philosophical tradition by arguing that psychological conceptions of CT are problematic for three reasons: (a) It is impossible to determine whether particular mental operations correlate with particular cases of good thinking, (b) there is no particular set of procedures that is either necessary or sufficient for CT, and (c) terms denoting thinking (for example, evaluating, interpreting, analyzing) refer not to mental operations or processes but rather to different tasks requiring thinking (p. 181). They suggest CT is a normative concept that requires mastery of context-specific knowledge to evaluate specific beliefs, claims, and actions. From this perspective, critical normally means making sound judgments and claims that meet epistemologically acceptable standards. Siegel (1988) characterizes CT as being “appropriately moved by reasons” (p. 23) and emphasizes particular criteria that must be satisfied for reasons to count as appropriate. Paul (1990) adopts a similar view of CT by suggesting it requires the ability and disposition to evaluate beliefs effectively and to identify and assess their underlying assumptions.

How Is CT Measured?

The question of how CT is measured is as important as understanding how CT can be taught and learned. How will we know if one intervention is more beneficial than another if we are uncertain about the validity and reliability of outcome measures? Researchers in education and human cognition have employed numerous assessment tools that cover a broad range of formats, origins, psychometric characteristics, areas of application, and scope of constructs to be measured. This situation presents serious challenges in identifying, categorizing, and evaluating learners’ outcomes in empirical research on CT in education. Even when researchers explicitly declare that they are assessing CT, there still remains the major challenge of ensuring that measured outcomes represent the construct of CT according to the operational definition adopted by reviewers.

There is also concern about the psychometric properties of the existing standardized measures of CT, such as the Watson-Glaser Critical Thinking Appraisal (WGCTA; G. Watson & Glaser, 1980), Cornell Critical Thinking Test (Ennis & Millman, 1985a, 1985b), California Critical Thinking Skills Test (Facione, 1990), and California Critical Thinking Disposition Inventory (Facione & Facione, 1992). A great deal of research has been undertaken to establish the validity and reliability of these measures—more for the WGCTA because it is the oldest—and the overall results are inconsistent (to say the least). In a recent analysis, Bernard et al. (2008) explored the intercorrelational structure of the subscales of the WGCTA, performing a factor analysis of the means of 70 studies of two versions of the test, and found that the best interpretation of these versions is represented by the total score rather than by the individual scores on the five subscales.

Is CT a Generic Skill?

No one would dispute that CT is applicable across a range of disciplinary areas, but there is little consensus about whether it is a set of generic skills that apply across subject domains (engineering, arts, science) or whether it depends on the subject domain and context in which it is taught (Ennis, 1989). If it is generic, then CT should be taught in specialized courses that focus on CT skills (Royalty, 1991; Sá, Stanovich, & West, 1999). If CT is dependent on subject matter, then it should be learned by tackling concrete problems in specific disciplines (Halliday, 2000; G. Smith, 2002).

The generalist view, supported by Siegel (1988), contends that identifying informal fallacies of reasoning, such as post hoc and hasty generalization fallacies, is transferable between different contexts. In his view, CT is partially defined by detecting such fallacies without regard to specific subject matter, because errors of reasoning are based on argument design rather than content. Alternatively, the specifist position, represented in the views of McPeck (1981), argues against general CT capacities on logical grounds, because thinking is always linked to a specific subject domain:

It makes no sense to talk about critical thinking as a distinct subject and therefore cannot profitably be taught as such. To the extent that critical thinking is not about a specific subject X, it is both conceptually and practically empty. (p. 5)

What Do We Know About Teaching Students to Think Critically?

It is not surprising that a search of the literature reveals thousands of documents devoted to issues related to the teaching of CT. There have been several reviews focusing on the effects of instruction on CT. For example, Adams (1999) summarized studies that reviewed changes in the CT abilities of professional nursing students. Allen, Berkowitz, Hunt, and Louden (1997, 1999) studied the impact of various methods for improving public communication skills on CT. Bangert-Drowns and Bankert (1990) reported some effects of explicit instruction for CT on measures of CT. Follert and Colbert (1983) analyzed research on debate training and CT improvements. Pithers and Soden (2000) reviewed methods and conceptions of teaching likely to inhibit or enhance CT. McMillan (1987) considered the effects of instructional methods, courses, programs, and general college experiences on changes in college students’ CT. The Assessment and Learning Research Synthesis Group (2003) at the University of London reviewed the impact of information and communication technologies on creative and CT skills.

What Are Some Instructional Approaches to Teaching CT?

In our analysis of CT instructional approaches, we will use Ennis’s (1989) CT typology of four courses—general, infusion, immersion, and mixed—for classifying and describing various instructional interventions. In the general course, CT skills and dispositions are learning objectives, without specific subject matter content. In contrast, content is important in both the infusion and immersion approaches. CT is an explicit objective in the infusion course but not in the immersion course. In the mixed approach, CT is taught as an independent track within a specific subject matter. These four approaches—general, infusion, immersion, and mixed—will be assessed in this review for their instructional efficacy.

According to Ennis (1989), the general approach attempts to teach CT abilities and dispositions separately from the presentation of the content of existing subject matter offerings, with the purpose of teaching CT. Examples of the general approach usually do involve some content but do not require that there be content.

Using a quasiexperimental design, Riesenmy, Mitchell, Hudgins, and Ebel (1991) taught self-directed CT directly (using Ennis’s [1989] general method) to 70 fourth- and fifth-grade students in St. Louis public schools. The authors expected that students who were taught the roles of four modes of thinking (task definer, strategist, monitor, and challenger) would perform better on a problem-solving posttest that demanded both lateral and vertical transfer of general thinking skills. This prediction was fulfilled by the results. Three groups of treated students outscored the control group: The group that wrote immediate posttests had greatly superior scores on average, a second group tested 4 weeks later outscored the controls, and a third group, tested 8 weeks later, also outperformed the control students.

The infusion of CT requires deep, thoughtful, and well-understood subject matter instruction in which students are encouraged to think critically in the subject. Importantly, in addition, general principles of CT skills and dispositions are made explicit.

Zohar, Weinberger, and Tamir (1994) used the infusion method in the Biology Critical Thinking Project to support seventh-grade biology students in Israel in developing their CT skills (which included recognizing logical fallacies, distinguishing between experimental findings and conclusions based on findings, identifying tacit and explicit assumptions, avoiding tautologies, isolating variables, testing hypotheses, and identifying relevant information). A true experimental design was used to test the efficacy of the program on two dependent variables developed for this study, a general CT test (administered before and after the training) and a biology CT test (posttest only). Average scores were reported for nearly 500 students, and the results were highly favorable for the program, as experimental students registered much higher average gain scores on the biology CT test and also outperformed the control group on the general CT test.

In the immersion approach, subject matter instruction is thought-provoking, and students do get immersed in the subject. However, in contrast to the infusion approach, general CT principles are not made explicit.

An example of CT immersion during instruction is provided by Kamin, O’Sullivan, and Deterding (2002), who used digital video case simulations followed by group discussions as an instructional method. One group of students who viewed the cases on video discussed the case online, a second group saw the videos and discussed them face-to-face, and a third group received a text account of the case (rather than a video) and participated in face-to-face discussions. Content analysis of the discussions was used to assess CT demonstrated by each group; results showed that although video presentation seemed to facilitate CT, the online discussion group scored highest. The authors suggested that the written discussion format provided better opportunities for the students to concentrate on articulating their ideas.

The mixed approach consists of a combination of the general approach with either the infusion or immersion approach. Under it, students are involved in subject-specific CT instruction, but there is also a separate thread or course aimed at teaching general principles of CT.

In a successful attempt at implementing Ennis’s (1989) mixed CT instructional strategy, McCarthy-Tucker (1998) reported that high school freshman and sophomore students in English and algebra who received instruction in formal logic as a supplement to their curricular instruction showed much greater improvement (from pretest to posttest) on two standardized measures of thinking, the Test of Logical Thinking (TLT) and the Content Specific Test of Logic (CSTL), than untreated control participants.

Goals of This Systematic Review

This article describes an ongoing meta-analysis that summarizes the available empirical evidence we have examined to date on CT development and utilization in educational contexts. The core research question is: What instructional interventions have an effect on the development and effective use of CT skills and dispositions, and to what extent, and under what circumstances? Our objectives are as follows: First, we will summarize the evidence on the impact of instruction on CT. Second, we will examine how certain methodological aspects of individual studies (such as research design, type of CT measure, and the method of the effect size extraction) moderate the magnitude of this impact. We want to know, for example, whether the effect of instruction has generally been found to be smaller in true experiments and larger in preexperimental studies. We are also interested in how effect sizes vary with CT measures, expecting that the impact of instruction will be smallest on standardized tests and largest on teacher-made tests.

Third, the role of several substantive study features will be analyzed. In particular, we want to know how different types of instructional interventions affect CT skills, what impact pedagogical background (e.g., instructor training) has, and how calculated effect sizes vary with age and educational level and whether collaborative work was part of the treatment.

By summarizing our work completed to date, we hope to inform the research and practitioner communities of the findings and also to solicit feedback from scholars and educators about the next stages in the review, including alternative and fine-grained analyses. We hope eventually to develop an empirically validated model of effective CT instruction.

Method

Literature Search Strategies and Data Sources

An extensive literature search was designed to identify and retrieve primary empirical studies relevant to the project’s major research question. The databases searched were ERIC, ProQuest Dissertations and Theses, PsycInfo, ABI/Inform Global on ProQuest, AACE Digital Library, OCLC PAIS International, EBSCO EconLit, EBSCO Academic Search Premier, Social Science Index, CSA Social Services Abstracts, and CSA Sociological Abstracts. The bibliographies of review articles and previous meta-analyses were scanned. In addition, the Educational Testing Service Test Collection database (see http://www.ets.org/portal/site/ets/) and ERIC were consulted to locate a comprehensive list of standardized tests that measure CT. The Web of Science database was also consulted to locate studies exploring the factor structure of CT standardized assessment tools. In this regard, a cited reference search was performed on the following authors: Clifford, Follman, Frisby, Harris, Johnson, Loo, Lowe, McMurray, Ross, Simon, and Whimbey.

The descriptor critical thinking was used when possible; otherwise, it was searched as a keyword. For noneducation databases, this concept was combined with the search statement educat* or teach* or learn* or student*. All results were limited to empirical studies when possible (e.g., the document type code 143 was used in ERIC), or the abovementioned search statements were combined with the keywords: control group or compar* or study or studies.

Search results span the 1960s through 2005, although the entire year of 2005 has not been searched, as the final searches occurred in the summer of 2005. A total of 3,720 studies were found.

Although the bulk of studies have likely already been identified, in the next stage of this project, all searches will be updated and further searches will be conducted in the following sources: British Education Index; Australian Education Index; CBCA Education; Education: A Sage Full-Text Collection; PubMed; Francis; various databases on the Evidence Network; and various Web search engines, including Google and Google Scholar.

Inclusion and Exclusion Criteria and Review Procedure

Decisions regarding whether to retrieve an article were based on a review of study abstracts. Decisions about whether to include studies in the review were based on reading the full text of the article. Both decisions were made by two researchers working independently who then met to discuss their judgments and to document their agreement rate. The following inclusion criteria were used: (a) accessibility—the study must be publicly available or archived; (b) relevancy—the study addresses the issue of CT development, improvement, and/or active use; (c) presence of intervention—the study presents some kind of instructional intervention; (d) comparison—the study compares outcomes that resulted from different types or levels of treatment (e.g., control group and experimental group, pretest and posttest, etc.); (e) quantitative data sufficiency—measures of relevant dependent variables are reported in a way that enables effect size extraction or estimation; (f) duration—the treatment in total lasted at least 3 hr; and (g) age—participants were no younger than 6 years old. If any of these criteria were not met, the study was rejected.
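The seven criteria above amount to a conjunctive screening rule: a study is retained only if every condition holds. A minimal sketch of that rule follows. The `StudyRecord` class and its field names are hypothetical, introduced here for illustration only; they do not come from the review's coding materials.

```python
from dataclasses import dataclass


@dataclass
class StudyRecord:
    """Hypothetical screening record; field names are illustrative only."""
    publicly_available: bool       # (a) accessibility
    addresses_ct: bool             # (b) relevancy
    has_intervention: bool         # (c) presence of intervention
    has_comparison: bool           # (d) comparison
    effect_size_extractable: bool  # (e) quantitative data sufficiency
    treatment_hours: float         # (f) duration, total hours of treatment
    min_participant_age: float     # (g) age of the youngest participants


def meets_inclusion_criteria(study: StudyRecord) -> bool:
    """Return True only if every criterion (a) through (g) is satisfied."""
    return all([
        study.publicly_available,
        study.addresses_ct,
        study.has_intervention,
        study.has_comparison,
        study.effect_size_extractable,
        study.treatment_hours >= 3,      # at least 3 hr in total
        study.min_participant_age >= 6,  # no younger than 6 years
    ])
```

Under this sketch, a study that satisfies six criteria but whose treatment lasted only 2.5 hr would be rejected, matching the rule that failing any single criterion excludes the study.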

Along with studies employing experimental and quasiexperimental designs, we also included studies that used preexperimental designs (e.g., one-group pretest–posttest designs). Our intent was not to eliminate methodologically less sound studies a priori but instead to code for research methodology. Including studies of variable quality meant we could determine empirically whether and in what ways the findings differed by research design and, as necessary, use weighted multiple regression to remove variance associated with study quality. In fact, the bulk of research on CT does not consist of true experiments with “randomized control trials.” This is largely because CT research has been conducted in classrooms, where randomization is difficult to achieve. Our approach allowed greater flexibility and greater statistical power in the analyses, because potentially promising findings from more typical applications (e.g., studies with higher external validity) are not excluded in advance of conducting the review (Abrami & Bernard, 2008).

Abstracts and full-text research reports were each coded for inclusion or exclusion by two raters. An individual coder’s ratings, at each stage, were specified to range from 1 (the study is definitely unsuitable for the purposes of the project) to 5 (the study is definitely suitable for the purposes of the project); the midpoint of 3 (doubtful but possibly suitable) was designated as a vote in favor of including the study. In other words, ratings from 3 to 5 suggested either the retrieval of the full-text document (at the abstract review stage) or inclusion of the study in further analyses (at the full-text review stage), whereas ratings of 1 or 2 meant the elimination of the study from further consideration. Interrater agreement rate at these two stages of the review was calculated and reported in two different ways: (a) as a coefficient of correlation between ratings given by independent coders across all reviewed papers and (b) as a percentage of studies, with respect to which both coders agreed whether to reject the study or to continue analyzing it.
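The rating rule and the two agreement indices can be sketched as follows. This is an illustration under stated assumptions, not the review's actual procedure: the text says only "coefficient of correlation," so the use of Pearson's r is an assumption, and the ratings shown are made up.

```python
from math import sqrt


def vote(rating: int) -> bool:
    """Ratings of 3 to 5 count as a vote to retain; 1 or 2 as a vote to reject."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    return rating >= 3


def percent_agreement(coder_a, coder_b):
    """Percentage of studies on which both coders cast the same retain/reject vote."""
    same = sum(vote(a) == vote(b) for a, b in zip(coder_a, coder_b))
    return 100.0 * same / len(coder_a)


def pearson_r(coder_a, coder_b):
    """Pearson correlation between the coders' raw 1-5 ratings (assumed index)."""
    n = len(coder_a)
    mean_a = sum(coder_a) / n
    mean_b = sum(coder_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(coder_a, coder_b))
    var_a = sum((a - mean_a) ** 2 for a in coder_a)
    var_b = sum((b - mean_b) ** 2 for b in coder_b)
    return cov / sqrt(var_a * var_b)


# Made-up ratings for five abstracts, for illustration only.
ratings_a = [5, 4, 2, 3, 1]
ratings_b = [4, 4, 3, 3, 1]
print(percent_agreement(ratings_a, ratings_b))
print(round(pearson_r(ratings_a, ratings_b), 3))
```

The two indices answer different questions: the percentage reflects agreement on the binary decision, whereas the correlation is sensitive to how closely the raw ratings track each other even when both coders land on the same side of the midpoint.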

The extent of uniformity between coders was also documented with regard to effect size extraction and to the calculation of study features. Each effect size was coded by two raters, and two agreement rates were produced: (a) a number between 50 and 100 was assigned to each study to reflect the degree of agreement between the raters with regard to how many effect sizes should be extracted from each study, and this number was averaged across studies; and (b) a similar procedure was applied with regard to agreement as to which calculation procedures should be used to determine each effect size. As for study features coding, each study was assigned a rating according to the percentage of the features on which the raters initially agreed; all disagreements were discussed until a final accord was negotiated. All agreement rates were averaged across studies, and the average rates are presented in the Results section below.

Measuring CT

The issue of measuring CT is a complex one. Our research team (Bernard et al., 2008) factor analyzed CT subscale weighted means from 60 data sets for the WGCTA. A strong general factor emerged for two versions of the WGCTA, and so for this review, we focused on global indices of CT only instead of on individual subscale means that are often reported in the literature where the WGCTA is used. Although we have not performed a similar analysis on other popular standardized CT measures, our findings and decision as to how to proceed with the WGCTA suggested that we should follow this strategy uniformly.

To account for variation in CT assessment tools, we decided to categorize them as follows.

Standardized tests. These are well-established measures of CT or particular thinking skills and dispositions, such as WGCTA, Cornell Critical Thinking Test, California Critical Thinking Skills Test, and California Critical Thinking Disposition Inventory.

Tests developed and evaluations conducted by a teacher. This category includes, for example, the content analysis of students’ responses to interview questions, open-ended questions, and essay type of tasks teachers used to address CT skills development in their students.

Tests developed by researchers (one or more of the study authors). These are nonstandardized measures developed by a researcher for use in a particular study, for example, Bonk, Angeli, Malikowski, and Supplee (2001) and VanTassel-Baska, Zuo, Avery, and Little (2002).

Tests developed by researchers (one of the study authors) who also taught the courses in question. These are developed by a researcher who was also the teacher or instructor of record. For example, in one of our included studies, the researchers (Zohar & Tamir, 1993) developed the Critical Thinking Application Test to evaluate performance in reasoning skills.

Secondary-source measures. These instruments are usually adopted from other sources with or without modifications. Researchers may use previously developed (standardized or unstandardized) instruments or modify them to meet the requirements of their research setting. For example, Feuerstein (1999) adapted the Language and Media Test developed by an Australian research group (Quin & McMahon, 1993) to suit her Israeli study according to local media texts.

Effect Size Extraction

Effect size is a standardized metric expressing the difference between two group means (usually a control and a treatment). Cohen’s d (Equation 1) is the biased estimator of effect size.

$d_{i} = \frac{\bar{X}_{Experimental} - \bar{X}_{Control}}{S_{Pooled}}$ (1)

There are also two modifications of this basic equation: one for studies reporting pretest data for both experimental and control groups and another for a single-group pretest–posttest design. In other cases (e.g., t tests, F tests, p levels), effect size is estimated using conversion formulas provided by Glass, McGaw, and Smith (1981) and Hedges, Shymansky, and Woodworth (1989).

To correct for bias in small samples, d was converted to the unbiased estimator g (Hedges & Olkin, 1985), as follows:

$g \cong \left(1 - \frac{3}{4N - 9}\right) d$ (2)
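The two equations can be illustrated with a short computation. This is a minimal sketch: the pooled standard deviation uses the standard two-group formula, which the text does not spell out and is therefore an assumption here, N is taken to be the total sample size across both groups, and the summary statistics are made up.

```python
from math import sqrt


def pooled_sd(sd_e, n_e, sd_c, n_c):
    """Pooled standard deviation (standard two-group form; an assumption here)."""
    return sqrt(((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / (n_e + n_c - 2))


def cohens_d(mean_e, mean_c, sd_pooled):
    """Equation 1: standardized difference between experimental and control means."""
    return (mean_e - mean_c) / sd_pooled


def hedges_g(d, n_total):
    """Equation 2: small-sample correction, with N taken as the total sample size."""
    return (1 - 3 / (4 * n_total - 9)) * d


# Made-up summary statistics, for illustration only.
s = pooled_sd(sd_e=10.0, n_e=20, sd_c=10.0, n_c=20)
d = cohens_d(mean_e=55.0, mean_c=50.0, sd_pooled=s)
g = hedges_g(d, n_total=40)
print(round(d, 3), round(g, 3))
```

With 40 participants in total, the correction factor is 1 - 3/151, so g is only about 2% smaller than d; the adjustment matters mainly for very small samples.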

The Q_T statistic was used to test for homogeneity of effect sizes (Hedges & Olkin, 1985). Q_T is a homogeneity statistic that is most commonly used in assessing a collection of effect sizes or correlation coefficients. When all findings share the same population value, Q_T has an approximate χ² distribution with k – 1 degrees of freedom, where k is the number of effect sizes or correlations. If the obtained Q_T value is larger than the critical value, the findings are determined to be significantly heterogeneous, meaning that there is more variability in the effect sizes or correlations than chance fluctuation would allow around a single population parameter.
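The homogeneity statistic can be sketched as a weighted sum of squared deviations from the mean effect size. The sketch below assumes inverse-variance weights under a fixed-effect model, which is the usual form in Hedges and Olkin (1985) but is not stated explicitly in the text; the effect sizes and variances are made up.

```python
def weighted_mean(effects, variances):
    """Inverse-variance weighted mean effect size (g+)."""
    weights = [1.0 / v for v in variances]
    return sum(w * g for w, g in zip(weights, effects)) / sum(weights)


def q_total(effects, variances):
    """Return (Q_T, degrees of freedom), where df = k - 1."""
    g_plus = weighted_mean(effects, variances)
    q = sum((g - g_plus) ** 2 / v for g, v in zip(effects, variances))
    return q, len(effects) - 1


# Made-up effect sizes and sampling variances, for illustration only.
effects = [0.10, 0.30, 0.80]
variances = [0.04, 0.02, 0.05]
q, df = q_total(effects, variances)
print(round(q, 3), df)
```

The obtained value would then be compared with the critical χ² value for the given degrees of freedom; a larger Q_T indicates significant heterogeneity.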

Q_T is a notoriously sensitive measure of heterogeneity that becomes more sensitive as sample size increases. Higgins and Thompson (2002), and more recently, Huedo-Medina, Sanchez-Meca, Marin-Martinez, and Botella (2006), recommend the use of I² as a complement to the interpretation of Q_T. I² represents heterogeneity in proportional terms as the percentage of variability in point estimates that is caused by heterogeneity rather than sampling error. Higgins and Thompson tentatively suggest the following interpretations of I²: “Mild heterogeneity might account for less than 30 per cent of the variability in point estimates, and notable heterogeneity substantially more than 50 per cent” (p. 1553).
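For illustration, I² can be obtained from Q_T and its degrees of freedom using the common definition I² = (Q_T - df) / Q_T, truncated at zero. The sketch below expresses it as a proportion and applies it to the values reported for this review (Q_T = 1,767.86, k = 161); the function name is hypothetical.

```python
def i_squared(q_total, k):
    """I^2 = (Q - df) / Q with df = k - 1, floored at zero, as a proportion."""
    df = k - 1
    if q_total <= 0:
        return 0.0
    return max(0.0, (q_total - df) / q_total)


# Values reported for this review: Q_T = 1,767.86 across k = 161 effect sizes.
print(round(i_squared(1767.86, 161), 3))
```

Multiplying the result by 100 gives the percentage form used in the Higgins and Thompson interpretation guidelines.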

Study Features

To explain variability in effect sizes, coded study features were individually assessed and also entered collectively in weighted multiple regressions. The following methodological features were tested: (a) type of research design (preexperimental, quasiexperimental, or true experimental), (b) type of CT measure (standardized, teacher-made, researcher-made, teacher- and researcher-made, or secondary-source measures), and (c) effect size extraction method (calculated from descriptive statistics, estimated with no assumptions made, or estimated with assumptions). The following substantive features were coded: (a) age level of participants (elementary or 6–11 years, early secondary or 11–15 years, high school or 16–18 years, undergraduate education, graduate education, or adult learners outside of formal school settings), (b) intervention type (general, infusion, immersion, or mixed), (c) pedagogical grounding of the intervention (instructor receives special training for teaching CT; extensive observations of course activities were reported to describe their relevance to CT skills development; the course curriculum was described in detail, including how its components were linked to the objective of CT skills development; or CT was simply declared to be among the course objectives, with no provision of any supporting information), and (d) peer collaboration (collaborative learning was part of the course, or no indication of collaborative work was reported).

Results

To date, searches of ERIC, PsycInfo, ABI, and Dissertation Abstracts International revealed 3,720 studies that were potentially suitable for further scrutiny. Judging from the review of abstracts, we marked 1,380 studies for retrieval (for dissertations, a representative sample was selected), of which 158 were retained after full-text review. This report includes data from 117 studies that were selected for inclusion. This is a much larger number of studies than any other review has previously included. Interrater agreement was as follows:

Abstract review, 94.9% (r = .794, p < .01)

Full-text review, 94.6% (r = .833, p < .01)

Agreement on numbers and categorization of effect sizes, 94.1%

Agreement rate on effect size calculation, 98.5%

Agreement rate on study features coding, 84.1%

Overall Analysis

The 117 studies yielded 161 independent effect sizes, including 27 effect sizes from true experiments, 74 effect sizes from quasi experiments, and 60 effect sizes from preexperiments. Figure 1 shows the distribution of g, and Tables 1 and 2 show the general statistical analyses associated with it. The distribution is positively skewed (skewness = 1.606) and platykurtic (kurtosis = 2.876), with an average effect size (g+) of 0.341 (k = 161, N = 20,698) and a standard deviation of 0.610. The standard error of the mean is 0.01. This represents an average advantage of about one third of a standard deviation for the treatment over the control. The distribution is highly heterogeneous (Q_T = 1,767.86, p < .001). Because the I² estimate here is greater than 0.90, we conclude that the distribution suffers from severe heterogeneity. This substantially weakens any claim that the average effect size is representative of population parameters. Significant heterogeneity does, however, open up the possibility of exploring effect size variability in terms of other study features.

Overall, the moderate average effect is consistent with the view that instruction improves CT skills and dispositions. But the large variability means the finding is neither uniform nor consistent, requiring further exploration to determine whether methodological and substantive features explain the differences among study findings and can account for the wide range of findings (i.e., from –1.0 to +2.75).

Outlier Analysis and Publication Bias

Outlier analysis. Outlier analysis seeks to determine if the removal of a certain number of effect sizes from a distribution of effect sizes greatly increases the fit of the remaining effect sizes to a simple model of homogeneity without drastically affecting substantive interpretation of the recalculated mean effect size. Three approaches are described by Hedges and Olkin (1985): one that involves a visual examination of a forest plot of the data, another that involves examining relative residuals, and a third that is based on the magnitude of the Q statistic for each effect size, in conjunction with what is known as a “one-study-removed” analysis. These methods will not always agree as to the outliers that should be considered for deletion, but according to Hedges and Olkin, they often do.

This review is marked by high heterogeneity of effect size (Q _T = 1,767.86) around an average effect size of g+ = 0.34. An examination of the results of the one-study-removed analysis revealed that the removal of any one effect size did not substantially affect g+. Only two studies on the negative tail raised g+ by as much as 0.03; one study on the positive tail lowered g+ by 0.02, and another lowered it by 0.04. All of these recalculated results remain within the 95% confidence interval of the original g+.

Examination of the Q statistic and relative residuals for each effect size revealed eight potentially outlying effect sizes, all related to large-sample studies (i.e., standard errors less than 0.10). All of these effect sizes had a Q statistic greater than 100. Three had a Q statistic above 200. Table 3 shows the changes to both g+ and to Q _T when three and eight effect sizes are removed.

While the reduction of Q _T is substantial, especially when eight effect sizes are removed (33%), neither removal comes close to producing a better-specified model (i.e., a homogeneous one). In addition, the removals produce an increase in g+, in both cases, of more than 20% (i.e., 22% and 23%, respectively). Both of these recalculated average effect sizes are above the 95% confidence interval of the original g+ of 0.34. Therefore, we decided not to remove effect sizes from the original distribution.
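
The two numerical outlier checks described in this section can be sketched as follows, assuming inverse-variance weights. The first function recomputes the weighted mean with each effect size left out in turn; the second returns each effect size's contribution to Q _T. The function names and the four effect sizes are hypothetical and serve only to illustrate the mechanics.

```python
def weighted_mean(effects, variances):
    """Inverse-variance weighted mean effect size."""
    weights = [1.0 / v for v in variances]
    return sum(w * g for w, g in zip(weights, effects)) / sum(weights)


def one_study_removed(effects, variances):
    """Weighted mean recomputed k times, each time omitting one effect size."""
    means = []
    for i in range(len(effects)):
        rest_g = effects[:i] + effects[i + 1:]
        rest_v = variances[:i] + variances[i + 1:]
        means.append(weighted_mean(rest_g, rest_v))
    return means


def q_contributions(effects, variances):
    """Each effect size's contribution to the total Q statistic."""
    g_plus = weighted_mean(effects, variances)
    return [(g - g_plus) ** 2 / v for g, v in zip(effects, variances)]


# Hypothetical data (not from this review): the last effect is an apparent outlier.
demo_g = [0.2, 0.3, 0.4, 1.5]
demo_v = [0.01, 0.01, 0.01, 0.01]

loo_means = one_study_removed(demo_g, demo_v)
contributions = q_contributions(demo_g, demo_v)
largest = max(range(len(contributions)), key=lambda i: contributions[i])
print(loo_means, contributions, largest)
```

Small sampling variances (large samples) inflate an effect size's Q contribution, which is consistent with the observation above that the flagged effect sizes all came from large-sample studies.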

Publication bias. The question answered through the analysis of publication bias is, “Are there a significant number of studies with null results that have not been uncovered through a search of the literature to nullify the effects found in the meta-analysis?” Two statistical approaches are generally considered acceptable for assessing this bias. Classic fail-safe N determines the number of null effect studies needed to raise the p value associated with the average effect above a specified level of α (in this case, 0.05). This test, performed in Comprehensive Meta-Analysis (Borenstein, Hedges, Higgins, & Rothstein, 2005), revealed that 30,539 additional studies would be required to nullify the effect. The second test, Orwin’s fail-safe N, estimates the number of missing null studies that would be required to bring the mean effect size of the meta-analysis to some specified trivial level. For this purpose, 0.10 was set as the trivial value. The number of missing null studies to bring the current mean effect to 0.10 is 388. Combined, these estimates suggest that there is little if any publication bias in this collection of effect sizes.
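
Both fail-safe statistics have simple closed forms, sketched below with hypothetical function names. Orwin's formula, with the missing studies assumed to average zero, reproduces the reported value of 388 from k = 161, g+ = 0.341, and a trivial value of 0.10. The classic (Rosenthal) version needs the z value of every effect size, which is not given in the text, so it is shown only on hypothetical z values; whether the software used a one- or two-tailed criterion is also an assumption left as a parameter.

```python
from statistics import NormalDist


def orwin_failsafe_n(k, mean_effect, trivial_effect, missing_mean=0.0):
    """Number of missing studies (averaging missing_mean) needed to pull the
    mean effect down to trivial_effect."""
    return k * (mean_effect - trivial_effect) / (trivial_effect - missing_mean)


def rosenthal_failsafe_n(z_values, alpha=0.05, tails=2):
    """Number of null studies needed to raise the combined p value above alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / tails)
    k = len(z_values)
    return sum(z_values) ** 2 / z_crit ** 2 - k


# Values reported in the text.
orwin_n = orwin_failsafe_n(161, 0.341, 0.10)
print(round(orwin_n))  # 388

# Hypothetical z values for four studies (not data from this review).
classic_n = rosenthal_failsafe_n([2.0, 2.0, 2.0, 2.0], alpha=0.05, tails=1)
print(classic_n)
```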

Methodological Features

Research designs. Table 4 shows the subgroup and total group effect sizes, and Table 5 contains the results of the heterogeneity analyses. The weighted means for subgroups ranged from +0.31 to +0.36. Although the variability within each of the subgroups is significantly heterogeneous, the variability explained among subgroups is not significant, Q _B(2) = 2.28, p = .32. Based on this, we decided to combine the data from all research designs in the subsequent analyses.
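
The subgroup test above rests on partitioning Q _T into within-group and between-group parts. The sketch below assumes inverse-variance weights and uses hypothetical function names and data. For 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(–Q/2), which reproduces the reported p of about .32 for Q _B(2) = 2.28; other degrees of freedom would need a general chi-square routine.

```python
from math import exp


def q_stat(effects, variances):
    """Weighted sum of squared deviations from the inverse-variance weighted mean."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * g for w, g in zip(weights, effects)) / sum(weights)
    return sum(w * (g - mean) ** 2 for w, g in zip(weights, effects))


def between_groups_q(groups):
    """Q_B = Q_T - sum of within-group Q values.

    groups maps a label to a pair (effects, variances).
    """
    all_g, all_v = [], []
    q_within = 0.0
    for effects, variances in groups.values():
        q_within += q_stat(effects, variances)
        all_g.extend(effects)
        all_v.extend(variances)
    return q_stat(all_g, all_v) - q_within


def chi2_upper_tail_df2(q):
    """Upper-tail probability of a chi-square variate with exactly 2 df."""
    return exp(-q / 2.0)


# Hypothetical subgroups (not data from this review).
demo_groups = {
    "true experimental": ([0.2, 0.4], [0.01, 0.01]),
    "quasiexperimental": ([0.5, 0.7], [0.01, 0.01]),
}
q_between = between_groups_q(demo_groups)

# Q_B(2) = 2.28 is the value reported in the text.
p_reported = chi2_upper_tail_df2(2.28)
print(round(p_reported, 2))  # 0.32
```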

Abrami and Bernard (2008) discuss various approaches to dealing with study quality in systematic reviews. These approaches include treating study quality as an inclusion criterion, presenting and analyzing the results separately for each study quality category, weighting studies for quality, and testing and adjusting (as needed) for differences in effect size caused by study quality. The advantages of the latter approach include increasing review sample size, maximizing representativeness and generalizability, and statistically controlling for differences attributable to design weaknesses.

Other methodological features. Two additional methodological study features were coded and analyzed: (a) type of CT measure and (b) method of extraction. Tables 6 and 7 show the breakdown of type of measure. Since Q _B was significant (indicating significant heterogeneity among levels), Bonferroni post hoc tests were conducted among pairs according to Hedges and Olkin (1985, p. 162). Each comparison is tested with z(γ̂) = γ̂/σ̂² compared with the 100(1 – α/2l) percentage point on the unit normal distribution, where l is the number of actual comparisons. This holds the family of comparisons at α = .05. Levels containing fewer than 10 effect sizes were considered too unstable for analysis, so teacher-made tests (k = 7) were not included. Results revealed that standardized measures, teacher- and researcher-made, and modified secondary-source measures were not different from one another but that they were all different from researcher-made measures.
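
The Bonferroni criterion described above can be sketched as follows. The critical value is the 100(1 – α/2l) percentage point of the unit normal distribution. As an assumption of this sketch, each pairwise contrast between two subgroup means is divided by its estimated standard error (the square root of the summed variances of the two means) before being compared with that critical value. Function names and the example means are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_critical_z(alpha, n_comparisons):
    """Two-tailed critical z holding the family of comparisons at alpha."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_comparisons))


def contrast_z(mean_a, var_a, mean_b, var_b):
    """z for the difference between two independent weighted subgroup means.

    var_a and var_b are the variances of the subgroup means themselves.
    """
    return (mean_a - mean_b) / sqrt(var_a + var_b)


# Four levels compared pairwise give l = 6 comparisons.
z_crit = bonferroni_critical_z(0.05, 6)

# Hypothetical subgroup means and variances (not data from this review).
z_obs = contrast_z(0.6, 0.005, 0.3, 0.005)
significant = abs(z_obs) > z_crit
print(z_crit, z_obs, significant)
```

With a single comparison the function returns the familiar two-tailed value of about 1.96; the criterion becomes stricter as the number of comparisons grows.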

Likewise, methods of effect size extraction were compared (see Tables 8 and 9). Extraction from descriptive statistics (i.e., means and standard deviations) and estimation with no assumptions (e.g., exact t values, exact probabilities) are more accurate than estimation with assumptions (e.g., p < .05 plus effect direction), which most often leads to an underestimation of the true effect size. Although significant, this between-group effect (Q _B) was weaker than that for measures. Bonferroni comparisons revealed no differences among the levels.

Substantive Study Features

Four substantive study features were coded: (a) age of participants, (b) type of intervention, (c) pedagogical grounding of the intervention, and (d) presence or absence of collaboration.

Age. The results for age of participants are shown in Tables 10 and 11. The intergroup difference (Q _B) was significant, so post hoc comparisons were conducted. Again, categories with fewer than 10 effect sizes were not interpreted. The post hoc results revealed that effect sizes for elementary-age participants (ages 6–10) were not significantly different from those for secondary-age participants (ages 11–15) but that both were significantly higher than those for undergraduate postsecondary students.

Type of intervention. The Q _B for type of intervention was significant, and post hoc analyses were performed. Mixed instructional approaches that combine both content and CT instruction significantly outperformed all other types of instruction. Immersion methods significantly underperformed all other approaches (see Tables 12 and 13).

Pedagogical grounding. This factor produced the strongest between-group effect (Q _B = 446.16), suggesting that the differences among the categories are the greatest. These results are shown in Tables 14 and 15. Examination of the means and post hoc analysis revealed that instructor training significantly outperformed the other three categories. In addition, the following differences were found (Category 2 > Categories 3 and 4; Category 3 > Category 4).

A follow-up regression analysis was performed to determine the joint contribution of intervention type and pedagogical grounding to overall variation. These two study features were dummy coded and run in hierarchical weighted multiple regression. The analysis revealed that in combination, these two instructional variables accounted for 32% of the total variation in effect size. However, Q _W remained heterogeneous.
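
A weighted regression of effect sizes on dummy-coded study features can be sketched as below. This is plain weighted least squares with inverse-variance weights, not the hierarchical entry procedure described above, and all names and data are hypothetical. The proportion of variation explained is taken as the model sum of squares divided by Q _T, and the residual sum of squares plays the role of the remaining heterogeneity.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_regression(effects, variances, predictors):
    """Return (coefficients, Q_total, Q_residual, proportion explained)."""
    w = [1.0 / v for v in variances]
    X = [[1.0] + list(row) for row in predictors]  # prepend an intercept column
    n, p = len(effects), len(X[0])
    xtwx = [[sum(w[i] * X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(w[i] * X[i][a] * effects[i] for i in range(n)) for a in range(p)]
    beta = solve(xtwx, xtwy)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean = sum(wi * g for wi, g in zip(w, effects)) / sum(w)
    q_total = sum(wi * (g - mean) ** 2 for wi, g in zip(w, effects))
    q_resid = sum(wi * (g - f) ** 2 for wi, g, f in zip(w, effects, fitted))
    return beta, q_total, q_resid, (q_total - q_resid) / q_total


# Hypothetical data: one dummy code (1 = instructor received CT training).
demo_g = [0.1, 0.3, 0.6, 0.8]
demo_v = [0.04, 0.04, 0.04, 0.04]
demo_x = [[0], [0], [1], [1]]
beta, q_total, q_resid, r_squared = weighted_regression(demo_g, demo_v, demo_x)
print(beta, q_total, q_resid, r_squared)
```

A feature with more than two categories would enter as several dummy columns, one fewer than the number of categories.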

Collaborative learning conditions. A final analysis of substantive study features was conducted on the basis of the presence or absence of collaborative learning conditions associated with each study and therefore each effect size. Tables 16 and 17 present the outcomes of this analysis. Although there is a 0.10 difference between the presence and absence of collaboration, and the Q _B of 9.25 is significant, this was considered a relatively minor difference compared with other instructional variables.

It should be reiterated that in all of these analyses of substantive study features, there was not one homogeneous mean effect size. This means that although the mean differences reported in this analysis are suggestive, they are by no means definitive.

Discussion

The data (161 effect sizes from 117 studies, including 27 true experiments) suggest a generally positive effect of instruction on students’ CT skills. However, the findings are not uniformly positive, and we found some evidence of negative effects. Nevertheless, given the complexity of the CT construct, we were surprised to find such positive effect sizes and are more curious now about understanding the data than when we began our investigation.

We explored several substantive features in an attempt to explain the variable findings. We were especially interested in instructional variables and found that both the type of CT intervention and the pedagogical grounding of the CT intervention contributed significantly and substantially to explaining variability in CT outcomes. Together, these two instructional variables explained 32% of the variance in effect sizes, meaning that improved CT skills and dispositions are associated with how CT instruction is provided.

Course content and the curriculum matter. Although all the subgroup means for the type of CT intervention were significantly greater than zero, they were not uniformly so. The mixed method, where CT is taught as an independent track within a specific content course (e.g., Boodt, 1984; Browne, Haas, Vogt, & West, 1977; Brownell, 1953; Crow & Haws, 1985; Hartman-Haas, 1984; Harty, Woods, Johnson, & Pifer, 1986; Klassen, 1983; Loesch-Griffin, 1986; Martin, Craft, & Sheng, 2001; Marzano, 1989; McCarthy-Tucker, 1998; Robinson, 1987; Stoiber, 1991), had the largest effect, whereas the immersion method, where CT is regarded as a by-product of instruction, had the smallest effect. Moderate effects were found for both the general approach, where CT skills are the explicit course objective, and the infusion approach, where CT skills are embedded into the course content and explicitly stated as a course objective. The smallest effects, for the immersion method, occurred where CT skills were not an explicit course objective. Such indirect instruction is the least effective approach. Whether CT is taught separately from content or embedded within content seems to be a less important distinction empirically. This is an important finding for the design of courses. Making CT requirements a clear and important part of course design is associated with larger instructional effects. Developing CT skills separately and then applying them to course content explicitly works best; immersing students in thought-provoking subject matter instruction without explicit use of CT principles was least effective.

Pedagogy matters. When instructors received special advanced training in preparation for teaching CT skills (e.g., Arlo, 1969; Daley, Shaw, Balistrieri, Glasenapp, & Piacentine, 1999; Feuerstein, 1999; Hook, Jacobs, & Crisp, 1969; MacPhail-Wilcox, Dreyden, & Eason, 1990; Martin et al., 2001; McCarthy-Tucker, 1998; McConney, McConney, & Horton, 1994; Riesenmy et al., 1991; VanTassel-Baska et al., 2002; Zohar et al., 1994; Zohar & Tamir, 1993), or when extensive observations on course administration and instructors’ CT teaching practices were reported, the impacts of the interventions were greatest. By contrast, impacts of CT were smallest when the intention to improve students’ CT was only listed among the course objectives and there were no efforts at professional development or elaboration of course design and implementation. These results suggest that better outcomes can be achieved through active, purposeful training and teacher support both at the preservice and in-service levels. The results also implicate the design of courses in which CT skills are addressed, either generically or in concert with course content. Maximizing impact requires both the willingness to incorporate CT instruction and explicit strategies and skills to do it effectively.

We also found that collaboration among students while developing CT skills appears to provide some advantage but that this effect is minor compared with other substantive instructional study features. Recently, theoretical work has appeared in the literature of distance and Web-based education relating CT and online collaboration (e.g., Fahy, 2005; Garrison, Anderson, & Archer, 2001; C. Perkins & Murphy, 2006). Because most of the empirical research that has appeared to date has been qualitative in nature, it is difficult to judge now what this adaptation of classroom practice will yield.

These findings make it clear that improvements in students’ CT skills and dispositions cannot be a matter of implicit expectation. As important as the development of CT skills is, educators must take steps to make CT objectives explicit in courses and to integrate them into both preservice and in-service training and faculty development. If the outcome is worth it, the effort is worth it. One unanswered question, of course, is whether results that are achieved in particular classrooms have a lasting or transitory effect and, in this sense, whether the outcome is worth the effort. It seems unlikely that a “hit-or-miss” approach to developing thinking skills, or any skill for that matter, will yield satisfactory results that extend very far into the future.

The next steps in our review of the evidence include even more complete and more detailed examinations of the evidence. There are more methodological and substantive features we wish to explore, including treatment duration, subject matter, student characteristics, instructor characteristics, and so on. We hope to be able to account for further variation in effect size based on these study features so that we can sort out the various conditions of intervention type, learner type, subject matter, and so on that better explain why high levels of success are achieved in certain classroom applications and not in others.

Arguably the most elusive and, at the same time, most important aspect of the work to be completed concerns the quality of CT interventions. Specifically, we wonder whether and to what extent implementation fidelity is related to effect sizes both overall and for specific CT interventions.

Our future work will extend from this Stage 1 review to a finer-grained analysis of methodological and substantive issues, and we will attempt to more completely answer the question: What are the specific elements of instructional practice that develop better CT and learning outcomes for students? Much like a single primary investigation cannot address every critical hypothesis, a single review may not address every question about the collection of evidence. Studies follow studies and so too should reviews follow reviews.

These initial results appear to contradict some of the previous reviews (e.g., McMillan, 1987) that have found that instructional interventions have little effect on the development of CT skills and dispositions. However, we cannot make final claims about the state of the population until the wide heterogeneity in study findings has been further reduced. We decided to report these intermediate results both to distribute the tentative findings to educators and scholars and also to get feedback while the work is in progress. We hope others find these preliminary data as interesting and compelling as we do. At the same time, we welcome comments and criticisms that may lead us to improve the quality of our undertaking.

If instruction can influence the learning of CT skills and the development of the disposition to think critically, as registered on posttest measures of CT, that finding would be one important result of continuing this work. We hope to draw new attention to CT instruction in schools and possibly to suggest some new directions for pedagogy, both with and without the use of technology, given our wider interests (Bernard et al., 2004).

Otherwise, efficiency concerns need to be explored. In other words, we need to consider whether to change curricula to address development of CT skills, without neglecting subject matter content. Undoubtedly, we will encounter in this question the debate about fundamental educational competencies and where in this schema to place CT.

Finally, a more important question that may never be fully answered is the following: How much does learning to think critically in classroom settings lead to better citizens who are more discerning, people who are more analytical in their professions, or parents who can think carefully through the variety of choices that face them in raising a family in a complex and challenging world? That is the true test of CT instruction, specifically, and indeed, the true test of education in general.

Footnotes

An earlier version of this article was presented at the 2006 annual meeting of the American Educational Research Association in San Francisco. This study was supported by grants to Abrami and Bernard from Fonds québécois de la recherche sur la société et la culture and the Social Sciences and Humanities Research Council of Canada. The authors express appreciation to Anna Peretiatkovicz for her assistance and contributions.

Figure and Tables

References
