The application of the reverse fragility index to randomized controlled trials in hand surgery

Abstract

The reverse fragility index (RFI) is a tool used to measure the robustness of statistically non-significant outcomes by quantifying the minimum number of event changes required to achieve significance. This study applies the RFI and reverse fragility quotient (RFQ) to hand surgery randomized controlled trials (RCTs). A systematic review of 15 journals yielded 25 RCTs with 43 non-significant, dichotomous outcomes published from 2000 to 2024. The RFI was calculated using Fisher’s exact test and the RFQ was calculated by dividing the RFI by the sample size. The median RFI was 4 and the median RFQ was 0.056. Loss to follow-up was not clearly reported in 24% of studies and exceeded the RFI in 56% of outcomes for which it was clearly reported. Hand surgery RCTs demonstrate fragility in non-significant findings, with a small number of event changes potentially altering conclusions. The RFI and RFQ are useful measures that complement traditional p-values, offering more context in interpreting clinical trial results.

Keywords

Randomized control trials, randomized clinical trial, RCT, RFI, statistics

Introduction

The efficacy of treatments in randomized controlled trials (RCTs) is evaluated using statistical significance, typically assessed with p-values. The commonly accepted threshold is a p-value of 0.05 (Freiman et al., 1978; Wasserstein and Lazar, 2016). A p-value below this threshold indicates that, if there were no true difference between groups, a difference at least as large as the one observed would arise by random chance less than 5% of the time; the difference is therefore attributed to the intervention (Thoma et al., 2004; Walsh et al., 2014). However, interpreting an RCT solely on the basis of its p-value can lead to misinterpretation (Fahmy et al., 2024). Some argue that an overreliance on p-values overshadows other critical aspects of a study that should be considered, such as study design, statistical power, loss to follow-up and sample size (Rohrich et al., 2020; Samargandi et al., 2018; Zaniletti et al., 2022).

The fragility index (FI) is a quantitative tool used to assess the robustness of statistically significant findings by measuring the number of event conversions required to change a trial’s results from statistically significant to non-significant (Walsh et al., 2014). However, its application is limited to trials with statistically significant outcomes. The reverse fragility index (RFI) metric is used to assess null trials by quantifying the minimum number of event changes needed to convert statistically non-significant results into statistical significance (Khan et al., 2020). The reverse fragility quotient (RFQ) is an additional metric that takes the sample size into consideration by dividing the RFI by the sample size. The RFQ facilitates the comparison of reverse fragility among RCTs with different sample sizes.

The challenges of interpreting p-values are exacerbated by the inherently small sample sizes and low event rates of RCTs (Ioannidis, 2005; Khan et al., 2020; Thiese et al., 2016), particularly in resource-intensive and time-sensitive fields like hand surgery. These limitations increase the risk of type II errors (false negatives), whereby a larger sample size might reveal a statistically significant effect that smaller trials overlook. In hand surgery literature, even minor changes in outcomes for a few patients can significantly impact results, potentially reversing trial conclusions. The application of the RFI offers a novel way to reassess trials and identify overlooked interventions or therapeutics.

Metrics like FI and RFI provide a more nuanced framework for evaluating evidence from RCTs. In this study, we applied the RFI and RFQ to assess the robustness of statistically non-significant outcomes in hand surgery RCTs. Furthermore, we sought to identify study characteristics associated with RFI and RFQ values.

Methods

Search strategy

A systematic search of the MEDLINE database was conducted to identify RCTs covering hand surgery topics published between 1 January 2000 and 30 November 2024. This time frame was chosen to capture a comprehensive range of studies in the last 25 years of hand surgery. The journals searched were Journal of Bone and Joint Surgery, Bone and Joint Journal (formerly Journal of Bone and Joint Surgery (British Volume)), Journal of Hand Surgery (American Volume), Journal of Hand Surgery (European Volume), Journal of Hand Surgery (Asian-Pacific Volume), Journal of Hand Surgery Global Online, Journal of Shoulder and Elbow Surgery, BMC Musculoskeletal Disorders, HAND, Clinical Orthopaedics and Related Research, Journal of Orthopaedic Surgery, Plastic and Reconstructive Surgery, Plastic and Reconstructive Surgery – Global Open, Journal of Plastic, Reconstructive & Aesthetic Surgery and Annals of Plastic Surgery. Table S1 shows the reproducible search strategy.

Eligibility criteria

Two reviewers (RA and BP), who were trained in research methodology, independently reviewed all the RCTs and included studies that used a 1:1 parallel arm design and contained statistically non-significant, dichotomous primary outcomes. Studies were excluded if they were non-randomized or pilot studies, if they employed multigroup, split-body, cluster, crossover or factorial designs, or if they did not report primary endpoints. Outcomes were excluded if they were significant, non-dichotomous or secondary. The specific search strategy and inclusion and exclusion criteria are presented in Figure 1. Any discrepancies were resolved by joint review and discussion.

Figure 1.

Flow diagram of search strategy.

Data extraction

We extracted data using a standardized form that contained variables previously collected in similar investigations assessing the RFI of RCTs (Chin et al., 2019; Khan et al., 2020). Variables collected included study outcomes, clinical trial location, blinding type, type of enrollment (e.g. single- vs. multicentre), type of funding (e.g. government, private), sample sizes, event rates, loss to follow-up, follow-up duration, whether a power analysis was performed and acknowledgement of potential underpowering. For studies with multiple primary outcomes, all eligible outcomes were included. Secondary outcomes or outcomes not explicitly identified as primary were excluded.

Outcomes

The primary outcome of this study was the RFI and the RFQ presented as the median at the p = 0.05 threshold with corresponding interquartile ranges (IQR). Secondary outcomes included the proportion of RCTs in which loss to follow-up exceeded the RFI, and the analysis of associations between study characteristics and both RFI and RFQ.
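As a minimal sketch of how a median and IQR summary of this kind can be computed, the following Python example uses the standard library. The function name `median_iqr` and the input values are hypothetical, and the quartile convention (linear interpolation between observations, the `inclusive` method) is an assumption, as the convention used in this study is not stated.

```python
from statistics import median, quantiles


def median_iqr(values):
    """Return (median, Q1, Q3) for a list of numbers.

    Assumption: quartiles use linear interpolation between the
    observed values (the "inclusive" method of statistics.quantiles).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


# Hypothetical values, not data from the included trials.
example = [9, 1, 6, 2, 10, 4]
print(median_iqr(example))  # (5.0, 2.5, 8.25)
```

Other quartile conventions can give slightly different bounds for small samples, so the convention is worth stating when reporting an IQR.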

Calculation of RFI and RFQ

The RFI was calculated by modifying the outcome events in a 2 × 2 contingency table. We first identified the group (e.g. intervention vs. control) with fewer events, then iteratively subtracted events from this group while simultaneously adding non-events to it, keeping the number of participants constant, until Fisher’s exact two-sided test gave a p-value less than 0.05, indicating that significance was achieved. The total number of event reversals required to achieve significance was recorded as the RFI (Baer et al., 2021; Khan et al., 2020). Figure 2 displays a 2 × 2 contingency table and demonstrates an example calculation in which two event reversals were required to reach statistical significance, corresponding to an RFI of 2.

Figure 2.

Sample reverse fragility index calculation with a 2 × 2 contingency table.

If the event count in the group with fewer events reached zero before significance was achieved, the process was reversed: events were iteratively added to the opposite group while non-events were removed from it, again maintaining the total number of participants, until significance was achieved. If the p-value was already significant upon the initial recalculation with Fisher’s exact test, the RFI was assigned a value of zero.
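The procedure described above can be sketched in Python. This is an illustrative sketch only and is not the software used in this study: the function names `fisher_two_sided` and `reverse_fragility_index` are hypothetical, the handling of ties (when both arms have the same number of events, the first arm is modified) is an assumption, and the two-sided p-value is taken as the sum of the probabilities of all tables that are no more likely than the observed one, which is one common definition of the two-sided Fisher’s exact test.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]].

    Rows are trial arms; columns are events and non-events.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first arm.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at most as likely as the observed one
    # (small relative tolerance guards against rounding error).
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def reverse_fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Count the event reversals needed to reach p < alpha."""
    if not (0 <= events_1 <= n_1 and 0 <= events_2 <= n_2):
        raise ValueError("events must lie between 0 and the arm size")
    # Each arm is [events, non-events]; `low` is the arm with fewer events.
    arms = [[events_1, n_1 - events_1], [events_2, n_2 - events_2]]
    low, high = sorted(arms, key=lambda arm: arm[0])
    reversals = 0
    while fisher_two_sided(low[0], low[1], high[0], high[1]) >= alpha:
        if low[0] > 0:        # convert an event to a non-event
            low[0] -= 1
            low[1] += 1
        elif high[1] > 0:     # low arm exhausted: add events to the other arm
            high[0] += 1
            high[1] -= 1
        else:
            raise ValueError("significance cannot be reached")
        reversals += 1
    return reversals


# Hypothetical trial: 1/10 events in one arm, 4/10 in the other.
print(reverse_fragility_index(1, 10, 4, 10))  # 2
```

In this hypothetical trial the p-value moves from approximately 0.30 to 0.087 after the first reversal (0/10 vs. 4/10) and to 0.033 after the second (0/10 vs. 5/10), so the sketch returns an RFI of 2 and exercises both the subtraction step and the reversed step.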

The RFQ for each outcome was calculated by dividing the RFI by the sample size. While the RFI is an absolute measure that does not account for sample size, the RFQ enables a comparison of the reverse fragility across RCTs with varying sample sizes. For example, in a study with a sample size of 100 and an RFI of 5, the RFQ would be 0.05, indicating that 5% of participants would need an event reversal to achieve significance.
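A minimal sketch of this calculation in Python follows; the function name `reverse_fragility_quotient` is hypothetical and the trial sizes in the comparison are illustrative values rather than data from the included studies.

```python
def reverse_fragility_quotient(rfi, total_sample_size):
    """Return the RFQ: the RFI divided by the total sample size."""
    if total_sample_size <= 0:
        raise ValueError("total_sample_size must be positive")
    return rfi / total_sample_size


# Worked example from the text: sample size 100, RFI 5.
print(reverse_fragility_quotient(5, 100))  # 0.05

# Two hypothetical trials with the same RFI but different sizes:
# the smaller trial needs reversals in a larger share of participants.
print(reverse_fragility_quotient(4, 40))   # 0.1
print(reverse_fragility_quotient(4, 400))  # 0.01
```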

Statistical analysis

The RFI and RFQ were reported as medians with IQRs. To determine any associations between study characteristics and both RFI and RFQ, we used the Kruskal–Wallis test for categorical variables with three or more categories and the Mann–Whitney U-test for categorical variables with two categories (Khan et al., 2020).
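To illustrate what these two rank-based tests compute, the following Python sketch derives the Mann–Whitney U statistic and the tie-corrected Kruskal–Wallis H statistic from pooled midranks. It is a sketch under stated assumptions and not the analysis code used in this study: the function names and input values are hypothetical, and only the test statistics are computed; in practice the p-values would be obtained from a statistical package.

```python
from collections import Counter


def midranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return the smaller of the two Mann-Whitney U statistics."""
    ranks = midranks(list(x) + list(y))
    u_x = sum(ranks[:len(x)]) - len(x) * (len(x) + 1) / 2
    return min(u_x, len(x) * len(y) - u_x)


def kruskal_wallis_h(*groups):
    """Return the Kruskal-Wallis H statistic with the tie correction."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = midranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    if ties == n ** 3 - n:
        raise ValueError("all values are identical")
    return h / (1 - ties / (n ** 3 - n))


# Hypothetical RFQ-like values for two and three study categories.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))               # 0.0
print(kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9]))
```

Both statistics depend only on the ranks of the observations, which is why these tests suit skewed measures such as the RFI and RFQ that are summarized by medians.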

Results

Our search strategy identified 297 studies published in the 15 included journals. Following full-text review, 25 studies with a total of 43 statistically non-significant, dichotomous primary outcomes were included in the analysis (Figure 1). Nearly a quarter (24%) of the RCTs were published in Journal of Hand Surgery (European Volume), followed by Journal of Bone and Joint Surgery (16%) and Bone and Joint Journal (16%). Six of the 15 journals searched yielded no RCTs that met the study criteria. Of the 25 RCTs included, six did not report or clearly indicate loss to follow-up, seven did not perform power analysis calculations and 11 did not discuss potential study underpowering. Table 1 displays the characteristics of the included studies. Table S2 includes a full list of included studies as well as study and outcome characteristics.

Table 1.

Characteristics of included RCT studies

Characteristics of RCT study (n = 25)	n (%)
Journal
Journal of Hand Surgery (European Volume)	6 (24%)
Journal of Bone and Joint Surgery	4 (16%)
Bone and Joint Journal	4 (16%)
Journal of Shoulder and Elbow Surgery	3 (12%)
Journal of Hand Surgery (American Volume)	3 (12%)
Plastic and Reconstructive Surgery	2 (8%)
Clinical Orthopaedics and Related Research	1 (4%)
Journal of Plastic, Reconstructive & Aesthetic Surgery	1 (4%)
HAND	1 (4%)
Annals of Plastic Surgery	0
BMC Musculoskeletal Disorders	0
Journal of Hand Surgery (Asian-Pacific Volume)	0
Journal of Hand Surgery Global Online	0
Journal of Orthopaedic Surgery	0
Plastic and Reconstructive Surgery – Global Open	0
Region of origin
Asia	5 (20%)
Europe	10 (40%)
United States	6 (24%)
Others	4 (16%)
Enrollment type
Single centre	19 (76%)
Multicentre	6 (24%)
Funding type
Private	3 (12%)
Government	1 (4%)
No funding	7 (28%)
Not reported	14 (56%)
Blinding
Double blinded	5 (20%)
Single blinded	3 (12%)
No blinding	8 (32%)
Not reported	9 (36%)
Loss to follow-up
Reported	19 (76%)
Not reported	5 (20%)
Unclear	1 (4%)
Power analysis
Yes	18 (72%)
No	7 (28%)
Mention of possible underpowering
Yes	14 (56%)
No	11 (44%)

Abbreviations: RCT, randomized controlled trial.

The median RFI was 4 (IQR, 3–5), indicating that a change in event status for four participants would result in statistical significance. The distribution of RFI across the 43 outcomes is shown in Figure 3. The median RFQ was 0.056 (IQR, 0.043–0.072). There were no outcomes with an RFI of 0. The median sample size was 89 participants (IQR, 53–102) with a median of eight events (IQR, 5–21.5). Loss to follow-up was clearly reported in 76% of studies (19/25), accounting for 36 outcomes. Among those 36 outcomes, the median loss to follow-up was 8 (IQR, 2–18.75), and 55.6% (20/36) had loss to follow-up that exceeded the RFI. The characteristics of analysed non-significant outcomes are presented in Table 2. The median RFQ was significantly higher for studies that did not conduct a power analysis than for those that did (0.074 [IQR, 0.056–0.112] vs. 0.055 [IQR, 0.040–0.070]; p = 0.045). There were no other significant associations between RFI or RFQ and trial characteristics (Table S3).

Figure 3.

Distribution of reverse fragility index across 43 study outcomes.

Table 2.

Characteristics of analysed non-significant outcomes

Outcome characteristics (n = 43)	Median (IQR)
RFI	4 (3–5)
RFQ	0.056 (0.043–0.072)
Total sample size	89 (53–102)
Total events	8 (5–21.5)
Total loss to follow-up (n = 36)*	8 (2–18.75)
Loss to follow-up compared to RFI (n = 36)*
Loss to follow-up > RFI, n (%)	20 (55.6)
Loss to follow-up < RFI, n (%)	16 (44.4)

Abbreviations: IQR, interquartile range; RCT, randomized controlled trial; RFI, reverse fragility index; RFQ, reverse fragility quotient.

*Includes 36 outcomes with clearly reported loss to follow-up.

Discussion

In this systematic review of hand surgery RCTs, we assessed the RFI of 25 studies that included 43 statistically non-significant primary outcomes. This analysis revealed an overall median RFI of 4, which indicates that four event reversals are required to flip a statistically non-significant result to a significant one. The overall median RFQ was 0.056, indicating that 5.6% of the sample would need event reversals to achieve significance. The outcomes had a median total sample size of 89, with a median of eight participants lost to follow-up. These findings indicate fragility in hand surgery RCTs.

To our knowledge, no prior literature has applied the RFI within hand surgery, although the FI has been reported. A systematic review of hand surgery articles showed a median FI of 3, with loss to follow-up exceeding the FI in 40% of trials (Ruzbarsky et al., 2019). Previous literature evaluating the RFI across various topics in orthopaedic surgery has observed varying degrees of robustness. Systematic reviews have reported a median RFI of 7 (range, 1–40) for RCTs on total joint arthroplasty (Herndon et al., 2021), a median RFI of 5 and median RFQ of 0.05 for orthopaedic trauma (Forrester et al., 2021), a median RFI of 4 (IQR, 3–5) and median RFQ of 0.06 (IQR, 0.03–0.07) for reverse total shoulder arthroplasty (Yendluri et al., 2024), and a median RFI of 3 (IQR, 2–3) and median RFQ of 0.03 (IQR, 0.02–0.06) for RCTs comparing rerupture rates after Achilles tendon repair versus rehabilitation (Bragg et al., 2024).

A priori power calculations and clearly defined primary outcomes are critical components of RCTs to yield robust and valid results. Without these elements, studies risk being underpowered, leading to invalid conclusions. Our findings revealed gaps in methodological rigour as 28% of studies analysed did not conduct power analysis, and 9% of studies with non-significant results failed to specify whether outcomes were primary or secondary, precluding their inclusion in our systematic review. The issue of underpowered clinical trials remains a prevalent challenge in medical research, often making it difficult to draw meaningful conclusions from non-significant studies that could otherwise have clinical significance (Chung et al., 2002). Although larger sample sizes can improve statistical power and reduce fragility, fields like hand surgery often have small sample sizes, making this approach less feasible. The RFI is an especially valuable tool to assess the robustness of findings in studies with limited sample sizes to help ensure they are not overlooked despite their potential clinical relevance.

Examining loss to follow-up is an essential step in understanding the potential biases that can influence clinical research findings. In our study, 24% (6/25) of the included RCTs did not clearly report loss to follow-up. This complicates interpretation of the results and underscores the need for improved reporting standards. Loss to follow-up exceeded the RFI in 56% of outcomes where loss to follow-up was clearly reported. Although it is possible that individuals lost to follow-up had outcomes similar to those who remained in the study, this assumption can introduce bias. Those lost to follow-up may have substantially different outcomes, and in extreme cases their exclusion can skew study results enough to flip the interpretation of findings entirely (Bell et al., 2014). Therefore, study outcomes with loss to follow-up exceeding the RFI warrant closer attention, as the uncertain outcomes of these participants can significantly alter results, shift statistical significance, and ultimately influence clinical practice and management decisions (Dettori, 2011; Walsh et al., 2015). However, it is important to note that loss to follow-up exceeding the RFI does not imply that statistical significance would have been achieved had those patients been captured (Oeding et al., 2024). That interpretation assumes a worst-case scenario in which the missing data would flip the significance of the result. To improve transparency, studies should report the number of participants lost to follow-up alongside the results of RCT findings.
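The comparison between loss to follow-up and the RFI can be expressed as a short Python sketch. The function name `share_exceeding_rfi` and the example pairs are hypothetical; only the counts in the final line (20 of 36 outcomes) come from this study.

```python
def share_exceeding_rfi(outcomes):
    """Count outcomes whose loss to follow-up exceeds the RFI.

    `outcomes` is a list of (loss_to_follow_up, rfi) pairs, restricted
    to outcomes with clearly reported loss to follow-up.
    Returns (count exceeding, proportion exceeding).
    """
    exceeding = sum(1 for lost, rfi in outcomes if lost > rfi)
    return exceeding, exceeding / len(outcomes)


# Hypothetical (loss to follow-up, RFI) pairs for four outcomes.
print(share_exceeding_rfi([(8, 4), (2, 5), (10, 3), (0, 4)]))  # (2, 0.5)

# Reported in this study: 20 of 36 outcomes exceeded the RFI.
print(round(100 * 20 / 36, 1))  # 55.6
```

A flagged outcome indicates only that the number of missing participants is large enough to matter in a worst-case scenario, not that the result would have become significant.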

The FI and RFI serve as complementary tools for evaluating the robustness of outcomes in RCTs; however, they address different aspects of statistical significance. The FI examines statistically significant outcomes with a more conservative lens, calculating the minimal number of event changes required to render the result non-significant. Conversely, the RFI identifies the minimum number of event changes needed to achieve statistical significance. Importantly, the RFI should not be misinterpreted as evidence that previously non-significant results inherently have an underlying significant outcome. Instead, it serves as a supplemental measure to the p-value and FI, offering additional context when interpreting the robustness and clinical relevance of both significant and non-significant findings. By integrating RFI with FI and p-value reporting, researchers can provide a more nuanced evaluation of trial outcomes, facilitating better-informed decisions in evidence-based practice.

Several limitations should be noted in this study. First, the literature search was restricted to 15 journals within a defined time frame, potentially excluding other relevant RCTs in hand surgery from consideration. Second, the application of RFI is limited to parallel arm RCTs with non-significant dichotomous primary outcomes, and therefore, cannot be used to evaluate clinically important continuous outcomes, which are common in hand surgery studies. Third, it is important to note that RFI calculations should only be applied to outcomes with sufficient statistical power. Power analyses are typically conducted for primary outcomes, which are generally the only endpoints designed with adequate statistical power. In contrast, secondary outcomes are typically not powered and thus, RFI calculations cannot be applied to them. Lastly, the FI and RFI are calculated using Fisher’s exact test (Walsh et al., 2014), a method that may not be suitable for all datasets, particularly if the data do not meet the test’s underlying assumptions.

This study demonstrates the application of the RFI and RFQ to hand surgery RCTs with statistically non-significant outcomes. The findings showed that outcomes are often fragile, with non-significant results resting on a small number of event reversals. Reporting the RFI and RFQ alongside p-values, together with clearly reported loss to follow-up data and power analyses, provides clearer context for interpreting findings, particularly in studies with small sample sizes. Using these additional measures in studies improves transparency and strengthens evidence-based decision-making in hand surgery.

Supplemental Material

Supplemental material, sj-pdf-1-jhs-10.1177_17531934251327880, sj-pdf-2-jhs-10.1177_17531934251327880 and sj-pdf-3-jhs-10.1177_17531934251327880, for The application of the reverse fragility index to randomized controlled trials in hand surgery by Rodney Ahdoot, Bhuvan Pottepalem, Trista M. Benítez, Chien-Wei Wang and Kevin C Chung in Journal of Hand Surgery (European Volume)

Footnotes

Acknowledgements

None.

Declaration of conflicting interests

The authors disclose the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Dr Chung receives funding from the National Institutes of Health and book royalties from Wolters Kluwer and Elsevier, and research funding from Sonex. Ms Benítez receives institutional support through a Diversity Supplement from the National Institutes of Health awarded to the section of Plastic Surgery at the University of Michigan. All other authors have no financial disclosures.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Ethical approval

Not applicable.

Informed consent

Not applicable.

ORCID iD

Kevin C Chung

Supplementary material

Supplemental material for this article is available online.

References

Baer, Gaudino, Charlson, Fremes, Wells MT. Fragility indices for only sufficiently likely modifications. Proc Natl Acad Sci USA. 2021, 118: 1–12.

Bell, Fiero, Horton, Hsu CH. Handling missing data in RCTs; a review of the top medical journals. BMC Med Res Methodol. 2014, 14: 118.

Bragg, Ruelos VCB, McIntyre, et al. Reverse fragility index comparing rates of rerupture after open Achilles tendon repair versus early functional rehabilitation: a systematic review of randomized controlled trials. Am J Sports Med. 2024, 52: 1116–21.

Chin, Copeland, Gallo, et al. The fragility of statistically significant randomized controlled trials in plastic surgery. Plast Reconstr Surg. 2019, 144: 1238–45.

Chung, Kalliainen, Spilson, Walters, Kim HM. The prevalence of negative studies with inadequate statistical power: an analysis of the plastic surgery literature. Plast Reconstr Surg. 2002, 109: 1–6; discussion 7–8.

Dettori JR. Loss to follow-up. Evid Based Spine Care J. 2011, 2: 7–10.

Fahmy, Colwell, Chung KC. p-Value reporting and reliability in plastic and reconstructive surgery: a primer for readers and investigators. Plast Reconstr Surg. 2024.

Forrester, McCormick, Bonsignore-Opp, et al. Statistical fragility of surgical clinical trials in orthopaedic trauma. J Am Acad Orthop Surg Glob Res Rev. 2021, 5.

Freiman, Chalmers, Smith Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 ‘negative’ trials. N Engl J Med. 1978, 299: 690–4.

Herndon, McCormick, Gazgalis, Bixby, Levitsky, Neuwirth AL. Fragility index as a measure of randomized clinical trial quality in adult reconstruction: a systematic review. Arthroplast Today. 2021, 11: 239–51.

Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005, 294: 218–28.

Khan, Fonarow, Friede, et al. Application of the reverse fragility index to statistically nonsignificant randomized clinical trial results. JAMA Netw Open. 2020, 3: e2012469.

Oeding, Ayeni, Senorski, Zaffagnini, Grassi, Samuelsson. Are orthopaedic randomized controlled trials as statistically fragile as portrayed? A call for improved interpretation of the statistical fragility index. J Exp Orthop. 2024, 11: e12042.

Rohrich, Agrawal, Savetsky, Avashia, Chung KC. When is science significant? Understanding the p value. Plast Reconstr Surg. 2020, 146: 939–40.

Ruzbarsky, Khormaee, Daluiski. The fragility index in hand surgery randomized controlled trials. J Hand Surg Am. 2019, 44: 698.e1–e7.

Samargandi, Al-Taha, Moran, Al Youha, Bezuhly. Why the p value alone is not enough: the need for confidence intervals in plastic surgery research. Plast Reconstr Surg. 2018, 141: 152e–62e.

Thiese, Ronna, Ott. P value interpretations and considerations. J Thorac Dis. 2016, 8: E928–31.

Thoma, Farrokhyar, Bhandari, Tandan. Users’ guide to the surgical literature. How to assess a randomized controlled trial in surgery. Can J Surg. 2004, 47: 200–8.

Walsh, Srinathan, McAuley, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index. J Clin Epidemiol. 2014, 67: 622–8.

Walsh, Devereaux, Sackett DL. Clinician trialist rounds: 28. When RCT participants are lost to follow-up. Part 1: why even a few can matter. Clin Trials. 2015, 12: 537–9.

Wasserstein, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Statist. 2016, 70: 129–33.

Yendluri, Chiang, Linden, et al. The fragility of statistical findings in the reverse total shoulder arthroplasty literature: a systematic review of randomized controlled trials. J Shoulder Elbow Surg. 2024, 33: 1650–8.

Zaniletti, Devick, Larson, Lewallen, Berry, Maradit Kremers. p-Values and power in orthopedic research: myths and reality. J Arthroplast. 2022, 37: 1945–50.
