Abstract
Study Design
Multicenter study with prospectively collected data.
Objectives
This study investigates the relationship between individual score components and treatment success, defined by achieving Minimal Clinically Important Difference (MCID) thresholds in functional outcomes, using multicenter prospectively collected data. This work aimed to optimize the OF-Score while maintaining its original structure. By using outcome-oriented data, we refined this established decision-support tool to improve its predictive accuracy for successful clinical outcomes.
Methods
Data from 518 patients from the EOFTT study with osteoporotic vertebral fractures were analyzed. Only patients with clinically successful outcomes, defined by improvement beyond MCID thresholds in functional scores after conservative or surgical treatment, were selected from this cohort. Optimization was performed using a data-driven reweighting approach combined with structured clinical expert evaluation, adjusting variable weights within predefined limits to improve alignment with successful therapies.
Results
The subset data of 374 successfully treated patients were analyzed. Before optimization, the OF-Score showed an accuracy of 73%, with pain and mobilization being the most important parameters. After optimization, the OF-Score showed an accuracy of 80.7% (sensitivity 85.2%, specificity 71.2%). The number of nonapplicable (indifferent) therapy recommendations dropped from 144 (37%) to 72 (19%). A threshold of 5 points provided optimal discrimination between conservative treatment (≤5) and surgical treatment (>5).
Discussion
The optimized OF-Score, through targeted weight adjustments, improves the alignment between clinical recommendations and treatment success. This refinement enhances the score’s predictive accuracy for treatment responders while maintaining its simple, practical structure.
Introduction
Original Weights of the OF-Score and Limits Applied for Optimization. The Table Summarizes the Original Scoring System and the Weight Limits of the Variables in the Optimization Process. Only Integer Values Within These Limits Were Evaluated.
While 71% of patients received treatment in line with OF-Score recommendations, nearly one-third were treated contrary to the score’s guidance.4 The OF-Score showed a potential theoretical accuracy of 80%. Moreover, the OF-Score and its component variables demonstrate strong validity and reliability.5,6 Some studies confirmed the score’s utility, while others raised concerns about its robustness.5,7,8
There is potential to further refine the clinical utility of the OF-Score. Previous evaluations mainly examined concordance between the recommended and applied therapy, rather than actual patient outcomes.3,5,9-11 The OF-Score has not yet been re-evaluated retrospectively in terms of treatment success. However, prior studies have shown that adherence to OF-Score recommendations is associated with significant improvements in patient outcomes.
In addition, the functional outcome and changes in functional outcome measures over time are essential when re-evaluating the clinical performance of a score. To distinguish statistically detectable from clinically meaningful improvements in patient-reported outcomes such as the Oswestry Disability Index (ODI), the concept of the Minimal Clinically Important Difference (MCID) is commonly applied and well-established in spine research.12-15
This study aims to assess the influence of individual OF-Score components on treatment choice and success, assessed by applying MCID thresholds for relevant outcome measures (ODI, VAS, Barthel Index, EQ5D-5L). Based on these findings, score modifications will be sought to improve agreement between the OF-Score therapy recommendation and clinically successful treatment, without fundamentally changing the score’s structure or adding new variables. The analysis focuses on patients with successful outcomes following either conservative or surgical therapy.
Methods
Data from 518 patients with osteoporotic spine fractures enrolled in the prospective, observational multicenter EOFTT study were used to evaluate the OF-Score and its parameters for therapy recommendations.3,4 By focusing exclusively on patients with clinically successful outcomes, this analysis seeks to identify the specific factors that drive effective treatment pathways, rather than merely documenting failures. The OF-Score uses key parameters, including fracture morphology (OF classification), severity of osteoporosis, deformity progression, pain, fracture-related neurological deficits, the ability to mobilize without assistance, and overall health status, to determine treatment recommendations for patients with osteoporotic vertebral fractures. Patients were evaluated at baseline prior to the treatment decision, at discharge, and at follow-up at least 6 weeks after treatment. Scheduled follow-ups were to be conducted at 6 weeks, 12 weeks, 6 months, and 12 months. The most recent available follow-up data were used for the analyses.
Successful therapy was defined based on the clinical functional outcomes (ODI, VAS-pain, Barthel index, EQ5D-5L) at the time of the follow-up examination, as follows:
The threshold for minimal impairment was set at 20% for the ODI,16 and for pain, a VAS score ≤3 was used.17,18
The Barthel index measures the functional independence of patients, specifically in terms of mobility and daily activities. Patients with a Barthel index score between 80 and 100 have only minimal functional limitations, as they are generally able to perform most daily activities independently, including self-care, mobility, and using the toilet, with occasional support if needed.19
The EuroQol Group (2015) outlines the detailed procedures for calculating and interpreting EQ5D-5L scores, including the translation of personal profiles into index values.20 It is implied that index values between 0.7 and 1.0 represent health states that are not perfect but are still relatively good with minimal limitations.20 The threshold for the EQ5D-5L index value was set at 0.7, and the threshold for the self-reported health status (EQ5D-VAS) was set at 70 or better.
Patients who showed only minor impairments in the ODI, low pain scores, and minimal limitations in the Barthel index or EQ5D-5L on the day of treatment decision and maintained these levels throughout the follow-up period were included. Additionally, patients who improved by the minimal clinically important difference (MCID) in these outcome variables were also included.
The MCID was calculated for all outcome variables from the values recorded on the day of treatment decision using a distribution-based method,21,22 as no validated anchor measures were available in this secondary analysis of a multicenter observational cohort. This method recommends a factor for the standard deviation of 0.3 for conservative estimates and 0.5 for moderate estimates of the MCID.23 For our study, we used a factor of 0.5. A stricter MCID threshold was deliberately chosen to ensure that only clinically meaningful improvements were classified as treatment success. This corresponds to a medium effect size according to Cohen’s guideline (d=0.5), meaning that the patients needed to show a more substantial improvement to be considered as having a clinically detectable change.23,24
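The distribution-based calculation described above can be sketched in a few lines. This is a minimal illustration, not the study's actual computation: the function name `distribution_based_mcid` and the example values are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import stdev


def distribution_based_mcid(baseline_values, factor=0.5):
    """Return factor * standard deviation of the baseline values.

    factor=0.5 mirrors the moderate estimate used in the text;
    the sample standard deviation is an assumption.
    """
    if len(baseline_values) < 2:
        raise ValueError("at least two baseline values are needed")
    return factor * stdev(baseline_values)


# Hypothetical baseline ODI values in percent (not study data).
example_odi = [40.0, 52.0, 60.0, 48.0, 70.0]
mcid_odi = distribution_based_mcid(example_odi)
print(f"MCID for the example ODI values: {mcid_odi:.2f}")
```

With a factor of 0.3 instead of 0.5, the same function yields the smaller value that the text describes as the conservative estimate.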
In summary, data were analyzed from patients who were considered successfully treated if they fulfilled one of two criteria concepts. First, patients with only minor impairment at baseline (ODI <20%, VAS pain ≤3, Barthel index >80, EQ5D-5L index >0.7, EQ5D-VAS >70) who maintained these levels throughout follow-up were included. Second, patients with relevant baseline impairment were included if they achieved clinically meaningful improvement during follow-up based on the minimal clinically important difference (MCID) of the respective outcome variables. Patients were only included if they met the criteria for at least four out of these five outcome variables. Additionally, only patients who showed an improvement equal to or greater than the MCID in all outcome variables were included. In other words, only patients who either improved or maintained a relatively good functional level were included; patients who deteriorated were excluded.
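The two criteria concepts can be expressed as a small classification routine. The sketch below is one possible reading, not the study's implementation: it applies the thresholds from the summary above, treats a variable as fulfilled if the good level was maintained or the change reached the MCID, and uses the four-of-five rule. All names (`Outcome`, `criterion_met`, `is_success`) and all MCID values in the example are hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Outcome:
    name: str
    threshold: float
    is_good: Callable[[float, float], bool]  # compares a value with the threshold
    higher_is_better: bool


# Thresholds as listed in the summary paragraph.
OUTCOMES = [
    Outcome("ODI", 20.0, operator.lt, False),
    Outcome("VAS", 3.0, operator.le, False),
    Outcome("Barthel", 80.0, operator.gt, True),
    Outcome("EQ5D_index", 0.7, operator.gt, True),
    Outcome("EQ5D_VAS", 70.0, operator.gt, True),
]


def criterion_met(outcome, baseline, followup, mcid):
    """True if the good level was maintained or the improvement reached the MCID."""
    maintained = (outcome.is_good(baseline, outcome.threshold)
                  and outcome.is_good(followup, outcome.threshold))
    gain = followup - baseline if outcome.higher_is_better else baseline - followup
    return maintained or gain >= mcid


def is_success(baseline, followup, mcid, min_met=4):
    """Apply the assumed four-of-five rule across the outcome variables."""
    met = sum(
        criterion_met(o, baseline[o.name], followup[o.name], mcid[o.name])
        for o in OUTCOMES
    )
    return met >= min_met


# Hypothetical MCID values and one hypothetical patient (not study data).
mcid = {"ODI": 9.0, "VAS": 1.2, "Barthel": 10.0, "EQ5D_index": 0.15, "EQ5D_VAS": 10.0}
baseline = {"ODI": 50.0, "VAS": 7.0, "Barthel": 60.0, "EQ5D_index": 0.4, "EQ5D_VAS": 40.0}
followup = {"ODI": 30.0, "VAS": 3.0, "Barthel": 85.0, "EQ5D_index": 0.5, "EQ5D_VAS": 60.0}
print(is_success(baseline, followup, mcid))
```

In this example, four of the five variables improve by at least the assumed MCID, so the patient counts as successfully treated; a patient whose values do not change from an impaired baseline does not.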
Treatments were categorized as conservative, cement augmentation only, or instrumentation.
Statistical Methods
Distributional differences between the OF-Score recommendation and the therapy performed were examined using chi-squared tests. Differences in OF-Score variables and baseline functional outcomes between conservatively and surgically treated patients were analyzed using a general linear model for ordinal and continuous variables, and Fisher’s exact test for nominal data.
Binary discriminant analysis was applied to identify and compare variables using standardized canonical discriminant coefficients. High coefficients, whether positive or negative, indicate a strong contribution to group separation, while values near zero suggest minimal influence. A logistic regression model was used to assess the total explanatory power of the OF-Score variables on the day of the treatment decision. The logistic regression model quality was assessed using Nagelkerke’s pseudo-R².
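For readers unfamiliar with the measure, Nagelkerke's pseudo-R² rescales the Cox and Snell R² so that its maximum is 1. The sketch below shows the standard formula computed from the log-likelihoods of the intercept-only and the fitted model; the function name and the example values are hypothetical and unrelated to the study data.

```python
import math


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's pseudo-R² from two log-likelihoods and the sample size.

    ll_null:  log-likelihood of the intercept-only model
    ll_model: log-likelihood of the fitted model
    n:        number of observations
    """
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Hypothetical log-likelihoods (not study data).
r2 = nagelkerke_r2(ll_null=-100.0, ll_model=-60.0, n=150)
print(f"Nagelkerke pseudo-R2: {r2:.3f}")
```

A model that does not improve on the intercept-only model yields 0, and a model with perfect fit (log-likelihood of 0) yields 1.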
Optimization Process
The OF-Score was optimized through four sequential phases to determine the ideal weighting of its variables. Importantly, the final reduction to a streamlined set of key parameters was not a predefined constraint but emerged as a direct result of this data-driven optimization process. These phases comprised: (1) identification of variables with the highest discriminatory power for the performed therapy, (2) narrowing of the permissible weight ranges, (3) systematic evaluation of weight combinations, and (4) selection of the optimal solution. The optimization aimed to minimize misclassification rates and to eliminate cases resulting in ambiguous treatment recommendations.
Phase 1: Identification of Variables With the Highest Discriminatory Power
A discriminant analysis was performed to identify the variables with the greatest influence on treatment decision-making. Standardized canonical discriminant coefficients were used to determine the variables with the highest discriminatory power and to guide the subsequent narrowing of parameter ranges.
Phase 2: Narrowing of the Permissible Weight Ranges
In a second step, Microsoft Excel Solver (version 2021) was used to narrow the weight limits of the optimized OF-Score while preserving its basic structure. By restricting both the variables and their weighting ranges in advance, the number of possible solutions was reduced to a manageable range for subsequent analyses. For each of the previously identified variables, upper and lower bounds for integer weighting were defined. The optimization process considered only integer values within these predefined limits. The final applied bounds are presented in Table 1. All variables of the original OF-Score were retained for calculation of the total score and its resulting treatment recommendation. However, only the weightings of OF classification, pain, and mobility were varied, as these variables demonstrated the highest discriminatory power. The remaining variables were included using their original weights.
Phase 3: Systematic Evaluation of Weight Combinations
In Phase 3, all possible combinations of weights within the limits established for OF classification, pain, and mobility were systematically evaluated, resulting in a total of 60,000 solutions. For each solution, the OF-Score and its corresponding treatment recommendation were calculated for each patient and compared with the treatment actually performed. Accuracy, number of misclassified cases, sensitivity, and specificity were assessed for each weight combination.
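The exhaustive evaluation in Phase 3 can be illustrated with a short grid search. This is a sketch under stated assumptions, not the procedure used in the study: the weight bounds, the patient coding (`Patient`), the way the OF classification enters the score as a multiplied weight, and the fixed cut-off of 5 points are all assumptions made for illustration, and the toy patients are invented.

```python
from itertools import product
from typing import NamedTuple


class Patient(NamedTuple):
    of_class: int      # OF classification, 1..5
    pain_high: bool    # pain above the score's pain threshold
    immobile: bool     # unable to mobilize without assistance
    other_points: int  # remaining items, summed at their original weights
    surgery: bool      # treatment actually performed


def of_score(p, weights):
    """Total score for one patient under a candidate weight combination."""
    w_of, pain_lo, pain_hi, mob_ok, mob_no = weights
    return (w_of * p.of_class
            + (pain_hi if p.pain_high else pain_lo)
            + (mob_no if p.immobile else mob_ok)
            + p.other_points)


def evaluate(patients, weights, cutoff=5):
    """Compare the recommendation (score > cutoff means surgery) with the therapy performed."""
    tp = fp = tn = fn = 0
    for p in patients:
        recommended_surgery = of_score(p, weights) > cutoff
        if recommended_surgery and p.surgery:
            tp += 1
        elif recommended_surgery:
            fp += 1
        elif p.surgery:
            fn += 1
        else:
            tn += 1
    return {
        "weights": weights,
        "accuracy": (tp + tn) / len(patients),
        "misclassified": fp + fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def grid_search(patients, bounds, cutoff=5):
    """Evaluate every integer weight combination within inclusive (low, high) bounds."""
    ranges = [range(low, high + 1) for low, high in bounds]
    return [evaluate(patients, w, cutoff) for w in product(*ranges)]


# Hypothetical bounds (w_of, pain_lo, pain_hi, mob_ok, mob_no) and toy patients.
BOUNDS = [(1, 3), (-5, -1), (1, 5), (-5, -1), (1, 5)]
TOY_PATIENTS = [
    Patient(2, False, False, 0, False),
    Patient(4, True, True, 1, True),
    Patient(3, True, False, 0, True),
    Patient(3, False, True, -1, False),
]

results = grid_search(TOY_PATIENTS, BOUNDS)
best = max(results, key=lambda r: r["accuracy"])
print(len(results), best["weights"], best["accuracy"])
```

Restricting the search to integer weights within narrow bounds keeps the number of combinations small enough to enumerate completely, so no iterative optimizer is needed at this stage.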
Phase 4: Selection of the Optimal Solution
In the final phase, all solutions were reviewed and filtered, and the optimal solution was selected based on three predefined criteria: (1) overall accuracy, (2) a low number of misclassified patients, and (3) simplicity and feasibility for clinical use.
Following an initial Delphi-based consensus process, it was agreed to maintain the OF-Score as a simple and user-friendly tool. Therefore, the weights assigned to pain and mobility were required to be symmetrical (e.g., −4 and +4 or −2 and +2). Combinations with asymmetrical weights (e.g., −3 and +4 or −5 and +2) were excluded. The final weighting scheme was subsequently reviewed and confirmed by expert consensus within the study group, ensuring both statistical robustness and clinical feasibility.
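The symmetry constraint and the ranking by the first two criteria can be sketched as a simple filter. The code below is an illustration only: it assumes each candidate solution carries its negative and positive weights for pain and mobility, reads "symmetrical" as mirror-image weights within each variable, and uses an invented list of candidates. In the study, the final choice was made by expert consensus rather than by an automatic ranking.

```python
def is_symmetric(pain_lo, pain_hi, mob_ok, mob_no):
    """True if the negative and positive weights mirror each other for both variables."""
    return pain_lo == -pain_hi and mob_ok == -mob_no


def shortlist(candidates):
    """Keep symmetric solutions, ordered by accuracy (descending), then misclassifications."""
    symmetric = [
        c for c in candidates
        if is_symmetric(c["pain_lo"], c["pain_hi"], c["mob_ok"], c["mob_no"])
    ]
    return sorted(symmetric, key=lambda c: (-c["accuracy"], c["misclassified"]))


# Hypothetical candidate solutions (not study results).
candidates = [
    {"pain_lo": -4, "pain_hi": 4, "mob_ok": -2, "mob_no": 2, "accuracy": 0.78, "misclassified": 80},
    {"pain_lo": -3, "pain_hi": 4, "mob_ok": -2, "mob_no": 2, "accuracy": 0.81, "misclassified": 70},
    {"pain_lo": -5, "pain_hi": 5, "mob_ok": -4, "mob_no": 4, "accuracy": 0.80, "misclassified": 75},
]

for c in shortlist(candidates):
    print(c)
```

In this invented example, the second candidate has the highest accuracy but is dropped because its pain weights are asymmetrical, which mirrors the trade-off between accuracy and simplicity described above.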
Descriptive and inferential statistical analyses were performed using IBM SPSS Statistics Version 29.0 (IBM Corp., Armonk, NY, USA), with a significance level of p<0.05.
Results
Descriptive Data of the Analyzed Patients and the Variables Used in the OF-Score and the Performed Therapy. The Percentages Represent the Proportions Within the Respective Groups. Pain Values are Reported as Mean and Standard Deviation.
Descriptive Statistics of Functional Outcomes Including Mean, Standard Deviation, Range, and Corresponding MCID at the Time of Treatment Decision in Patients With Osteoporotic Spine Fractures.
Treatment and OF-Score
Of the 374 patients included, 310 (82.9%) received a definitive treatment recommendation according to the OF-Score, comprising 128 patients (34.2%) with a recommendation for conservative treatment and 182 patients (48.7%) with a recommendation for surgical treatment. In 64 patients (17.1%), the OF-Score yielded an indifferent treatment recommendation. Overall, 223 of the 310 patients with a definitive recommendation (71.9%) were treated in accordance with the OF-Score recommendation. The sensitivity and specificity of the OF-Score for predicting the treatment performed were 72.5% and 70.7%, respectively. A significant association between OF-Score recommendation and the treatment ultimately performed was observed (phi coefficient=0.409, p<0.001). When including patients with an indifferent recommendation, the overall accuracy of the OF-Score for predicting successful treatment allocation was 59.6%. Among patients with a conservative recommendation, 70 of 128 patients (54.7%) received adherent conservative treatment. In contrast, adherence was higher in patients with a surgical recommendation, with 153 of 182 patients (84.1%) undergoing surgical treatment as recommended. In total, 151 patients (40.4%) received either a non-adherent treatment or had an indifferent OF-Score recommendation.
The mean OF-Score was 6.5±2.5 (range 0–14) and differed significantly between conservatively (4.8±2.4) and surgically (7.3±2.2) treated patients (p<0.001). Post hoc analysis by treatment subgroup also showed significant differences, with mean scores of 4.8±2.4 for conservative treatment, 6.2±1.8 for cement augmentation, and 7.9±2.2 for instrumentation (each p<0.001). Seven patients with neurological deficits underwent surgical treatment. One additional patient showed radicular neurological symptoms, which improved with conservative management despite a surgical treatment recommendation. On the day of treatment decision, the conservatively treated patients showed less pain and better functional outcome (p<0.001; Table 3).
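The reported percentages follow directly from the counts given above, as the short calculation below shows. It treats surgery as the positive class, which is an assumption made for this illustration, and the variable names are hypothetical.

```python
import math

# Counts from the paragraph above (patients with a definitive recommendation).
tp = 153              # surgery recommended and performed
fp = 182 - 153        # surgery recommended, conservative treatment performed
tn = 70               # conservative treatment recommended and performed
fn = 128 - 70         # conservative treatment recommended, surgery performed
n_indifferent = 64    # indifferent recommendation

n_definitive = tp + fp + tn + fn
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
adherence = (tp + tn) / n_definitive
overall_accuracy = (tp + tn) / (n_definitive + n_indifferent)
phi = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"sensitivity      {sensitivity:.1%}")
print(f"specificity      {specificity:.1%}")
print(f"adherence        {adherence:.1%}")
print(f"overall accuracy {overall_accuracy:.1%}")
print(f"phi              {phi:.3f}")
```

The difference between 71.9% and 59.6% arises only from the denominator: the first figure refers to the 310 patients with a definitive recommendation, the second to all 374 patients.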
Standardized Canonical Discriminant Coefficients, Sorted by Their Value.
The logistic regression model achieved an overall accuracy of 80.7% in predicting the type of therapy performed (Nagelkerke’s pseudo-R²=0.548), with a sensitivity of 87.4% and a specificity of 69.5%. The logistic regression model demonstrated moderate explanatory power with a Nagelkerke’s pseudo-R² of 0.594 (p<0.001). Using the three variables with the highest absolute standardized canonical discriminant coefficients, the logistic regression yielded an accuracy of 80.4%, with a Nagelkerke’s pseudo-R² of 0.573 (p<0.001).
OF-Score Optimization
Results of the Optimized OF-Score, After the Change of Weight for Pain, Mobility, Using a Cut off Value in the OF-Score of 5 Points (≤5 Points: Conservative Treatment Recommendation; >5 Points: Surgical Recommendation).
Comparison Between the Original and Optimized OF Score. The Optimized Score Includes Only the Key Variables OF Classification, Pain, Mobility, and Fracture-Related Neurological Deficit, Whereas All Other Variables Were Omitted. Optimization Yielded a Cut-Off Value of ≤5 Points for Recommending Conservative Treatment and >5 Points for Recommending Surgical Treatment.
Discussion
A specific and individualized therapy for OVFs with predictable results would be desirable. The OF-Score has been developed to guide surgeons in their decision-making process. The aim of the study was to improve the accuracy of the OF-Score therapy recommendation while simultaneously reducing the number of patients with incorrectly predicted therapy and eliminating the group of patients with indifferent therapy recommendations. Patients who had undergone successful therapy served as the study population. After optimization of the OF-Score, two variants showed a significant improvement in accuracy. The achieved accuracy approaches the theoretical maximum of the logistic regression, while retaining simplifications such as symmetry of the range of values, reduction of the number of variables, and elimination of the indifferent therapy recommendation. The variables with the greatest influence on the treatment decision are the OF classification, pain, mobility, and neurological deficits, as shown recently [23]. It is particularly noteworthy that, despite reducing the OF-Score to four variables (OF classification, pain, mobility, and fracture-related neurological deficit), a higher accuracy can be achieved relative to the original OF-Score.
The elimination of the group with an indifferent recommendation is a clinically relevant advance. This increases the selectivity and facilitates the use of the OF-Score, especially for less experienced practitioners. Previous studies had shown that adherence to the OF-Score recommendation was not always guaranteed in clinical practice [1, 24]. With improved accuracy, the score can now be used more consistently as a basis for decision-making.
A large proportion of the EOFTT cohort could be included in this study, which may indicate a generally successful therapeutic approach, even though the predictive accuracy of the OF-Score was initially limited. However, it should also be noted that 28% of patients from the EOFTT cohort did not meet the inclusion criteria of our study. A detailed analysis of this subgroup was not performed. Some of these patients were excluded due to missing functional outcome parameters at follow-up, leaving it unclear whether their treatment was successful or not. Additionally, it is reasonable to assume that some patients were not successfully treated. In future analyses, it would be of interest to identify factors associated with treatment failure.
A methodological aspect of our work is the use of the minimal clinically important difference (MCID) as a criterion for evaluating treatment success. For the ODI, thresholds between 12 and 18 points have been described in the literature,25-27 and for back pain between 1.2 and 4 points.12,25 There are also established MCID thresholds for EQ-5D, but these vary depending on the population.28 Using the MCID enabled us to ensure that not only statistically significant changes, but also clinically relevant changes were evaluated as therapeutic success. In addition, the concept of patient acceptable symptom state is becoming increasingly important, describing an acceptable symptom state at ODI values of 25 or lower.25,29 Distribution-based methods for determining the MCID, such as those based on standard deviation or effect size, are well-established in studies involving patients with spinal conditions.21,22 This focus on MCID could further enhance the clinical usability of future score applications. In our work, we also provide MCID values for patients with osteoporotic vertebral fractures for the ODI, pain, and the EQ5D-5L to improve future threshold values for this patient group. Nevertheless, MCID thresholds should be interpreted with caution, as they are influenced by baseline severity, follow-up duration, and the methodological approach used for their derivation.
Pain emerged as the most influential factor in therapy decision-making within our multicenter cohort. The assignment of ±5 points in the OF-Score based on whether the pain threshold is exceeded or not represents a substantial weighting. Although this threshold and its weight were derived from our data and are statistically supported, they rest on a highly subjective parameter, which may limit their generalizability. Nevertheless, pain is widely used to describe functional impairment, has proven to be a meaningful and clinically relevant measure in various settings,5,30-32 and is commonly integrated into clinical decision-support tools. Moreover, pain was assessed in general and not in relation to physical activity or load, which limits the clinical interpretability of the results. Future studies should aim to differentiate between resting and activity-related pain to allow for more nuanced evaluations. In addition, external validation in independent cohorts is necessary, especially considering possible differences in pain perception across populations. While the comprehensive methodological approach enhances reproducibility, the resulting complexity may itself constitute a limitation, potentially affecting the feasibility of implementation in everyday clinical practice.
Nevertheless, around 20% of patients were not correctly identified in the predicted therapy. This, together with the results of the logistic regression and discriminant analysis, indicates influencing factors that are not yet captured by the OF-Score. Psychosocial factors could play a central role here: anxiety and depression are common in patients with osteoporotic fractures and are associated with poorer functional outcomes in the ODI, EQ5D and Barthel index.10,33,34 A prospective analysis showed that anxiety during hospitalization correlates with significantly poorer functional scores after therapy.35 These findings suggest that taking psychosocial variables into account could further improve the predictive accuracy of the OF-Score. Such factors could also be considered in the therapy process to improve patient outcomes.
In addition to the improvement in the OF-Score achieved through a targeted combination of methods, there are limitations to our study. The number of patients with very mild (OF1) or very severe fractures (OF5) was low, so no reliable conclusions can be drawn for these subgroups. OF3 fractures occurred significantly more frequently, so the decision between conservative and surgical treatment in general, or for a specific surgical treatment, may be correspondingly variable, a factor that was not analyzed within the scope of this study. A fundamental selection bias is possible, as patients with very mild symptoms may not have presented to the participating clinics because they were treated in outpatient practices. This discrepancy highlights the difference between real-world treatment patterns and outcome-based optimization and underscores the exploratory nature of the proposed modification. Nevertheless, despite the non-randomized treatment allocation, the majority of patients included in this study achieved clinically meaningful improvement or maintained a good functional status according to the predefined outcome thresholds. This does not necessarily imply that the performed treatment represented the optimal strategy for every individual patient, but it indicates that the applied treatment pathways were associated with generally favorable short- to mid-term outcomes within the investigated cohort. The follow-up period ranged from a minimum of 6 weeks to a maximum of more than one year, which can lead to bias. This should be taken into account, especially when evaluating long-term effects such as re-fractures, sagittal balance, or late complications. A further differentiation between augmentation procedures and instrumented stabilization was beyond the scope of this study but represents an important area for future investigation.
The OF-Score’s generalizability is limited, as validation to date has focused primarily on Central European populations.2,3 Moreover, patients with clinical deterioration or unsuccessful treatment outcomes were excluded by study design, which may further limit the generalizability of the findings. Comparative analyses between successfully and unsuccessfully treated patients may provide additional insights into predictors of treatment failure, but this was beyond the scope of the present optimization study. As optimization and evaluation of the score were performed within the same cohort, overfitting cannot be excluded. Therefore, the optimized score should be regarded as exploratory and requires validation in independent patient cohorts. In our study, the follow-up period showed considerable variability, which may have affected the assessment of clinically meaningful improvement. Long-term outcomes and complications, such as adjacent segment degeneration or junctional problems in instrumented cases, were not assessed in this study and require dedicated long-term follow-up studies.
Ultimately, the decision on treatment is always part of a participatory process in which patient preferences and individual expectations can be taken into account.36,37 Deviations from the score recommendation are, therefore, unavoidable and should not necessarily be considered misclassifications.
Despite these limitations, our study demonstrates that the OF-Score achieves significantly improved accuracy and practicality after adjustment. The OF-Score has demonstrated its clinical utility, as it has guided the successful treatment of many patients, requiring only minimal modifications to enhance its precision. This refinement represents the next step in the development of the OF-Score, improving accuracy and usability while preserving its core principles. The optimized OF-Score supports treatment decisions aligned with clinical guidelines and adapted to the individual patient. So far, there is no other validated algorithm that helps to choose between conservative and surgical treatment in these patients, and the available evidence is based on comparable methods. A special aspect is the methodical use of MCID values. For the first time, the recommendation of the OF-Score is evaluated by results that are important for the patient. This can help to avoid unnecessary surgeries, lower costs, and improve recovery. These effects are important not only for each patient but also for the healthcare system. Better decisions can prevent wrong treatments, reduce the risk of needing care, and support return to daily life. While the current structure of the OF-Score remains unchanged for now, the proposed weight adjustments require prospective validation in an independent patient cohort. Future studies will determine if these refinements consistently improve outcomes before a formal update to the decision-support tool is implemented. At the same time, our analysis also points to existing inequalities in care, for example between outpatient and inpatient care or in international contexts. Future studies and the use of the score in practice should look at these problems more closely.
Conclusion
Our analysis of successfully treated patients revealed that pain and mobilization were the most important parameters of the OF-Score. Higher pain levels, higher OF classification, and reduced mobility were associated with a higher likelihood of surgical treatment. After optimization, the OF-Score showed an accuracy of 80.7% (sensitivity 85.2%, specificity 71.2%). The optimized OF-Score retained its simple structure while eliminating the indifferent treatment recommendation.
Footnotes
Acknowledgments
The authors thank all participating centers of the EOFTT study and the members of the Working Group Osteoporotic Fractures of the German Society of Orthopedics and Trauma for their contribution and expertise.
Ethical Considerations
The EOFTT study was conducted in accordance with the Declaration of Helsinki and approved by the responsible institutional ethics committees of the participating centers.
Consent to Participate
Written informed consent was obtained from all patients prior to inclusion. The processing number of the ethics committee of the lead clinical center is: Ethics Committee of the Medical Association Saxony-Anhalt, file 31/17.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data supporting the findings of this study are available upon reasonable request to the corresponding author and after consultation with the Working Group Osteoporotic Fractures, Spine Section of the German Society of Orthopedics and Trauma.
Anonymity Statement
All identifying information related to authors, institutions, and ethics committees is provided on this title page and has been removed from the blinded manuscript.
