A two-stage framework by leveraging large language model for predicting clinical trial outcomes

Abstract

Clinical trials are essential for discovering new treatments and advancing medical knowledge. However, the high uncertainty involved in carrying out clinical trials often leads to ineffective results. Therefore, the accurate prediction of clinical trial outcomes has become a significant challenge. Numerous publicly accessible clinical trial reports have been found to be beneficial in alleviating this challenge, but they lack the annotations needed to serve as formal datasets for deep model training. To address this issue, this paper constructs a new clinical trial dataset by extracting publicly available clinical trial reports from ClinicalTrials.gov and PubMed. In addition, a new two-stage method is proposed for the prediction of clinical trial outcomes across all trial phases. Specifically, our method first combines a prompt template with each clinical trial report to prompt a large language model to generate a concise summarization text containing essential information related to the clinical trial outcome. Subsequently, this summarization text is utilized to train a classifier to predict the outcomes. Extensive experiments were conducted on the dataset, and our method was compared with several state-of-the-art classification models. The results showed that our method achieved the best performance in predicting clinical trial outcomes, especially when using small amounts of training data under data imbalance.

Keywords

clinical trial; large language model; text classification

1. Introduction

A clinical trial (CT) is an essential procedure in advancing clinical research. It involves human volunteers or patients who are verified to meet the recruitment criteria set by hospitals or research institutions.^1,2 The objective of a CT is to evaluate the safety and efficacy of a new intervention (e.g., a drug or a medical device) for a specific target disease. Conducting clinical trials is highly costly and time-consuming. According to statistics, evaluating a new cancer-treatment drug may cost over $2 billion and take more than ten years on average to complete all trial phases.³ However, uncertainties arising from factors such as drug safety and trial protocol design issues significantly increase the risk of losing these substantial investments.^4,5 Hence, preventing trial failures and reducing the risk of loss are crucial issues in clinical trial research. At the same time, a vast amount of clinical data is available due to the rapid expansion of clinical trial records and the surge in published scientific literature. This presents an opportunity for machines to learn invaluable clinical insights by effectively leveraging such a wealth of information.

Earlier attempts sought to improve the prediction of clinical success through various machine learning methods that rely on structured data from biomedical, chemical, or drug databases.^6,7 With the increasing digitization of clinical trial records and reports over the past decade, many online websites have been established to assist researchers in the clinical field. For instance, ClinicalTrials.gov is a website and online database containing clinical research studies and their outcomes. Similarly, PubMed is a freely accessible search engine that provides a wealth of clinical trial reports. These online data sources offer invaluable text data in the form of summaries or descriptions of clinical trials. However, only a small number of studies^8,9 have utilized them for modelling and learning. For example, Follett et al.⁸ extracted only the distinctive words that describe successfully completed or terminated trials from ClinicalTrials.gov as supplementary data to assist in quantifying the risk associated with clinical trial termination. This underutilization of abundant online data sources misses the opportunity to leverage existing resources for improving clinical trial research.

In recent years, researchers have been exploring online data as the primary source for improving clinical trial research.^1,2,10–12 HINT¹ utilized structured data from web-based sources (e.g., trial documents and medical codes), notably data from ClinicalTrials.gov, and captured correlations within interactive networks to enhance the prediction of trial outcomes. Luo et al.² extracted publicly available clinical trials from ClinicalTrials.gov to automatically estimate the status of a clinical trial and identify possible reasons for failure. While clinical trial data from ClinicalTrials.gov has been widely studied, leveraging clinical trial reports from PubMed for the clinical trial outcome prediction task has hardly been explored. Compared to the trial information provided by ClinicalTrials.gov, the clinical trial reports provided by PubMed usually offer a more comprehensive range of information. These reports provide details on trial design, implementation, and the interpretation of results. Moreover, clinical trial reports from PubMed have undergone peer review, which ensures that they possess a certain degree of scientific validity and credibility. This contrasts starkly with the potential lack of validation and review in other data sources, thereby reducing misleading information.

To bridge the research gap and accelerate the development of clinical trial outcome prediction, clinical trial records and reports available on two platforms are explored: ClinicalTrials.gov and PubMed. Each clinical trial on ClinicalTrials.gov has a unique identifier, known as the ClinicalTrials.gov ID, and a status that tracks the current trial stage (e.g., completed, terminated, or withdrawn). These IDs can serve as keywords to retrieve the corresponding clinical trial reports from PubMed, as shown in Figure 1(a). These reports contain detailed information about the clinical trial studies, including objectives, research background, assessment methods, and other relevant details, as shown in Figure 1(b). Intuitively, a learning model that can accurately predict trial outcomes from a large corpus of clinical trial reports can be a significant driving force for clinical trial research. However, the existing data^1,2 lacks annotated clinical trial reports, making it difficult to train models that take the reports as input to predict clinical trial outcomes. In addition, clinical trial reports have proved to be useful for clinical trial outcome prediction. Therefore, an elaborately annotated clinical trial report dataset is highly needed.

Figure 1.

(a) A simple example. A clinical trial report can be readily located within PubMed using the clinical trial identifier (ClinicalTrials.gov ID) acquired from ClinicalTrials.gov. (b) A brief diagram of a clinical trial report, describing the various components of a report and their roles.

To that end, this paper introduces a newly constructed dataset, named POCT (Prediction Outcomes of Clinical Trials), which consists of 8,466 successful and 2,797 failed annotated clinical trials. In addition, a two-stage method is proposed to predict clinical trial outcomes across all trial phases. Specifically, in order to filter out interfering information and shorten the text length, each clinical trial report is combined with a prompt template to ask a large language model to generate a short text summarizing the outcome of the associated clinical trial. Then, the generated summarization text is used to train a classifier to predict the outcome of the clinical trial.

The main contributions of this work are summarized as follows:

A new clinical trial dataset POCT that associates clinical trial reports and clinical trials is constructed for the first time for model learning.

A new method based on a two-stage strategy by leveraging a large language model to assist in fine-tuning a pre-trained model is proposed for predicting clinical trial outcomes.

Experiments comparing our method with state-of-the-art text classification and summarization baselines demonstrate that the two-stage method is effective.

2. Related work

2.1. Clinical trial outcome prediction

Clinical trial data analysis and mining is a crucial topic in clinical research, including clinical trial referential search,^13,14 clinical trial outcome prediction,^15–17 and automatic clinical trial matching.^18,19 For the task of clinical trial outcome prediction, early works^15–17 predicted the outcome of clinical trials by applying machine learning to expert-crafted features. For example, Wu et al.¹⁷ devised a two-stage classification approach based on SVM to identify genes and genetic lesion statuses from cancer clinical trial documents. Gayvert et al.¹⁵ utilized a random forest model to directly predict the likelihood of toxicity in clinical trials. However, these works primarily focused on patient-level clinical trial outcome prediction rather than offering a comprehensive forecast of overall trial success. To fill the gap, Qi et al.²⁰ predicted the pharmacokinetics of phase III by modelling subject-level data from phase II trials with a residual semi-recurrent neural network. At the same time, Lo et al.²¹ utilized statistical machine learning techniques to predict drug approvals by leveraging drug development and clinical trial data. Unlike previous work that focused solely on specific clinical phases, Fu et al.¹ utilized drug molecular characteristics and trial protocol information to predict outcomes for all clinical trial phases. Compared to other data sources that lack effective review, published clinical trial reports have undergone peer review, which ensures a level of validity and credibility. Consequently, utilizing PubMed reports to predict clinical trial outcomes reduces the risk of generating misleading predictions and thus enhances the reliability of predictions. Nonetheless, none of these prior efforts attempted to leverage clinical trial reports to predict trial outcomes.

2.2. Text classification and summarization

The task objective of text summarization is to generate concise and summarizing content from original text. Text summarization task typically encompasses two categories: extractive and abstractive. Extractive summarization^22–24 directly selected important sentences or paragraphs from source text to create a summary. Conversely, abstractive summarization^25–27 utilized natural language generation models to create entirely new summaries that were not limited to specific sentences in original text. Benefiting from the rapid development of pre-trained language models, fine-tuned language models specifically designed for the abstractive summarization task^28,29 made significant progress. In recent years, these two types of summarization models gained widespread applications in addressing various issues in the medical field, such as medical question answering,³⁰ electronic health record summarization,²² and medical fact generation.³¹ Katsimpras et al.¹¹ utilized an extractive summarization model for predicting intervention approval. However, these methods relied on high-quality source text-summary text pairs for training. To address this issue, a method that combined prompt learning with a large language model³² was adopted to generate abstract summaries. The goal of text classification was to assign text into predefined categories or labels. With the rapid development of pre-trained language models, deep learning methods^33,34 based on pre-trained language models (e.g., BERT,³⁵ RoBERTa³⁶) achieved promising results in short text classification. However, these models had limitations in long text scenarios due to the maximum input text length set during the pre-training stage. This made them unsuitable for direct application in long text classification. To address this challenge, some researchers^37,38 truncated inputs in a fixed way to meet the length limitation of model input. 
Meanwhile, Beltagy et al.³⁹ proposed to optimize the Transformer⁴⁰ by leveraging an attention mechanism that scaled linearly with sequence length. In the POCT dataset, the text length of clinical trial reports also far exceeds the maximum length that current classification models can handle. Therefore, a concise and effective two-stage classification framework is proposed to address the challenges of long text classification.

3. The dataset POCT

A new dataset POCT is constructed by associating clinical trial data and clinical trial reports to predict clinical trial outcomes. The construction procedure primarily consists of three steps: 1) Clinical trial selection. All clinical trials are obtained from ClinicalTrials.gov and then filtered by a pre-defined filtering strategy (Section 3.1) to select eligible trials. 2) Clinical trial report filtering. The corresponding reports are retrieved from PubMed based on the selected clinical trials, and non-conforming clinical trial reports are filtered out (Section 3.2). 3) Data association and annotation. The collected clinical trials and their corresponding reports are associated by NCT ID (ClinicalTrials.gov ID). The associated trial data is further annotated by referring to their outcomes through human verification (Section 3.3). The main steps of the construction of POCT are shown in Figure 2.

Figure 2.

The main procedure of the POCT construction.

3.1. Clinical trial selection

Clinical trial documents are initially retrieved from an API (https://classic.clinicaltrials.gov/api/gui) provided by ClinicalTrials.gov, and a series of selection filters is applied to select high-quality trials. To ensure the predictability of trial outcomes, the focus is on clinical trials of drug interventions, and trials meeting the following conditions are selected: 1) explicit trial status and 2) specified trial phase. As shown in Figure 3(a), clinical trials in ClinicalTrials.gov are labeled with various explicit trial statuses, including Completed, Terminated, Suspended, Recruiting, and others, whereas unknown or missing statuses are implicit and are treated as noise. There are also multiple trial phases, as shown in Figure 3(b), among which unspecified phases, such as unknown or missing, are likewise treated as noise. To study trials with outcomes, clinical trials with an explicit status of either Completed, meaning that the trial has ended normally, or Terminated, indicating an early stop without resumption, are chosen. Meanwhile, clinical trials whose phase is marked as Unknown are filtered out. After this filtering, a preliminary selection of 101,923 eligible trials is obtained from 472,425 candidate clinical trials. Each trial consists of the fields NCT ID (ClinicalTrials.gov ID), status, and phase.
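The selection filters described above can be sketched as a simple predicate over trial records. This is a minimal illustration rather than the authors' pipeline: the `Trial` record, its field names, the literal values used for drug interventions and unspecified phases, and the placeholder NCT IDs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Statuses kept by the selection step (Section 3.1).
KEPT_STATUSES = {"Completed", "Terminated"}
# Assumed encodings of an unspecified phase (missing, empty, or "Unknown").
UNSPECIFIED_PHASES = {None, "", "Unknown"}


@dataclass
class Trial:
    """Hypothetical trial record; field names are illustrative only."""
    nct_id: str
    status: Optional[str]
    phase: Optional[str]
    intervention_type: Optional[str]


def is_eligible(trial: Trial) -> bool:
    """Keep drug trials with an explicit status and a specified phase."""
    return (
        trial.intervention_type == "Drug"
        and trial.status in KEPT_STATUSES
        and trial.phase not in UNSPECIFIED_PHASES
    )


def select_trials(trials: List[Trial]) -> List[Trial]:
    """Apply the eligibility predicate to every candidate trial."""
    return [t for t in trials if is_eligible(t)]


# Placeholder candidates (not real registry entries).
candidates = [
    Trial("NCT00000001", "Completed", "Phase 2", "Drug"),
    Trial("NCT00000002", "Terminated", "Phase 3", "Drug"),
    Trial("NCT00000003", "Recruiting", "Phase 1", "Drug"),    # no outcome yet
    Trial("NCT00000004", "Completed", "Unknown", "Drug"),     # unspecified phase
    Trial("NCT00000005", "Completed", "Phase 2", "Device"),   # not a drug trial
    Trial("NCT00000006", None, "Phase 1", "Drug"),            # missing status
]
selected = select_trials(candidates)
print([t.nct_id for t in selected])
```

Keeping the rule in a single predicate makes it easy to audit which criterion removed a given trial.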

Figure 3.

Data analysis of clinical trials.

3.2. Clinical trial report filtering

The clinical trial reports are retrieved from PubMed using the NCT ID of each trial. However, not all clinical trials have clinical trial reports, especially those in Terminated status. Additionally, the formats of clinical trial reports may vary dramatically due to the different writing styles of clinical trial researchers. Priority is given to clinical trial reports that have sub-section headings such as Objective, Background, Methods, Results, and Conclusion. Generally, the Conclusion section describes the effectiveness of a clinical trial and is thus kept separately for the verification of clinical trial outcomes. Eventually, a total of 11,263 clinical trial reports are retained from the collected clinical trial reports. Each clinical trial report consists of the fields NCT ID (ClinicalTrials.gov ID), PmID, title, objective, background, methods, and results. Table 1 shows the combined criteria for clinical trial selection and clinical trial report filtering.

Table 1.
The criteria for clinical trial selection and clinical trial report filtering.

Criteria	Description
NCT ID	Ensure clarity and identifiability of the NCT ID of a clinical trial.
Explicit trial status	Select clinical trials with clear trial statuses including Completed and Terminated.
Specified trial phases	Filter out clinical trials with unspecified trial phases.
Drug interventions	Priority is given to clinical trials focusing on drug interventions.
Existence of clinical trial reports	Select clinical trial reports retrieved based on the NCT ID.
Required section of clinical trial reports	Select clinical trial reports with sub-section headings: Objective, Background, Methods, Results.
Conclusion section separation	Separate the conclusion section of clinical trial reports for human annotation verification.

3.3. Data association and annotation

After collecting eligible clinical trials and clinical trial reports, the two types of data are associated using the common field NCT ID as the linking identifier. Each data entry in the resulting dataset consists of the following fields: NCT ID, phase, status, PmID, title, objective, background, methods, and results. After the association, the data annotation approach of HINT¹ is adopted to improve the alignment of prediction outcomes with real-world scenarios. Trials with a status of Terminated are labeled as failure. Trials with a status of Completed are divided into two categories: completed trials that achieve the trial objectives are annotated as success, while completed trials that do not achieve the expected objectives are annotated as failure. The annotations are further verified by three human experts, who refer to the conclusion sections of the trial reports to ensure correctness.
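The association and labeling rules above can be sketched as a join on NCT ID followed by a small decision function. This is an illustrative sketch under stated assumptions: records are plain dictionaries with a hypothetical `nct_id` key, each trial has at most one report, and `objectives_met` stands in for the human judgment derived from the conclusion section.

```python
def associate(trials, reports):
    """Join trial records and report records on the shared NCT ID.

    Assumes at most one report per trial; trials without a report are dropped.
    """
    reports_by_id = {r["nct_id"]: r for r in reports}
    merged = []
    for trial in trials:
        report = reports_by_id.get(trial["nct_id"])
        if report is not None:
            merged.append({**trial, **report})
    return merged


def annotate(status, objectives_met):
    """Labeling rule: Terminated is failure; Completed depends on the objectives.

    `objectives_met` is a hypothetical boolean supplied by a human annotator.
    """
    if status == "Terminated":
        return "failure"
    if status == "Completed":
        return "success" if objectives_met else "failure"
    raise ValueError(f"unexpected status: {status!r}")


# Placeholder records (not real registry entries).
trials = [
    {"nct_id": "NCT00000001", "phase": "Phase 2", "status": "Completed"},
    {"nct_id": "NCT00000002", "phase": "Phase 3", "status": "Terminated"},
    {"nct_id": "NCT00000003", "phase": "Phase 1", "status": "Completed"},
]
reports = [
    {"nct_id": "NCT00000001", "title": "Report A"},
    {"nct_id": "NCT00000002", "title": "Report B"},
]
entries = associate(trials, reports)
print([e["nct_id"] for e in entries])
print(annotate("Completed", True), annotate("Completed", False), annotate("Terminated", True))
```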

Finally, the resulting dataset POCT consists of 8,466 trials labeled as success and 2,797 trials labeled as failure. The dataset is randomly divided into training, development, and test sets of 7,883, 1,690, and 1,690 trials, respectively. Each subset maintains a ratio of positive to negative samples of approximately 3:1. The statistical information and a data sample of the POCT dataset are shown in Tables 2 and 3, respectively. Additionally, data from phase I, phase II, and phase III trials are extracted from the POCT dataset to construct three datasets for phase-level clinical trial outcome prediction. The sizes of these datasets are 792, 3,202, and 6,195 trials for phase I, II, and III, respectively, and their positive-to-negative sample ratios are 4:1, 7:3, and 3.5:1. For the phase-level prediction, the ratio of training, development, and test data is maintained at approximately 7:1.5:1.5.

Table 2.
The statistical information of the POCT dataset.

Label	Training	Developing	Testing	Total
Success	5,926	1,270	1,270	8,466
Failure	1,957	420	420	2,797
Total	7,883	1,690	1,690	11,263
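As a quick check on Table 2, the per-label counts are consistent with splitting each label separately in roughly a 70/15/15 ratio. The sketch below reproduces the table under that assumption; the per-label (stratified) procedure and the rounding rule in the hypothetical helper `split_counts` are not stated in the text, which only says that the split is random and that each subset keeps a ratio of about 3:1.

```python
def split_counts(n, train_percent=70):
    """Split n samples into (train, dev, test) counts.

    Assumed rounding: floor for the training share, then halve the remainder.
    """
    train = n * train_percent // 100
    rest = n - train
    dev = rest // 2
    test = rest - dev
    return train, dev, test


label_totals = {"Success": 8466, "Failure": 2797}
splits = {label: split_counts(n) for label, n in label_totals.items()}
totals = tuple(sum(s[i] for s in splits.values()) for i in range(3))

for label, (train, dev, test) in splits.items():
    print(label, train, dev, test)
print("Total", *totals)
```

Under these assumptions the computed counts match every cell of Table 2, and the success-to-failure ratio in each subset stays close to 3:1.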

Table 3.

A data sample of the POCT dataset.

Data fields	Data values
NCT ID	NCT02659397
Phase	phase 2
Status	Completed
PmID	31202641
Title	Complementary low-density lipoprotein-cholesterol lowering and pharmacokinetics of adding bempedoic acid (ETC-1002) to high-dose atorvastatin background therapy in hypercholesterolemic patients: A randomized placebo-controlled trial
Objective	The aim of the study was to assess the low-density lipoprotein cholesterol (LDL-C)-lowering efficacy of bempedoic acid added to stable high-intensity atorvastatin background therapy and multiple-dose plasma pharmacokinetics of atorvastatin alone and combined with steady-state bempedoic acid.
Background	Bempedoic acid is an oral, once-daily, first-in-class medication being developed to treat hypercholesterolemia.
Methods	This was a phase 2 study in patients with hypercholesterolemia. Patients received once-daily open-label atorvastatin 80 mg for 4 weeks then were randomized 2:1 at baseline to receive double-blind bempedoic acid 180 mg (n = 45) or placebo (n = 23) plus open-label atorvastatin 80 mg for 4 weeks. Efficacy was assessed 4 weeks after randomization. Atorvastatin and metabolites’ steady-state levels were analyzed before first dosing with bempedoic acid and after 2 weeks of treatment.
Results	The 4-week stabilization phase with 80 mg atorvastatin resulted in approximately 40% lowering of LDL-C values from screening. The placebo-adjusted least squares mean lowering of LDL-C from baseline to Day 29 with bempedoic acid was 22% (P = .003). Placebo-adjusted reductions from baseline with bempedoic acid also were significant for total cholesterol (−10%; P = .014), non-high-density lipoprotein cholesterol (−13%; P = .015), apolipoprotein B (−15%; P = .004), and high-sensitivity C-reactive protein (−44%; P = .002). Point estimates of bempedoic acid effects on steady-state atorvastatin and ortho-hydroxy atorvastatin area under the curve were <30% and not clinically meaningful.
Outcome	Completed, Positive outcome/primary endpoint(s) met.
Label	Success

4. The two-stage method for clinical trial outcome prediction

A two-stage method is proposed for predicting clinical trial outcomes. This method consists of two main modules: 1) a summarization module, which generates concise summarization text for clinical trial reports, and 2) a prediction module, which analyzes the generated summarization text to predict clinical trial outcomes. The overall architecture is illustrated in Figure 4. Firstly, each section of a clinical trial report is formatted into key-value pairs using a linearization layer. Secondly, prompt templates combined with the key-value pairs are input to Gemini-Pro to generate summarization text for the reports. Finally, summarization text is fed into a Longformer-based classifier to predict clinical trial outcomes.

Figure 4.

Overview of the architecture of the proposed method for clinical trial outcome prediction. Two icons in the figure mark frozen and tunable weights during tuning, respectively.

4.1. The summarization generation

To generate a brief and precise summary for a given clinical trial report, the key challenge is identifying and understanding the critical content in the report. This requires the summarization model to have strong biomedical, commonsense, and numerical reasoning abilities. Additionally, the text length of clinical trial reports exceeds what most pre-trained language models can handle. To address these challenges, a large language model (LLM) with powerful reasoning and ultra-long text handling capabilities is chosen.

Gemini-Pro,³² a large language model recently developed by Google, is utilized as the summarization model to generate summarization text for clinical trial reports. Specifically, each trial report is first formatted into key-value pairs. After that, a prompt concatenated with the key-value pairs is used to ask Gemini-Pro to generate a summarization text for the input. The first two steps, linearization and prompting, are described below:

Linearization. In linearization, the content $s_{i, k}$ of each section of a trial report $c_{i}$ and the section name $k$ are formatted into a key-value pair, in which the key is the section name $k$ and the value is the corresponding content $s_{i, k}$. This can be formulated as equation (1), where $k$ is one of the sections objective, background, methods, and results.

$$\mathrm{Linearize}(c_{i}) = \{c_{i, k} : s_{i, k}\}, \tag{1}$$

Prompting. As shown in Figure 4, the prompt used to communicate with Gemini-Pro consists of two components: a prompt template $T$, which guides Gemini-Pro on how to summarize trial reports in natural language under set conditions, and the linearized report. This process can be formulated as equation (2), where $c_{i}$ represents the $i$-th clinical trial report, $T$ is the prompt template, and $P_{i}$ refers to the prompt corresponding to $c_{i}$.

$$P_{i} = (T; \mathrm{Linearize}(c_{i})) \tag{2}$$

Given the prompt $P_{i}$, Gemini-Pro generates a concise summarization text for the clinical trial report $c_{i}$, as formulated in equation (3), where $X_{i}$ refers to the summarization text corresponding to $c_{i}$.

$$X_{i} = \text{Gemini-Pro}(P_{i}) \tag{3}$$
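Equations (1) to (3) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the template wording in `TEMPLATE` is hypothetical (the actual template $T$ appears only in Figure 4), the `key: value` text layout is one possible linearization, and the language model is passed in as a plain callable so that no particular client API is assumed.

```python
from typing import Callable, Dict

# Sections used as keys k in equation (1).
SECTIONS = ("objective", "background", "methods", "results")

# Hypothetical template T; the real wording is not reproduced in the text.
TEMPLATE = (
    "Summarize the following clinical trial report, "
    "keeping information relevant to the trial outcome.\n"
)


def linearize(report: Dict[str, str]) -> str:
    """Equation (1): format each section as a 'key: value' line."""
    return "\n".join(f"{k}: {report[k]}" for k in SECTIONS if k in report)


def build_prompt(template: str, report: Dict[str, str]) -> str:
    """Equation (2): concatenate the template T with the linearized report."""
    return template + linearize(report)


def summarize(report: Dict[str, str], llm: Callable[[str], str],
              template: str = TEMPLATE) -> str:
    """Equation (3): send the prompt P_i to the language model, get X_i."""
    return llm(build_prompt(template, report))


def echo_llm(prompt: str) -> str:
    """Stand-in for the real model so the sketch runs offline."""
    return f"SUMMARY({len(prompt)} chars)"


report = {"objective": "A", "background": "B", "methods": "C", "results": "D"}
print(linearize(report))
print(summarize(report, echo_llm))
```

Passing the model as a callable keeps the linearization and prompting logic testable without network access; a real client would replace `echo_llm`.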

4.2. The outcome prediction

The clinical trial outcome prediction is modelled as a binary classification task. After processing by the summarization module, each clinical trial has a corresponding summarization text. The Longformer tokenizer is adopted to convert the words of each summarization text into tokens represented as numerical vectors. Each summarization $X$ is represented by the tokenizer as $\bar{X} = \{x^{cls}, x_{1}, x_{2}, \dots, x_{i}, \dots, x_{l}\}$, where $x^{cls}$ denotes a special classification token, $x_{i}$ represents the $i$-th token in the summary, and $l$ indicates the length of the summary. Then, Longformer is adopted as an encoder that takes $\bar{X}$ as input to obtain its global representation. The Longformer consists of a stack of 12 encoder layers, each containing a global + sliding window attention mechanism with 8 attention heads, a feed-forward layer with GELU activation and layer normalization, and a residual connection. The Longformer processes the input sequentially through its layers, as described below:

$$\begin{aligned} {\bar{X}}_{0} & = \bar{X} + X_{pos}, \\ Y_{k} & = {\bar{X}}_{k - 1} + \mathrm{LN}(G_{S}({\bar{X}}_{k - 1})), \\ {\bar{X}}_{k} & = Y_{k} + \mathrm{LN}(\mathrm{FFN}(Y_{k})), \end{aligned} \tag{4}$$

where $X_{pos}$ represents the position embedding, $G_{S}$ denotes the global + sliding window attention, and LN and FFN represent the layer-normalization layer and feed-forward layer, respectively. ${\bar{X}}_{k}$ is the output of the $k$-th encoder layer. When $k$ equals 12, ${\bar{X}}_{12}$ represents the final output of Longformer.

Then, the representation of the $x^{c l s}$ in ${\bar{X}}_{12}$ serves as the global representation of the input $\bar{X}$ , which can be computed as equation (5), where ${\bar{x}}^{c l s}$ represents the global representation of $\bar{X}$ .

$${\bar{x}}^{cls} = {\bar{X}}_{12}[0] \tag{5}$$

Finally, the classification layer is adopted to process the global representation ${\bar{x}}^{c l s}$ generated by Longformer to predict the outcome of the clinical trial. This computation is formulated as equation (6),

$$p = \mathrm{Softmax}(W_{2}(\mathrm{ReLU}(W_{1}{\bar{x}}^{cls} + b_{1})) + b_{2}), \tag{6}$$

where Softmax is used to convert raw scores into a probability distribution, $W_{1}, W_{2}$ and $b_{1}, b_{2}$ represent the weight matrices and bias vectors of the fully connected layers, respectively, and ReLU is the rectified linear unit activation function. $p$ represents the predicted probability of success of the clinical trial. If $p$ is greater than 0.5, the clinical trial is considered a success. Otherwise, it is considered a failure.
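The classification layer in equation (6) and the 0.5 decision threshold can be traced with a small pure-Python sketch. The toy dimensions and weight values are invented for illustration, and treating index 1 of the softmax output as the success class is an assumption; the text does not specify the output ordering.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(a, b):
    """Element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]


def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]


def softmax(z):
    """Numerically stable softmax over a list of raw scores."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x_cls, W1, b1, W2, b2, threshold=0.5):
    """Equation (6): Softmax(W2 ReLU(W1 x + b1) + b2), then threshold p."""
    hidden = relu(add(matvec(W1, x_cls), b1))
    probs = softmax(add(matvec(W2, hidden), b2))
    p_success = probs[1]  # assumption: index 1 is the success class
    label = "success" if p_success > threshold else "failure"
    return p_success, label


# Toy 2-dimensional example with made-up weights.
x_cls = [1.0, -2.0]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[0.0, 0.0], [2.0, 0.0]], [0.0, 0.0]
p, label = classify(x_cls, W1, b1, W2, b2)
print(round(p, 4), label)
```

In this toy case the hidden vector is `[1.0, 0.0]`, the raw scores are `[0.0, 2.0]`, and the success probability is $e^{2}/(1+e^{2})$, which exceeds the threshold.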

4.3. Training objective

During training, the parameters of Longformer are updated while the parameters of Gemini-Pro are kept frozen. The training objective is to minimize the cross-entropy loss in equation (7),

$$L_{ce} = - \sum (y \log p + (1 - y) \log (1 - p)), \tag{7}$$

where $y$ denotes the ground-truth label in the dataset and $p$ represents the probability value.
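A short worked example of equation (7) may help. The sketch assumes that $y = 1$ encodes success and $y = 0$ encodes failure, sums over the samples as the equation does, and clips probabilities with a small `eps` (an addition for numerical safety, not part of the equation).

```python
import math


def cross_entropy(labels, probs, eps=1e-12):
    """Equation (7): summed binary cross-entropy over all samples.

    labels: ground-truth y in {0, 1}; probs: predicted success probability p.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Two samples: a success predicted at 0.9 and a failure predicted at 0.2.
loss_good = cross_entropy([1, 0], [0.9, 0.2])
# The same samples with confident but wrong predictions.
loss_bad = cross_entropy([1, 0], [0.1, 0.8])
print(round(loss_good, 4), round(loss_bad, 4))
```

The first loss equals $-\ln 0.9 - \ln 0.8$, and the second is larger because confident wrong predictions are penalized more heavily.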

5. Experiments and results

5.1. Settings

Baseline Methods. A list of baseline methods is utilized for performance comparison on the POCT dataset.

BERT³⁵ was based on the Transformer architecture and was capable of unsupervised pre-training on large-scale text data.

RoBERTa³⁶ was built upon BERT by using larger training datasets, longer training times, and dynamic masking strategies, enhancing performance and robustness.

DeBERTa⁴¹ introduced disentangled attention mechanisms and a new decoder architecture on top of BERT for improving model performance and generalization capabilities.

Due to the limited input length of the above models, it is infeasible to input an entire clinical trial report into the models at once. To tackle this, a block processing approach that considers the characteristics of clinical trial reports is adopted. Each clinical trial report is first split into several sub-reports, and each sub-report is fed to the model to produce a feature representation. Finally, the feature representations of all sub-reports are concatenated to predict the trial outcome.

Longformer³⁹ introduced an attention mechanism that scaled linearly with sequence length, making it easy to process documents of thousands of tokens or longer.

BART²⁸ was a transformer encoder-decoder model with a bidirectional encoder and an autoregressive decoder. It introduced a denoising autoencoder to pre-train sequence-to-sequence models. For generation tasks, the noising function was text infilling, which used single mask tokens to mask randomly sampled text spans.

Pegasus⁴² was a sequence-to-sequence model with the same encoder-decoder model architecture as BART. It employed gap-sentences generation as a pre-training objective tailored for abstractive text summarization.

Gemini-Pro³² was built on an enhanced Transformer decoder, employing an efficient attention mechanism that supported context lengths of up to 32K tokens. Zero-shot testing of Gemini-Pro was conducted on the POCT dataset with a provided prompt, as depicted in Figure 5.

Figure 5.

Zero-shot Gemini-Pro prompt.

Evaluation Metrics. Given the class imbalance in the POCT dataset, Weighted F1, Macro F1, and Micro F1 scores are employed to evaluate the classification performance. The metrics are formulated as follows:

$$\begin{aligned} F1 & = \frac{2 \times Precision \times Recall}{Precision + Recall}, \\ F1_{weighted} & = \frac{\sum_{i = 1}^{C} N_{i} \times F1_{i}}{N} \end{aligned} \tag{8}$$

$$\begin{aligned} F1_{micro} & = \frac{2 \times TP_{total}}{2 \times TP_{total} + FP_{total} + FN_{total}}, \\ F1_{macro} & = \frac{1}{C} \sum_{i = 1}^{C} F1_{i} \end{aligned} \tag{9}$$

Precision is calculated as the ratio of true positive predictions to the sum of true positive and false positive predictions, while Recall is calculated as the ratio of true positive predictions to the sum of true positive predictions and false negative predictions. $C$ represents the number of classes and $N_{i}$ is the number of samples in class $i$. $F1_{i}$ represents the F1 score for class $i$, and $N$ represents the total number of samples. $TP_{total}$ is the total number of true positives, $FP_{total}$ represents the total number of false positives, and $FN_{total}$ represents the total number of false negatives.
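The three averaged scores in equations (8) and (9) can be computed directly from label lists, as in the following sketch. It uses the identity $F1 = 2TP / (2TP + FP + FN)$, which is algebraically equivalent to the precision and recall form, and it returns 0 for a class with no true positives, false positives, or false negatives (a convention chosen here, not stated in the text). The function names are illustrative.

```python
def class_counts(y_true, y_pred, cls):
    """Return (TP, FP, FN) for one class treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred):
    """Compute Weighted, Macro, and Micro F1 as in equations (8) and (9)."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: class_counts(y_true, y_pred, c) for c in classes}
    f1 = {c: f1_from_counts(*counts[c]) for c in classes}
    n = len(y_true)
    weighted = sum(y_true.count(c) * f1[c] for c in classes) / n
    macro = sum(f1.values()) / len(classes)
    tp_total = sum(c[0] for c in counts.values())
    fp_total = sum(c[1] for c in counts.values())
    fn_total = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp_total, fp_total, fn_total)
    return {"weighted": weighted, "macro": macro, "micro": micro}


# Imbalanced toy example: three successes (1) and one failure (0).
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
scores = f1_scores(y_true, y_pred)
print({k: round(v, 4) for k, v in scores.items()})
```

In this example the per-class F1 values are 0.8 for class 1 and 2/3 for class 0, so the weighted score leans toward the majority class while the macro score treats both classes equally. For single-label classification, Micro F1 coincides with accuracy (here 3 of 4 correct).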

Implementation Details. Gemini-Pro is adopted as the backbone network for generating summarization text, and Longformer is employed for predicting outcomes. The baseline models utilized are as follows: BERT-base-uncased, RoBERTa-base, DeBERTa-base, Longformer-base-4096, BART-finetuned-summarization-PubMed, and BigBird-Pegasus-Large-PubMed. All deep learning-based models are implemented using PyTorch⁴³ and run on an NVIDIA RTX 3090 GPU. The AdamW⁴⁴ optimizer is utilized for all the models. Throughout the training process, the initial learning rate is set to $1 \times 10^{-5}$, with adjustment governed by the cosine annealing rule. The maximum number of epochs is set to 20, and the patience of the early stopping mechanism is set to 5 rounds. Under this setting, the results reported on the test set are averaged over three random seeds.
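The learning-rate schedule and early stopping rule can be illustrated with a small framework-free sketch. Several details are assumptions, since the text does not specify them: the schedule is stepped once per epoch, the annealing period equals the 20-epoch maximum, the minimum learning rate is 0, and early stopping monitors a development-set score where higher is better.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-5, t_max=20, eta_min=0.0):
    """Cosine annealing: decay from base_lr to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


def early_stopping(dev_scores, patience=5, max_epochs=20):
    """Return (best_epoch, last_epoch_run) for a sequence of dev scores.

    Training stops once the score has not improved for `patience` epochs.
    """
    best_score, best_epoch, stale = float("-inf"), -1, 0
    scores = dev_scores[:max_epochs]
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch
    return best_epoch, len(scores) - 1


# Hypothetical development-set scores, one per epoch.
dev_scores = [0.70, 0.72, 0.71, 0.71, 0.70, 0.71, 0.70, 0.69]
best_epoch, last_epoch = early_stopping(dev_scores)
print(best_epoch, last_epoch)
print(cosine_annealing_lr(0), cosine_annealing_lr(10), cosine_annealing_lr(20))
```

With these hypothetical scores the best epoch is 1 and training stops at epoch 6, after five consecutive epochs without improvement. The learning rate starts at the initial value, is halved at the midpoint, and reaches the minimum at the final epoch.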

For BART or Pegasus, a two-stage training strategy is employed by combining them with Longformer. Firstly, BART or Pegasus generates a summarization text for the trial report. During training, evaluation metrics such as ROUGE-1, ROUGE-2, and ROUGE-L are used to determine the quality of the generated summary texts. To ensure consistency between the training and prediction stages, a cross-validation strategy is utilized to generate text summaries for all data in the dataset. The generated summary texts are input into the Longformer model for outcome prediction.
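One plausible reading of the cross-validation strategy is out-of-fold generation: each report is summarized by a summarizer that was fine-tuned without seeing that report. The sketch below illustrates this reading only. The callable `fit_summarizer` is a hypothetical stand-in for fine-tuning BART or Pegasus, the reference summaries needed for fine-tuning are omitted, and the number of folds is an assumption.

```python
def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_summaries(reports, k, fit_summarizer):
    """Summarize every report with a model that never saw it during fitting."""
    summaries = [None] * len(reports)
    for held_out in k_fold_indices(len(reports), k):
        held = set(held_out)
        train = [r for i, r in enumerate(reports) if i not in held]
        summarize = fit_summarizer(train)  # hypothetical: returns a callable
        for i in held_out:
            summaries[i] = summarize(reports[i])
    return summaries


def first_sentence_summarizer(train_reports):
    """Toy stand-in for a fine-tuned model: keep only the first sentence."""
    return lambda text: text.split(".")[0].strip() + "."


toy_reports = ["A. x", "B. y", "C. z", "D. w", "E. v"]
toy_summaries = out_of_fold_summaries(toy_reports, 2, first_sentence_summarizer)
```

Generating summaries this way keeps the classifier's training inputs comparable to what it receives at prediction time, since no summary comes from a model that memorized its source report.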

5.2. The results on clinical trial outcome prediction

The performance comparison of our method and the baseline methods is presented in Table 4. Our method achieves the best performance on three of the four evaluation metrics and the highest average score. Compared to Gemini-Pro, the average score of our method is 1.24 percentage points higher, with gains on 3 out of 4 metrics: Weighted F1 improves by 0.93, Micro F1 by 2.19, and F1 by 2.13 points, whereas Macro F1 is 0.29 points lower. Longformer, which excels in handling long sequences, achieves better performance than the other one-stage models BERT, RoBERTa, and DeBERTa. Our method nevertheless outperforms Longformer on all metrics, with an average improvement of 0.55 points. This suggests that, while Longformer is adept at capturing long-range dependencies, our method benefits from combining the summarization strengths of Gemini-Pro with the contextual reasoning capabilities of Longformer. Furthermore, in comparison to the other models utilizing two-stage training strategies, BART+Longformer and Pegasus+Longformer, our method exhibits a substantial performance increase, with a maximum average improvement of 5.73 points (over Pegasus+Longformer). These results indicate that combining the summarization ability of a generative model with the detailed reasoning capability of a transformer-based classifier yields the best overall performance on the POCT dataset.

Table 4.
The performance comparison of the methods on the POCT dataset.

Type	Model	Weighted F1	Micro F1	Macro F1	F1	Avg.
One-Stage	BERT	78.73	78.76	71.49	85.88	78.71
	RoBERTa	77.06	79.46	66.69	87.32	77.76
	DeBERTa	76.71	79.17	66.16	87.14	77.29
	Longformer	79.95	80.95	71.98	87.83	80.17
Two-Stage	BART+Longformer	75.75	78.57	64.51	86.85	76.42
	Pegasus+Longformer	74.86	75.50	65.56	84.06	74.99
LLM	Gemini-Pro	79.55	79.34	72.92	86.11	79.48
Two-Stage	Ours	80.48	81.53	72.63	88.24	80.72
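The improvements quoted in the text are absolute differences between rows of Table 4. The short sketch below reproduces them from the tabulated values; the dictionary layout and the helper name `gains_over` are assumptions made for this example.

```python
METRICS = ("Weighted F1", "Micro F1", "Macro F1", "F1", "Avg.")

# Rows copied from Table 4 (only the models discussed in the text).
TABLE4 = {
    "Longformer": (79.95, 80.95, 71.98, 87.83, 80.17),
    "Pegasus+Longformer": (74.86, 75.50, 65.56, 84.06, 74.99),
    "Gemini-Pro": (79.55, 79.34, 72.92, 86.11, 79.48),
    "Ours": (80.48, 81.53, 72.63, 88.24, 80.72),
}


def gains_over(baseline):
    """Absolute gain of our method over a baseline, per metric, in points."""
    ours, base = TABLE4["Ours"], TABLE4[baseline]
    return {m: round(o - b, 2) for m, o, b in zip(METRICS, ours, base)}


gemini_gains = gains_over("Gemini-Pro")
longformer_gains = gains_over("Longformer")
pegasus_gains = gains_over("Pegasus+Longformer")
```

A negative entry, such as Macro F1 against Gemini-Pro, marks a metric on which the baseline scores higher.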

5.3. The ablation study

In the POCT dataset, clinical trial reports consist of four main sections: objective, background, methods, and results. The methods and results sections occupy a significant portion of the content, so experiments are conducted to investigate their importance in clinical trial outcome prediction. Specifically, the sections are removed step by step to observe the impact on overall performance. The model trained with all sections (denoted ALL) serves as the baseline. The results and methods sections are denoted as $R$ and $M$, respectively. Accordingly, w/o $R$ denotes removing the results section, and w/o $M_{R}$ denotes removing both the methods and results sections. The results of the ablation study are shown in Table 5. Compared to the baseline ALL, removing the results section decreases average performance by 0.32 percentage points. Removing both the results and methods sections leads to a much larger decrease of 6.55 points. Although removing the results section alone reduces performance, the decline is small and may be acceptable considering the additional computational cost of including the section as input. Additionally, the results section contains various detailed information about the trials, such as medication timing, dosages, and mathematical calculations, which poses challenges for language models.

Table 5.
The ablation study on the POCT dataset.

Section	Weighted F1	Micro F1	Macro F1	F1	Avg.
ALL	80.48	81.53	72.63	88.24	80.72
w/o R	80.17	81.18	72.27	87.99	80.40
w/o $M_{R}$	73.84	75.32	63.24	84.31	74.17
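The ablation inputs can be built by dropping labelled sections from each report. The following minimal sketch assumes that every report marks its sections with the headers "Objectives:", "Background:", "Methods:", and "Results:", as in the example report of Table 7. The function names are hypothetical, and a body that itself contains one of these header strings would be split incorrectly.

```python
import re

SECTION_NAMES = ("Objectives", "Background", "Methods", "Results")
_HEADER = re.compile(r"\b(" + "|".join(SECTION_NAMES) + r"):\s*")


def split_sections(report):
    """Map each section name to its text, preserving the original order."""
    # re.split with a capturing group yields [preamble, name1, body1, ...].
    parts = _HEADER.split(report)
    return {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}


def drop_sections(report, removed):
    """Rebuild the report without the sections listed in `removed`."""
    sections = split_sections(report)
    kept = [f"{name}: {body}" for name, body in sections.items() if name not in removed]
    return " ".join(kept)


toy_report = (
    "Objectives: test drug X. Background: p38 matters. "
    "Methods: 99 patients randomized. Results: endpoint not met."
)
without_results = drop_sections(toy_report, {"Results"})             # w/o R
without_both = drop_sections(toy_report, {"Methods", "Results"})     # w/o M_R
```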

5.4. The results on phase level trial outcome prediction

Clinical trials are divided into multiple phases, and predicting the success of a trial at the earliest possible phase is of great value for clinical research. To test the effectiveness of our method in different phases, separate models are trained using training instances from the corresponding phases. Our method is compared with several baseline methods, and the results are shown in Table 6. Averaged over all phases, our method exceeds all baseline methods on phase-level trial outcome prediction, improving by 1.45, 3.17, and 11.87 percentage points compared to Gemini-Pro, Longformer, and BERT, respectively. Our method outperforms the baseline models on most metrics in phase I and II trials, whereas Longformer obtains the highest scores in phase III. Compared to the second-best model, Gemini-Pro, our method improves by 1.60, 1.19, and 1.54 points in phase I, II, and III trials, respectively. This suggests that leveraging data generated by LLMs to assist in fine-tuning pre-trained language models can achieve better performance, particularly in scenarios with limited training data and class imbalance.

Table 6.
Performance on phase level outcome prediction of various approaches.

		Methods
Phase Data	Metrics	BERT	Longformer	Gemini-Pro	Ours
Phase I	Weighted F1	68.82	68.82	76.64	77.60
	Micro F1	78.33	78.33	75.00	80.00
	Macro F1	43.92	43.92	68.65	64.00
	F1	87.85	87.85	82.75	88.00
Phase II	Weighted F1	64.04	76.28	78.91	80.02
	Micro F1	72.55	77.33	78.79	80.04
	Macro F1	51.48	70.91	75.20	75.90
	F1	83.45	84.58	84.63	86.37
Phase III	Weighted F1	68.00	86.20	83.30	84.81
	Micro F1	77.74	86.45	83.24	84.62
	Macro F1	43.73	79.71	75.93	78.33
	F1	87.47	91.40	89.19	90.06
Average		68.94	77.64	79.35	80.81
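Training separate models per phase amounts to partitioning the instances by phase before fitting. The sketch below shows this partitioning under assumed field names (`phase`, `label`). The majority-label model is only a toy stand-in for fine-tuning the actual classifier.

```python
from collections import Counter, defaultdict


def group_by_phase(trials):
    """Partition trial records by their phase field."""
    groups = defaultdict(list)
    for trial in trials:
        groups[trial["phase"]].append(trial)
    return dict(groups)


def train_per_phase(trials, train_fn):
    """Fit one independent model per phase using only that phase's instances."""
    return {phase: train_fn(subset) for phase, subset in group_by_phase(trials).items()}


def majority_label_model(subset):
    """Toy stand-in for fine-tuning: always predict the most frequent label."""
    label, _ = Counter(t["label"] for t in subset).most_common(1)[0]
    return lambda text: label


# Hypothetical records; real instances would carry the generated summaries.
toy_trials = [
    {"phase": "I", "label": "Success"},
    {"phase": "I", "label": "Success"},
    {"phase": "I", "label": "Failure"},
    {"phase": "II", "label": "Failure"},
    {"phase": "II", "label": "Failure"},
    {"phase": "II", "label": "Success"},
]
phase_models = train_per_phase(toy_trials, majority_label_model)
```

Because each phase-specific model sees only a fraction of the data, this setting stresses performance under limited and imbalanced training instances.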

5.5. The results on predicting cross-phase transition

Clinical trials are conducted in phases, and each phase builds on the previous one. Therefore, the ability of our method to transfer across phases needs to be systematically evaluated. At each transition, only the data from the earlier phase is used for training. For example, when moving from phase I to phase II, only the data from phase I is used. The performance of the cross-phase transfers is shown in Figure 6. The results indicate that our model has a certain degree of transfer capability. Compared to the transition from phase I to phase III, the transition from phase II to phase III improves by 1.91, 0.22, and 4.89 percentage points in Weighted F1, Micro F1, and Macro F1, respectively. Considering the limited training data utilized in the cross-phase tasks, these experimental results should be regarded as indicative of the transfer capability of our method.

Figure 6.

The performance of our method in predicting cross-phase transfer.
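The cross-phase protocol can be sketched as follows: for each transition, a model is fitted on the source phase only and scored on the target phase. The field names, the toy model, and the use of accuracy as the score are assumptions for illustration. The three transitions listed are those mentioned in the text.

```python
from collections import Counter

TRANSITIONS = [("I", "II"), ("I", "III"), ("II", "III")]


def cross_phase_split(trials, source, target):
    """Training data comes only from the source phase, test data from the target."""
    train = [t for t in trials if t["phase"] == source]
    test = [t for t in trials if t["phase"] == target]
    return train, test


def evaluate_transitions(trials, train_fn, score_fn):
    """Fit on each source phase and score on the corresponding target phase."""
    results = {}
    for source, target in TRANSITIONS:
        train, test = cross_phase_split(trials, source, target)
        results[(source, target)] = score_fn(train_fn(train), test)
    return results


def majority_label_model(subset):
    """Toy stand-in for the classifier: predict the most frequent label."""
    label, _ = Counter(t["label"] for t in subset).most_common(1)[0]
    return lambda trial: label


def accuracy(model, test):
    """Fraction of target-phase trials whose label the model predicts."""
    return sum(model(t) == t["label"] for t in test) / len(test)


# Hypothetical label sequences per phase (S = Success, F = Failure).
toy_trials = [
    {"phase": p, "label": "Success" if c == "S" else "Failure"}
    for p, labels in (("I", "SSF"), ("II", "SFF"), ("III", "SSSF"))
    for c in labels
]
transfer_scores = evaluate_transitions(toy_trials, majority_label_model, accuracy)
```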

5.6. Decision making

To enhance the trust and understanding of clinical researchers in the model prediction results, we employ SHAP values to elucidate the decision-making process of our method. The explanatory result is shown in Figure 7, where the text marked in red increases the probability of our method predicting the clinical trial outcome as successful, while the text marked in blue decreases this probability. Highlighting the key factors that influence a prediction in this way helps clinical researchers understand the method and strengthens their trust in its decision-making process.

Figure 7.

Decision making process.
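SHAP attributions approximate Shapley values, which average the marginal contribution of each token over all orderings in which tokens can be revealed. The sketch below computes exact Shapley values for a toy scoring function over four tokens. The scoring function, its weights, and the token list are invented for illustration and are unrelated to the trained model. Exact enumeration is feasible only for a handful of tokens, which is why practical tools rely on approximations.

```python
from itertools import permutations


def shapley_values(n_tokens, value_fn):
    """Exact Shapley value per token index for value_fn(frozenset_of_indices)."""
    totals = [0.0] * n_tokens
    orderings = list(permutations(range(n_tokens)))
    for order in orderings:
        present = set()
        prev = value_fn(frozenset(present))
        for i in order:
            present.add(i)
            cur = value_fn(frozenset(present))
            totals[i] += cur - prev  # marginal contribution of token i
            prev = cur
    return [t / len(orderings) for t in totals]


def toy_success_score(words):
    """Hypothetical probability of 'success' given the visible words."""
    score = 0.5
    if "reduction" in words:
        score += 0.2  # supportive evidence raises the score
    if "not" in words and "met" in words:
        score -= 0.4  # "not met" only matters when both words are visible
    return score


tokens = ["endpoint", "not", "met", "reduction"]
attributions = shapley_values(
    len(tokens), lambda idx: toy_success_score({tokens[i] for i in idx})
)
```

Positive attributions correspond to the red highlights in Figure 7 and negative attributions to the blue ones. The attributions sum to the difference between the score with all tokens visible and the score with none.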

5.7. Case study

A case study comparing our method and zero-shot Gemini-Pro on the POCT dataset is presented in Table 7. The case includes the NCT ID, the clinical trial report, the outcome label, the prediction outcome of zero-shot Gemini-Pro, the summarization text generated by Gemini-Pro, and the prediction outcome of our method. The clinical trial did not meet its primary endpoint. However, most of the content in the clinical trial report indicates that higher dose losmapimod reduced vascular inflammation in the most severely affected areas. This likely makes the outcome difficult for zero-shot Gemini-Pro to judge, leading it to incorrectly predict success. In contrast, our method filters out irrelevant information from clinical reports during summarization, thereby increasing the proportion of crucial information for predicting clinical trial outcomes. This allows our method to make the correct prediction in this case, illustrating its effectiveness in predicting clinical trial outcomes.

Table 7.
A case study of our method and zero-shot Gemini-Pro on the POCT dataset.

Data fields	Data values
NCT ID	NCT00633022
Clinical Trial Report	Objectives: This study sought to determine the effects of a p38 mitogen-activated protein kinase inhibitor, losmapimod, on vascular inflammation, by (18)F-fluorodeoxyglucose (FDG) positron emission tomography/computed tomography imaging.
	Background: The p38 mitogen-activated protein kinase cascade plays an important role in the initiation and progression of inflammatory diseases, including atherosclerosis.
	Methods: Patients with atherosclerosis on stable statin therapy (n $=$ 99) were randomized to receive losmapimod 7.5 mg once daily (lower dose [LD]), twice daily (higher dose [HD]) or placebo for 84 days. Vascular inflammation was assessed by FDG positron emission tomography/computed tomography imaging of the carotid arteries and aorta; analyses focused on the index vessel (the artery with the highest average maximum tissue-to-background ratio [TBR] at baseline). Serum inflammatory biomarkers and FDG uptake in visceral and subcutaneous fat were also measured.
	Results: The primary endpoint, change from baseline in average TBR across all segments in the index vessel, was not significantly different between HD and placebo ( $Δ$ TBR: $- 0.04$ [95% confidence interval [CI]: $- 0.14$ to $+ 0.06$ ], p $=$ 0.452) or LD and placebo ( $Δ$ TBR: $- 0.02$ [95% CI: $- 0.11$ to $+ 0.06$ ], p $=$ 0.579). However, there was a statistically significant reduction in average TBR in active segments (TBR $\geq$ 1.6) (HD vs. placebo: $Δ$ TBR: $- 0.10$ [95% CI: $- 0.19$ to $- 0.02$ ], p $=$ 0.0125; LD vs. placebo: $Δ$ TBR: $- 0.10$ [95% CI: $- 0.18$ to $- 0.02$ ], p $=$ 0.0194). The probability of a segment being active was also significantly reduced for HD when compared with placebo (OR: 0.57 [95% CI: 0.41 to 0.81], p $=$ 0.002). Within the HD group, reductions were observed in placebo-corrected inflammatory biomarkers including high-sensitivity C-reactive protein (% reduction: $- 28$ % [95% CI: $- 46$ to $- 5$ ], p $=$ 0.023) as well as FDG uptake in visceral fat ( $Δ$ SUV: $- 0.05$ [95% CI: $- 0.09$ to $- 0.01$ ], p $=$ 0.018), but not subcutaneous fat.
Label	Failure (Completed, Negative outcome/primary endpoint(s) not met)
Prediction outcome of Zero-shot Gemini-Pro	Success
Generated summarization text by Gemini-Pro	Summary: The study investigated the impact of losmapimod, a p38 mitogen-activated protein kinase inhibitor, on vascular inflammation using FDG PET/CT imaging. Patients with atherosclerosis (n $=$ 99) received losmapimod or placebo for 84 days. While the primary endpoint, change in average TBR across all vessel segments, did not significantly differ between losmapimod and placebo, a notable reduction in average TBR in active segments was observed with losmapimod. Additionally, losmapimod led to significant reductions in inflammatory biomarkers and FDG uptake in visceral fat, suggesting a systemic anti-inflammatory effect.
Prediction outcome of our method	Failure

6. Conclusion

Predicting clinical trial outcomes effectively has become a significant challenge in clinical research. To address this issue, publicly available clinical trial reports are associated with their corresponding clinical trials to build a new dataset, POCT. In addition, a new method based on a two-stage strategy is proposed for predicting clinical trial outcomes. The method utilizes a large language model to summarize clinical trial reports and trains a classifier on the generated summaries to predict clinical trial outcomes across all trial phases. Our method was compared with several baseline models, and the results show that it achieves the best performance in predicting general clinical trial outcomes, particularly when only a small amount of training data is available and the data is imbalanced. While our research demonstrates certain advantages, further refinements are possible. Clinical trial reports contain a wealth of intricate medical terminology, which presents significant challenges for model comprehension. Therefore, future research could benefit from integrating specialized medical knowledge bases to enhance understanding and performance.

Footnotes

Acknowledgements

The work is supported by grants from the National Natural Science Foundation of China (No. 62372189).

ORCID iD

Tianyong Hao

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Huang, Xiao, et al. HINT: hierarchical interaction network for clinical-trial-outcome predictions. Patterns 2022; 3: 10445.
2. Luo, Qiao, Glass, et al. Clinicalrisk: a new therapy-related clinical trial dataset for predicting trial status and failure reasons. In: Proceedings of the 32nd ACM international conference on information and knowledge management, 2023, pp.5356–5360.
3. Blass. Basic principles of drug discovery and development. Philadelphia, USA: Elsevier, 2015.
4. Bentley, Cressman, van der Hoek, et al. Conducting clinical trials–costs, impacts, and the value of clinical trials networks: a scoping review. Clin Trials 2019; 16: 183–193.
5. Friedman, Furberg, DeMets, et al. Fundamentals of clinical trials. North Bethesda, USA: Springer, 2015.
6. Heinemann, Huber, Meisel, et al. Reflection of successful anticancer drug development processes in the literature. Drug Discov Today 2016; 21: 1740–1744.
7. Munos, Niederreiter, Riccaboni. Improving the prediction of clinical success using machine learning. medRxiv 2021.
8. Follett, Geletta, Laugerman. Quantifying risk associated with clinical trial termination: a text mining approach. Inf Process Manag 2019; 56: 516–525.
9. Geletta, Follett, Laugerman. Latent dirichlet allocation in predicting clinical trial terminations. BMC Med Inform Decis Mak 2019; 19: 1–12.
10. Jin, Tan, Chen, et al. Predicting clinical trial results by implicit evidence integration. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), 2020, pp.1461–1477.
11. Katsimpras, Paliouras. Predicting intervention approval in clinical trials through multi-document summarization. In: Proceedings of the 60th annual meeting of the association for computational linguistics (Volume 1: Long Papers), 2022, pp.1947–1957.
12. Lehman, DeYoung, Barzilay, et al. Inferring which medical treatments work from reports of clinical trials. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, Volume 1 (Long and Short Papers), 2019, pp.3705–3717.
13. Marshall, Kuiper, Banner, et al. Automating biomedical evidence synthesis: Robotreviewer. In: Proceedings of the conference. Association for computational linguistics. Meeting, volume 2017, 2017, p.7. NIH Public Access.
14. Wang, Sun. Trial2vec: zero-shot clinical trial document similarity search using self-supervision. In: 2022 Findings of the association for computational linguistics: EMNLP 2022, 2022.
15. Gayvert, Madhukar, Elemento. A data-driven approach to predicting successes and failures of clinical trials. Cell Chem Biol 2016; 23: 1294–1301.
16. Siah, Kelley, Ballerstedt, et al. Predicting drug approvals: the novartis data science and artificial intelligence challenge. Patterns 2021; 2: 100312.
17. Levy, Micheel, et al. Identifying the status of genetic lesions in cancer clinical trial documents using machine learning. BMC Genom 2012; 13: 1–9.
18. Gao, Xiao, Glass, et al. Compose: cross-modal pseudo-siamese network for patient trial matching. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 2020, pp.803–812.
19. Zhang, Xiao, Glass, et al. Deepenroll: patient-trial matching with deep embedding and entailment prediction. In: Proceedings of the web conference 2020, 2020b, pp.1029–1037.
20. Tang. Predicting phase 3 clinical trial results by modeling phase 2 clinical trial subject level data using deep learning. In: Machine learning for healthcare conference, 2019, pp.288–303. PMLR.
21. Siah, Wong. Machine learning with statistical imputation for predicting drug approvals. Available at SSRN 2973611.
22. Liang, Tsou C-H. A novel system for extractive clinical note summarization using EHR data. NAACL HLT 2019 2019: 46–54.
23. Lins, Oliveira, Cabral, et al. The CNN-Corpus: a large textual corpus for single-document extractive summarization. In: Proceedings of the ACM symposium on document engineering 2019, 2019, pp.1–10.
24. Sosea, Zhan, et al. Unsupervised extractive summarization of emotion triggers. In: Proceedings of the 61st annual meeting of the association for computational linguistics (Volume 1: Long Papers), 2023, pp.9550–9569. Association for Computational Linguistics.
25. Allahyari, Pouriyeh, Assefi, et al. Text summarization techniques: a brief survey. Int J Adv Comput Sci Appl 2017; 8: 397–405.
26. Chern, Wang, Das, et al. Improving factuality of abstractive summarization via contrastive reward learning. In: Proceedings of the 3rd workshop on trustworthy natural language processing (TrustNLP 2023), 2023, pp.55–60.
27. Wang, Demberg. Incorporating distributions of discourse structure for long document abstractive summarization. In: Proceedings of the 61st annual meeting of the association for computational linguistics (Volume 1: Long Papers), 2023, pp.5574–5590. Association for Computational Linguistics.
28. Lewis, Liu, Goyal, et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp.7871–7880.
29. Radford, Child, et al. Language models are unsupervised multitask learners. OpenAI Blog 2019; 1: 9.
30. Nentidis. Overview of bioasq 2021: the ninth bioasq challenge on large-scale biomedical semantic indexing and question answering. In: Experimental IR meets multilinguality, multimodality, and interaction: 12th International conference of the CLEF association, CLEF 2021, virtual event, September 21–24, 2021, proceedings, volume 12880, 2021, pp.239. Springer Nature.
31. Wallace, Saha, Soboczenski, et al. Generating (factual?) narrative summaries of rcts: experiments with neural multi-document summarization. AMIA Summit Transl Sci Proc 2021; 2021: 605.
32. Team, Anil, Borgeaud, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
33. Lee, Dernoncourt. Sequential short-text classification with recurrent and convolutional neural networks. In: Proceedings of NAACL-HLT, 2016, pp.515–520.
34. Wang, Zhang, et al. Combining knowledge with deep convolutional neural networks for short text classification. In: Proceedings of the 26th international joint conference on artificial intelligence, 2017, pp.2915–2921.
35. Kenton JDM-WC, Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, 2019, pp.4171–4186.
36. Liu, Ott, Goyal, et al. ROBERTA: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
37. Sun, Qiu, et al. How to fine-tune bert for text classification? In: Chinese computational linguistics: 18th China national conference, CCL 2019, kunming, China, October 18–20, 2019, proceedings 18, 2019, pp.194–206. Springer.
38. Wang, et al. Multi-passage bert: A globally normalized bert model for open-domain question answering. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), 2019, pp.5878–5882.
39. Beltagy, Peters, Cohan. Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
40. Vaswani, Shazeer, Parmar, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30: 5998–6008.
41. Liu, Gao, et al. Deberta: decoding-enhanced bert with disentangled attention. In: International conference on learning representations, 2020.
42. Zhang, Zhao, Saleh, et al. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In: International conference on machine learning, 2020a, pp.11328–11339. PMLR.
43. Paszke, Gross, Massa, et al. Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 2019; 32: 8024–8035.
44. Loshchilov, Hutter. Fixing weight decay regularization in adam, 2018.
