Statistical challenges in assessing potential efficacy of complex interventions in pilot or feasibility studies

Abstract

Early phase trials of complex interventions currently focus on assessing the feasibility of a large randomised controlled trial and on conducting pilot work. Assessing the efficacy of the proposed intervention is generally discouraged, due to concerns about underpowered hypothesis testing. In contrast, early assessment of efficacy is common for drug therapies, where phase II trials are often used as a screening mechanism to identify promising treatments. In this paper, we outline the challenges encountered in extending ideas developed in the phase II drug trial literature to the complex intervention setting. The prevalence of multiple endpoints and clustering of outcome data are identified as important considerations, having implications for timely and robust determination of optimal trial design parameters. The potential for Bayesian methods to help to identify robust trial designs and optimal decision rules is also explored.

Keywords

Complex interventions, pilot, feasibility, phase II, sample size

1 Introduction

Complex interventions contain several distinct and potentially interacting components, each of which may contribute to the efficacy of an intervention as a whole.¹ For example, psychotherapy may be viewed as being composed of two treatment variables, namely techniques described in a therapy manual together with a therapist delivering these techniques.² This contrasts with typical drug treatments, where the drug is the only treatment variable to consider. While drug regimens may be complex,³ randomisation and blinding allow the effects of a drug to be separated from the context in which it is provided. Broad classes of complex interventions include surgical, behavioural, psychological, educational and physical interventions. The evaluation of a complex intervention raises specific challenges, and several frameworks have therefore been proposed to guide this process. These include a widely used framework proposed by the MRC;^1,4 the IDEAL initiative aimed at surgical interventions;⁵ and the Multiphase Optimisation STrategy (MOST).^6,7 The most recent MRC framework is summarised in Figure 1.

Figure 1.

Current MRC guidance on the development and evaluation of complex interventions, adapted from Craig et al.¹

As shown in Figure 1, ‘feasibility and piloting’ is identified as one of four key stages in the development and evaluation of complex interventions. While the definitions of, and distinctions between, feasibility and pilot studies are not always clear,^8–10 the MRC guidance states that the purpose of this stage is to inform the design of a subsequent large, definitive trial assessing the effectiveness of the intervention. Several parameters required for designing the definitive trial may be estimated at this stage, including the variance of the proposed outcome measure(s), recruitment and follow-up rates, and intra-class correlation coefficients (ICCs) in trials with clustering effects. Characteristics relating more directly to the intervention, such as its acceptability and the level of adherence, may also be assessed. Gathering information relating to these factors reduces the likelihood of the large trial failing due to poor design.

While MRC guidance recommends evaluating a complex intervention following feasibility or pilot work, in practice it is not uncommon for feasibility or pilot studies to include evaluation through hypothesis testing. For example, a recent review found that 21 of 26 feasibility and pilot studies surveyed included a hypothesis test.⁸ However, the size of these studies is often derived using generic rules of thumb^11,12 rather than through formal power calculations, with the review finding that only 9 of the 26 studies reported a sample size calculation.⁸ As a result, hypothesis tests are likely to be underpowered¹³ and the typical recommendation is that such tests should be de-emphasised, interpreted with extreme caution, or avoided altogether.^8,13–15

It could be argued, however, that a formal assessment of potential efficacy or activity should be carried out in pilot and feasibility studies, and that such studies should be properly designed to address this objective. In this manner, feasibility or pilot work could not only ensure that subsequent large scale randomised controlled trials (RCTs) of complex interventions are well designed, but could also reduce the rate at which such trials fail due to an inherently ineffective intervention. Moreover, this approach would clearly be more efficient than conducting a feasibility or pilot study and then a separate study assessing only efficacy. To begin developing such designs, one may look to methods developed in the drug setting. There, small ‘phase II’ trials which focus on making an early assessment of efficacy, identifying promising and discarding unpromising drug treatments, are commonplace.

The application of phase II designs to the complex intervention setting is not straightforward due to challenges that are commonly encountered in complex intervention trials. For example, the assumption that patient outcomes are statistically independent is often violated as a consequence of cluster randomisation,¹⁶ a group-based intervention¹⁷ or therapist variation.¹⁸ The associated implications for precision are compounded in cases where only a small number of clusters are available, as is often the case in feasibility or pilot studies.¹⁹ Furthermore, the multi-component nature of complex interventions will mean that an assessment of efficacy will often have to be based on multiple endpoints,¹ in contrast to the single binary indicator of ‘success’ often used in phase II studies.

An example which serves to illustrate each of these points is the OK-Diabetes (OK-D) feasibility trial of a supported self-management intervention for adults with type II diabetes and learning disabilities.²⁰ The OK-D study involves first developing a manualised intervention and then carrying out a randomised feasibility study whose objectives include estimation of recruitment rates, testing of data collection forms and assessment of the feasibility of delivering the intervention. While the feasibility study individually randomises treatment packages to patients, diabetes specialist nurses provide the intervention and may, therefore, induce a clustering effect in the intervention arm. Furthermore, as the intervention is newly developed only two nurses will be involved. The intervention is targeted at three aspects of poor diabetes self-management, and as a result there are a number of possible outcomes to be considered when assessing efficacy.

It has been proposed that the OK-D feasibility trial be extended to allow for a preliminary assessment of the efficacy of the developed intervention as a formal objective, highlighting the need for appropriate trial design methodology. In this paper, we will review the approach to assessing efficacy developed in the context of phase II trials for drug therapies, setting out the key methodological challenges to their application in feasibility and pilot studies of complex interventions, and thereby outlining future directions for methodological research. In section 2, an overview of phase II designs and their key characteristics will be provided. In section 3, multiple endpoints and clustering will be discussed in detail, considering the formulation of decision rules, difficulties arising from nuisance parameters, and practical difficulties in determining sample size in a timely manner. Finally, in section 4 conclusions are drawn and further avenues for future research are suggested.

2 Efficacy evaluation in oncology drug trials

Following the determination of a safe dose in phase I, but before a definitive RCT in phase III, phase II trials typically act as a screening mechanism, eliminating ineffective drugs at an early stage and progressing only the most promising treatments to phase III. Phase II designs tend to employ a decision-focussed approach to inference, with an emphasis on determining if a subsequent phase III trial is warranted as opposed to estimation of underlying parameters. This approach is typically sustained through the use of Neyman–Pearson hypothesis testing or, alternatively, through Bayesian decision-theoretic methods.²¹

In the case of hypothesis testing, trial design focusses on ensuring type I and II error rates remain within pre-specified nominal bounds. Perhaps the simplest phase II design to employ hypothesis testing for a single binary outcome was proposed by Fleming,²² then extended by A’Hern²³ from an approximate to an exact test. To use the design, one must first specify a success rate p₀ which, if true, would mean the new intervention would not be worthy of further investigation. An alternative hypothesis p_A must then be given, corresponding to a success rate which would certainly merit a full evaluation in a definitive RCT. Applying this to the OK-D problem, we could set $p_{0} = 0.05$ and $p_{A} = 0.2$. The A’Hern design for this problem, guaranteeing a type I error rate of 5% or less and a power of at least 90%, would be a single-arm trial with a sample size of n = 38. Decision making at the end of the trial is then based on counting the number of successes observed, denoted s, and comparing this with the design-derived cut-off point c. In this example, if $s \geq c = 5$ the intervention should proceed to a definitive RCT, otherwise its evaluation should be terminated. With n = 38, the probability of observing five or more successes is approximately 0.04 under p₀ and 0.90 under p_A, whereas a cut-off of four successes would give a type I error rate of approximately 0.12.
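The search for an exact single-stage design of this kind can be sketched in a few lines. The following is a minimal illustration rather than the procedure published by A’Hern: it assumes the convention that the trial proceeds when the number of successes reaches the cut-off, and the function names `binom_tail` and `exact_single_stage` are hypothetical.

```python
from math import comb


def binom_tail(n: int, c: int, p: float) -> float:
    """P(X >= c) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(c, n + 1))


def exact_single_stage(p0, pA, alpha, power, n_max=200):
    """Smallest n (with its cut-off c) such that 'proceed if s >= c' meets both bounds."""
    for n in range(1, n_max + 1):
        # Smallest cut-off whose type I error rate is within the nominal bound.
        # c = n + 1 always qualifies (empty tail), so the search cannot fail.
        c = next(c for c in range(n + 2) if binom_tail(n, c, p0) <= alpha)
        if binom_tail(n, c, pA) >= power:
            return n, c
    return None  # no design found up to n_max


# Design parameters taken from the OK-D illustration in the text.
design = exact_single_stage(p0=0.05, pA=0.2, alpha=0.05, power=0.90)
print(design)
```

Because the binomial distribution is discrete, the attained type I error rate is usually somewhat below the nominal bound, which is why the search checks the power separately for each candidate sample size.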

A wide range of alternative phase II designs have been published, accounting for the variety of problems to which they may be applied.²⁴ Only a brief overview of the main differences between designs, with references to examples, is considered here. One point of differentiation between the designs is in the number of stages. While the A’Hern²³ design described above involved a single decision point, designs such as those proposed by Simon²⁵ include an additional interim analysis to allow for the phase II trial to terminate early due to futility. Single-arm designs may be contrasted with randomised designs,²⁶ which allow a concurrent as opposed to historical control to be used. Multi-arm designs, for cases where multiple treatments are available for evaluation at once, have also been described.²⁷ While the majority of phase II designs focus on a single endpoint relating to efficacy, several have been proposed which can consider additional measures relating to, for example, toxicity^28,29 or further aspects of efficacy.³⁰

In addition to hypothesis testing designs, there are also a number which adopt a Bayesian framework. These vary in the extent to which Bayesian methodology is employed, from allowing some prior information to be incorporated in the form of probability distributions³¹ to full decision-theoretic frameworks.^32,33 The multitude of designs available requires a thorough assessment of the key design criteria specific to the trial in question, to ensure an appropriate design is selected.

3 Efficacy evaluation in complex intervention trials

When applying ideas from phase II trials in an early phase complex intervention setting, it is important to take account of complexities relating to (i) prevalence of multiple endpoints and (ii) recruitment- and treatment-related clustering effects.

3.1 Multiple endpoints

Multiple endpoints, on which the decision of proceeding to phase III should be based, arise for several reasons. In addition to an assessment of efficacy requiring several endpoints, due to the multi-component nature of the intervention, endpoints relating to safety, acceptability and adherence are often required. Further to these, endpoints relating to the feasibility of a phase III trial, such as measurements of the recruitment and follow-up rates, should be taken into account. Thus, while phase II drug trials are not always limited to a single endpoint, early phase evaluations of complex interventions may routinely involve more.

As an example, the original design of the OK-D feasibility study included three feasibility criteria which were to be met to consider progression to phase III. These took the form of threshold values of recruitment rate, numbers lost to follow up, and adherence of participants in the intervention arm. In addition to these three endpoints, a further four endpoints were of interest in terms of assessing efficacy. Specifically, continuous measurements of glycated haemoglobin (HbA1c), blood pressure, total cholesterol and body mass index (BMI) were all proposed as potential efficacy endpoints, with no single one anticipated to be sensitive to all components of the intervention.

3.1.1 Decision rules

In the single-endpoint case, a decision rule regarding progression to a phase III trial can be defined by a single cut-off point, as was illustrated in the example in section 2. Where several endpoints are present, specifying the form of the decision rule for progression to phase III becomes more complex. This problem has been addressed to a limited extent in the drug setting. In the case of two binary endpoints describing efficacy and toxicity, phase II designs such as that of Bryant and Day²⁸ consider separate null and alternative hypotheses for the two endpoints, resulting in four ‘states of nature’. Specifically, defining ‘unacceptable’ and ‘acceptable’ levels of both efficacy and toxicity as $p_{E 0}, p_{E 1}, p_{T 0}, p_{T 1},$ the four states are defined as $H_{ij} : p_{E} = p_{Ei}, p_{T} = p_{Tj}$ for i,j = 0,1. The design aims to ensure that the probability of rejecting the drug when it has satisfactory efficacy and toxicity, i.e. the type II error, remains within a nominal bound. Two separate type I errors are also kept within nominal bounds, relating to the probability of proceeding to phase III with an ineffective or a toxic drug. The resulting design specifies a cut-off point for each endpoint, both of which must be reached for the drug to proceed to phase III. This can be illustrated graphically as an acceptance region, as shown in Figure 2(a).

Figure 2.

Example decision rules arising from phase II designs for two endpoints, where the shaded area of the sample space represents the decision to progress to phase III: (a) $x \geq C_{x}$ and $y \geq C_{y}$; (b) $x \geq C_{x}$ or $y \geq C_{y}$.

In cases where two binary measures of efficacy are of interest, phase II designs such as that of Sill et al.³⁰ employ a rule whereby we proceed to phase III if either quantity reaches the specified cut-off point. The form of the resulting acceptance region is illustrated in Figure 2(b). Again, the form of this rule dictates the possible types of errors, with a single type I error rate in this case and two type II error rates. One advantage of these decision rules is the ability to discriminate between endpoints through their nominal error rates. For example, in the case of the Bryant and Day design, progressing to phase III with a toxic treatment may be considered more of a risk than progressing with an ineffective treatment, and so the nominal type I error rate relating to toxicity could be set to a lower level to ensure this error is less likely. Similarly, in the case of two efficacy endpoints and the Sill et al. design, one could set the nominal type II error of the preferred endpoint to be lower than the other to ensure the trial will be more likely to detect a treatment which is promising in this respect.
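The way in which the form of the rule dictates the error probabilities can be illustrated numerically. The sketch below is not the design of Bryant and Day or of Sill et al.; it assumes, purely for simplicity, that the two binary endpoints are statistically independent and measured on the same n patients, and the sample size, cut-offs and response rates are hypothetical values.

```python
from math import comb


def tail(n, c, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(c, n + 1))


def prob_go(n, cx, cy, px, py, rule):
    """P(progress to phase III) for independent X ~ Bin(n, px), Y ~ Bin(n, py)."""
    gx, gy = tail(n, cx, px), tail(n, cy, py)
    if rule == "and":   # both cut-offs must be reached, as in Figure 2(a)
        return gx * gy
    if rule == "or":    # either cut-off suffices, as in Figure 2(b)
        return gx + gy - gx * gy
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical design: 38 patients, cut-off of 5 on each endpoint.
for px, py in [(0.05, 0.05), (0.2, 0.05), (0.2, 0.2)]:
    for rule in ("and", "or"):
        print(px, py, rule, round(prob_go(38, 5, 5, px, py, rule), 3))
```

Under the 'and' rule, a treatment that is promising on only one endpoint is rarely taken forward, whereas under the 'or' rule the chance of proceeding when neither endpoint is promising is roughly doubled relative to a single-endpoint test. Correlation between endpoints, which this sketch ignores, would alter these probabilities.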

Beyond this use of nominal error rates, designs such as those of Bryant and Day²⁸ and Sill et al.³⁰ do not provide any means with which to describe any relative preferences between different qualities of the treatment. With multiple endpoints to consider, it is possible that the decision to progress to phase III could involve trading off one aspect of the treatment against another. For example, one may be happy to accept a slightly toxic treatment if it demonstrated substantial efficacy, but not if efficacy was only moderate. Phase II designs allowing for such a trade-off were proposed by Conaway and Petroni.^34,35 Considering binary efficacy and toxicity endpoints with parameters p_E and p_T, the authors propose dividing the parameter space $0 \leq p_{E}, p_{T} \leq 1$ into two complementary subspaces defining the null and alternative hypotheses. They propose a statistical test based on the ‘I-divergence’ measure,³⁶ with the statistic being analogous to the distance of the observed sample from the null hypothesis subspace. Type I and II error rates are defined (the latter with respect to a point alternative hypothesis), and the choice of sample size made to ensure error rates remain within pre-specified bounds. It is noted that the method may be applied to general specifications of the null hypothesis space, and it is suggested that future research consider extending the design to allow for more general loss functions than the 0–1 loss implicit in the proposed method. While providing more flexibility when specifying trade-offs between endpoints, in comparison to the design of Bryant and Day²⁸ this design has been shown to lack robustness to misspecification of the degree of correlation between them.³⁷

An alternative approach to acknowledging the presence of multiple endpoints is proposed by Sargent et al.³⁸ In this phase II design, the decision space related to the trial is expanded from {stop, go} to include a third, intermediate decision. Considering an explicit primary endpoint, if the corresponding observations are strong enough (in either direction) the trial will lead to one of stop or go. If the observations are less conclusive, it is suggested that the decision should now be made by considering other endpoints of interest. This design therefore provides a formal mechanism to allow for the inclusion of more than one endpoint without requiring any specification of their nature or relationship to one another at the design stage. All that is assumed is that a partial ordering of preference exists, with the primary endpoint considered more important than all other endpoints. As such, it represents a flexible methodology which could be applied to the complex intervention setting where many endpoints are of interest. The extra complexity of the decision rule does require that two additional nominal probabilities, relating to the minimum probability of making correct decisions under the null and alternative hypotheses, are specified. By way of illustration, in the same setting as that described in section 2 (i.e. $p_{0} = 0.05$, $p_{A} = 0.2$, nominal type I error 0.05 and 90% power), a design which guarantees correct decision rates of 0.8 would specify a total of 27 participants. If $s \leq 2$ the decision is made to stop, while if $s \geq 4$ the decision is made to proceed. If $2 < s < 4$ (that is, $s = 3$) an ‘inconclusive’ decision is made based on the primary endpoint, and additional endpoints are considered.
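The operating characteristics of this three-outcome rule can be checked directly from the binomial distribution. The sketch below assumes one reading of the constraints, namely that the probability of 'go' under p₀ is at most 0.05, the probability of 'stop' under p_A is at most 0.10, and the probability of the correct decision under each hypothesis is at least 0.8; the function name `three_outcome_probs` is hypothetical.

```python
from math import comb


def pmf(n, s, p):
    """P(X = s) for X ~ Binomial(n, p)."""
    return comb(n, s) * p**s * (1 - p) ** (n - s)


def three_outcome_probs(n, stop_max, go_min, p):
    """Return (P(stop), P(inconclusive), P(go)) when s ~ Binomial(n, p)."""
    stop = sum(pmf(n, s, p) for s in range(0, stop_max + 1))
    go = sum(pmf(n, s, p) for s in range(go_min, n + 1))
    return stop, 1.0 - stop - go, go


# Rule from the text: n = 27, stop if s <= 2, go if s >= 4.
null = three_outcome_probs(27, stop_max=2, go_min=4, p=0.05)
alt = three_outcome_probs(27, stop_max=2, go_min=4, p=0.20)
print("under p0:", [round(x, 3) for x in null])
print("under pA:", [round(x, 3) for x in alt])
```

The inconclusive region absorbs probability from both tails, which is how the design attains a smaller sample size than the two-outcome design while still bounding the probability of each outright wrong decision.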

A further option which should be considered as a means with which to effectively address the challenge of multiple endpoints is to use a Bayesian decision-theoretic framework, as employed by Stallard et al.²⁹ and others in the drug context. This involves the specification of a utility function $u(d, \theta)$ which assigns a quantitative value to each possible decision d under every state of nature θ. For example, consider the case of two binary endpoints relating to toxicity (p_T) and efficacy (p_E), as discussed by Bryant and Day.²⁸ We now wish to assign a utility to each of the decisions {stop, go} for each value of $(p_{T}, p_{E}) \in [0, 1]^{2}$. If beliefs regarding the likely values of parameters p_T and p_E can be specified through probability distributions, it is possible to calculate the expected utility of any decision d by averaging the utility function over the parameter space. Then, when faced with deciding whether or not the intervention should progress to phase III, the decision with maximum expected utility (MEU) can be selected. The same MEU principle can be applied when determining the sample size of the trial in question. In order to do so, prior distributions on the parameters of interest must be elicited, after which the trial design which maximises expected utility over all possible trial outcomes can be found.³⁹ In the context of the present discussion, the specification of a utility provides a highly flexible means with which to encode the preferences of the decision maker(s). Allowing us to explicitly quantify any acceptable trade-offs between different endpoints, this approach will lead to decisions which are optimal with respect to these preferences.

Whilst the specification of an appropriate subjective utility function may be difficult,⁴⁰ it should be emphasised that frequentist trial design can also involve subjective judgement when selecting nominal error rates, and that this may be less intuitive than the Bayesian alternative.⁴¹ Moreover, recent examples such as the early phase drug trial described by Thall et al.⁴² demonstrate the feasibility of employing Bayesian decision-theoretic designs in practice. Where it is not feasible to specify a utility function, alternative Bayesian methods for sample size determination are available.⁴³
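The MEU calculation for the stop/go decision can be sketched as follows. Everything specific here is an assumption made for illustration: the utility function and its coefficients are invented, the two endpoints are treated as independent, and uniform Beta(1, 1) priors are used so that the posteriors are Beta distributions. It is not the utility or model of Stallard et al.

```python
import random


def utility(decision, p_e, p_t):
    """Hypothetical utility u(d, theta) with theta = (p_e, p_t)."""
    if decision == "stop":
        return 0.0
    # 'go' pays off with efficacy; toxicity is penalised less when efficacy is high.
    return 10.0 * p_e - 6.0 * p_t * (1.0 - p_e) - 2.0


def expected_utilities(succ_e, succ_t, n, prior=(1.0, 1.0), draws=20000, seed=1):
    """Monte Carlo expected utility of each decision under independent Beta posteriors."""
    rng = random.Random(seed)
    a, b = prior
    totals = {"stop": 0.0, "go": 0.0}
    for _ in range(draws):
        p_e = rng.betavariate(a + succ_e, b + n - succ_e)  # posterior draw, efficacy
        p_t = rng.betavariate(a + succ_t, b + n - succ_t)  # posterior draw, toxicity
        for d in totals:
            totals[d] += utility(d, p_e, p_t)
    return {d: total / draws for d, total in totals.items()}


def meu_decision(succ_e, succ_t, n):
    """Select the decision with maximum expected utility given the observed counts."""
    eu = expected_utilities(succ_e, succ_t, n)
    return max(eu, key=eu.get)


print(meu_decision(succ_e=15, succ_t=2, n=38))
print(meu_decision(succ_e=2, succ_t=15, n=38))
```

No cut-off points appear anywhere in this sketch: the trade-off between efficacy and toxicity is carried entirely by the utility function, so a different set of preferences changes only `utility`.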

3.1.2 Trial specification

A further difficulty arising from the use of multiple endpoints is encountered when setting the specific design parameters for the trial. As illustrated in Figure 2, each additional endpoint can increase the dimension of the space in which the decision rule is defined. Accordingly, the number of potential decision rules which could be considered can increase. This feature can be seen in the phase II context when comparing the two-stage design of Simon,²⁵ which accounts for a single endpoint, with its extension to the two-endpoint setting proposed by Bryant and Day.²⁸ In that case, given a proposed maximum sample size for each stage of the trial, n, the two-endpoint design will have a factor of $n^{2}$ more possible parameterisations than the single-endpoint design. As a result, the task of finding the specific ‘best’ parameterisation becomes more demanding and less amenable to simple, exhaustive search methods. This point is noted by Sill et al.,³⁰ who propose heuristic methods to find good parameterisations of their two-endpoint phase II design.

In the complex intervention setting, the presence of several endpoints will compound this difficulty and lead to more sophisticated optimisation routines being required as standard. The design of any such algorithms will be strongly influenced by the nature of the endpoints considered. Binary endpoints will lead to integer trial design parameters (e.g. the threshold number of observed successes), whereas continuous endpoints will lead to continuous design parameters (e.g. the threshold of a t-test). Optimisation algorithms are typically tailored to specific problem types,⁴⁴ and so different methods will generally be required to solve different problem types efficiently. Metaheuristic algorithms such as genetic optimisation, as implemented in the R package ‘rgenoud’,⁴⁵ may provide a flexible solution methodology to address this difficulty, requiring only the tuning of algorithm parameters to ensure good performance.

Where a Bayesian decision-theoretic framework is employed, a decision rule does not have to be specified in advance. The aforementioned method of MEU does not require one,³⁹ instead determining the decision by choosing that which, conditional on the observed data, has greatest expected utility. As a result, when determining the best specification for a trial one will not need to explore different decision rules. The addition of further endpoints will therefore not lead to a more challenging trial design problem, in contrast with some frequentist cases.

3.2 Clustering

Clustering is a common feature of complex intervention trials and may arise with or without cluster randomisation.⁴⁶ For example, while the OK-D feasibility trial is individually randomised, the assumption that patient outcomes are independent is questionable. This is due to the fact that, in the intervention arm of the trial, each participant is allocated to one of a limited number of trained research nurses whose role is to provide support in the delivery of the intervention. The study design is summarised in Figure 3.

Figure 3.

Clustering within the OK-D feasibility study, where patients are randomised to intervention or control and those within the intervention arm are allocated nurses.

The OK-D study may be described as having an individually randomised, two level, partially nested hierarchical design and is one of many possible scenarios where one or more sources of clustering are present.¹⁷ By partially nested, we refer to the fact that clustering by research nurse is present in only one of the two arms, and by hierarchical we mean that there is a single research nurse per patient. More generally, the relationship between clusters and patients may be hierarchical, cross-classified (where patients are allocated to more than one type of cluster) or multiple-membership (where patients are allocated to more than one cluster of the same type). In terms of the relationship between treatments and clusters, this could be described as partially or fully nested, partially or fully crossed, or a mix of these for trials with more arms.¹⁷ Specifically, nested designs have different clusters in each arm. For example, Schnurr et al.⁴⁷ describe a nested trial comparing Prolonged Exposure to Present-Centred Therapy for women with Posttraumatic Stress Disorder, where each therapist delivered only one of the treatments. Crossed designs have different arms associated with the same clusters.⁴⁸ Cohen and Mannarino⁴⁹ describe one such trial, comparing Cognitive Behavioural Therapy with Nondirective Supportive Therapy for sexually abused children, where therapists delivered both treatments.

In seeking to apply a phase II design to a problem where clustering is present, the simplest approach would be to ignore the clustering and apply the design ‘off-the-shelf’ without any modification. However, this can lead to inaccurate estimates of the type I error rate of any proposed trial⁵⁰ with the actual rate being higher than that calculated when designing the trial. As such, this approach would lead to ineffective interventions being taken forward for further evaluation in a phase III trial. A phase II design could be extended to account for clustering by including fixed cluster effects. However, such an analysis would imply a restricted focus on the specific clusters considered in that trial, preventing any generalisation to a wider population. In the case of the OK-D feasibility study, this would correspond to restricting attention to only those nurse therapists participating in the experiment, as opposed to considering the larger population of therapists from which they are ‘sampled’.^19,51,52 While it has been argued that this perspective is appropriate in the early phase of development,⁵¹ it is possible to improve the generalisability of the analysis by using random cluster effects rather than fixed. This approach has been recommended to account for clustering in individually randomised trials,^46,53 but will lead to a more complex linear mixed effects model.

3.2.1 Complex likelihoods

The hypothesis testing approach typical of phase II trials requires the specification of a test statistic and the derivation of that statistic’s sampling distribution under the null and alternative hypotheses. Given analytical formulae describing these distributions, error rates for any decision rule can then be found by examining their tail areas. This approach is feasible in cases such as those considered by Fleming²² and A’Hern,²³ where the distribution of the test statistic (a count of binary ‘successes’) is simply the binomial distribution. In multilevel statistical models, as found in trials where clustering is present, statistics such as a mean difference in a linear mixed effects model fitted by maximum likelihood will not necessarily have known analytical sampling distributions,^54,55 particularly in our setting where low sample sizes preclude the use of asymptotic results.⁵⁶

When analytical results describing the sampling distribution of the test statistic are not available, Monte Carlo simulation may be employed to estimate type I and II error rates.^55,57 This involves simulating a number of hypothetical data sets according to a population model which corresponds to either the null or alternative hypothesis and, for each data set, calculating the test statistic. Implementing the proposed decision rule, the resulting action can be compared with the hypothesis used to generate the data and any error, type I or II, counted. This general technique is highly flexible. It can be applied to almost any multilevel structure encountered in practice,^58,59 using any proposed statistic in the analysis. However, this flexibility comes at the expense of a computational burden. The Monte Carlo method can require a significant amount of CPU time in order to perform enough simulations to provide an accurate estimate of error rates. The binary nature of both type I and II errors implies that the width of a confidence interval around an estimated error rate will decrease at a rate of $K / \sqrt{r}$ for a constant K and r simulations. For example, to ensure a 95% confidence interval of $\pm 0.005$ around an estimated type II error rate of 0.2, one would require r = 24,586 simulation runs. In practice, this may impose a limit on the number of trial specifications which can be considered and evaluated before one is chosen.
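Both the simulation procedure and the run-count calculation can be sketched briefly. The example below is illustrative only and makes several assumptions not taken from the text: a single-arm trial with a binary outcome, cluster-specific success rates drawn from a beta distribution (so that the ICC on the proportion scale equals the stated value), two clusters of 19 patients, and a hypothetical cut-off of five successes. The run count uses the usual normal approximation to a binomial proportion.

```python
import random
from math import ceil
from statistics import NormalDist


def required_runs(rate, half_width, level=0.95):
    """Runs needed for a normal-approximation CI of the given half-width."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return ceil(z**2 * rate * (1.0 - rate) / half_width**2)


def simulate_go_rate(p, icc, clusters, per_cluster, cutoff, runs, seed=0):
    """Estimate P(total successes >= cutoff) with beta-binomial cluster effects."""
    rng = random.Random(seed)
    go = 0
    for _ in range(runs):
        successes = 0
        for _ in range(clusters):
            if icc > 0:
                # Cluster-specific success rate with mean p and the requested ICC.
                scale = (1.0 - icc) / icc
                p_k = rng.betavariate(p * scale, (1.0 - p) * scale)
            else:
                p_k = p  # no clustering: every patient shares the same rate
            successes += sum(rng.random() < p_k for _ in range(per_cluster))
        go += successes >= cutoff
    return go / runs


# Type I error rate of 'go if s >= 5' under p0 = 0.05, with and without clustering.
independent = simulate_go_rate(0.05, 0.0, clusters=2, per_cluster=19, cutoff=5, runs=10000)
clustered = simulate_go_rate(0.05, 0.1, clusters=2, per_cluster=19, cutoff=5, runs=10000)
print(independent, clustered)
print(required_runs(rate=0.2, half_width=0.005))
```

Since the half-width enters the run count as an inverse square, tightening the interval by a factor of ten multiplies the number of runs by one hundred, which is the source of the computational burden when many candidate designs must be evaluated.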

The computational burden of simulations may be reduced through their implementation in efficient programming languages such as C++. However, it has been argued that the resulting lack of transparency and difficulties in interpretation, in comparison to popular statistical programming packages such as R, should be taken into account when considering this option.⁶⁰ Alternatively, one may expedite the process of selecting an appropriate sample size by simplifying the problem. This technique is used in the freely available MLPowSim⁵⁸ software, which identifies a sensible choice of sample size by calculating the power of a restricted grid of designs, incrementing sample size parameters such as the number of clusters and the number of patients per cluster in large steps. By not considering every possible combination of sample size parameters, precision is sacrificed for speed. In the Stata routine SimSam,⁵⁹ the problem is simplified by assuming all but one sample size parameters are known and fixed. Using heuristics to increase the efficiency of the search process, the optimal value of the remaining parameter (e.g. the number of patients per cluster) can be found in a timely manner. An alternative approach would be to use optimisation algorithms which employ surrogate models, such as efficient global optimisation (EGO)⁶¹ and its variants, to search over the full space of sample size configurations. These algorithms rely on fitting models, such as Gaussian processes, to the simulated data obtained over a limited number of initial sample size configurations. Optimisation then takes place over the surrogate model, increasing efficiency as each evaluation now requires a simple calculation as opposed to a full simulation process. As these algorithms and their components have been implemented in R packages⁶² and C++ libraries,⁶³ they can be employed for this purpose without significant difficulty.

The simulation approach may be difficult to implement in cases where ‘nuisance parameters’ are present in the statistical model. This will often be the case where clustering is present. For example, in a fully nested design one would require a value for the ICC to be used in the population model when generating the data at each step. While it has become increasingly common for ICCs to be reported in the results of trials,⁶⁴ the early phase context of feasibility and pilot studies implies that little information will be available for the intervention in question. Indeed, gathering information to inform future estimates of ICCs is a common objective of feasibility studies.⁸ Thus, calculations of error rates may be dependent on parameter estimates in which there is significant uncertainty. The effect of such uncertainty in ICC estimates on type II error rates and required sample size has been shown to be considerable.^65,66 One approach to address this difficulty would be to carry out a sensitivity analysis, using several values of the nuisance parameter covering an appropriate range in order to identify its effect.⁵⁵ However, this would further contribute to the computational burden of the simulation approach.

In cases where some information regarding the likely values of nuisance parameters is available, a Bayesian approach would allow for this to be included formally via prior probability distributions.⁴³ This would fit naturally into the simulation method described thus far, allowing the data generated by the population model to encapsulate uncertainty in the nuisance parameters, leading to more robust estimates of error rates. In the case of ICCs in cluster randomised trials, the use of a prior distribution has been shown to significantly affect both design^67,68 and analysis.^65,66 In addition to acknowledging uncertainty in parameters, a Bayesian approach will also facilitate the incorporation of information from other sources. Recent methodology has been developed to allow for the weighting given to such prior beliefs to be adaptively changed in response to the data observed in the current trial,⁶⁹ where the weighting will decrease as the observed data becomes less commensurate with the historical data.⁷⁰ Computationally, the Bayesian approach will require the use of Markov Chain Monte Carlo (MCMC) methods and, as a result, may present difficulties with respect to timely analysis.
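A minimal sketch of how a prior on the ICC might enter the error rate calculation is given below. For brevity it averages an analytic power approximation over draws from the prior, rather than running a full simulation or MCMC analysis. The Beta(2, 38) prior (mean 0.05), the effect size and the design are hypothetical choices made purely for illustration.

```python
import math
import random
from statistics import NormalDist, mean


def analytic_power(k, m, icc, effect=0.4, sd=1.0, alpha=0.05):
    """Normal-approximation power for k clusters per arm, m patients each."""
    z = NormalDist()
    design_effect = 1.0 + (m - 1) * icc
    se = sd * math.sqrt(2.0 * design_effect / (k * m))
    return z.cdf(effect / se - z.inv_cdf(1.0 - alpha))


def prior_averaged_power(k, m, prior_a=2.0, prior_b=38.0, n_draws=2000, seed=1):
    """Average power over ICC values drawn from an assumed Beta prior."""
    rng = random.Random(seed)
    draws = [rng.betavariate(prior_a, prior_b) for _ in range(n_draws)]
    return mean([analytic_power(k, m, icc) for icc in draws])


if __name__ == "__main__":
    k, m = 12, 10
    print(f"power at prior mean ICC: {analytic_power(k, m, 0.05):.3f}")
    print(f"prior-averaged power:    {prior_averaged_power(k, m):.3f}")
```

Comparing the two printed values shows how far a plug-in calculation at a single ICC value can differ from one that acknowledges uncertainty in that value.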

3.2.2 Sample size

In addition to leading to complex statistical models, clustered trial designs present difficulties when interpreting the notion of sample size. In phase II designs, sample size is commonly used as a metric with which to compare the quality of any two trial specifications. Typically, the setting of trial parameters is done in such a way as to minimise sample size subject to type I and II error rates remaining within nominal bounds. Trials with clustering, however, introduce further measures to be minimised by the trial designer. For example, the OK-D study involves k research nurses, each of whom has been assigned m patients. We wish for both k and m to be kept as low as possible whilst ensuring error rates remain within nominal bounds, but these measures are clearly in conflict: reducing one will require increasing the other in order to maintain error rates.

One approach to this problem is to combine the measures into a single weighted combination. This may be achieved through translating each measure to a common scale, such as cost.⁷¹ This would then allow one to focus on minimising cost (subject to constrained error rates). Where such a transformation is not available or appropriate, one may still employ a weighted combination method, although it may be challenging to elicit and represent the preferences of the decision maker(s) in this form. An alternative approach would be to set a limit on one measure, so that the other may be minimised subject to this constraint. For example, one could look for the trial with smallest m such that $k \leq 5$ and error rates remain within nominal bounds. Both methods induce an ordering of preference on the set of all possible trial specifications, thus defining a single best specification. A third approach would be to consider the minimisation of m and k as independent measures, and attempt to identify a set of trial specifications representing a range of potential trade-offs between them whilst maintaining error rates within nominal bounds. This technique, known as Pareto optimisation,⁷² may be a more realistic reflection of trial design in practice, where it is common for a range of scenarios and options to be explored and presented to the decision maker(s) before a final trial specification is selected. More generally, it should be noted that the error rates of trial configurations are measures which we aim to minimise, and that a constrained approach is typically used (e.g. requiring $\alpha < 0.1$) in addressing them. The benefits of relaxing error rate constraints to encourage the designer to trade off different performance measures have been illustrated previously.⁷³ Furthermore, this general framework would extend easily to allow for further objectives to be specified. For example, as illustrated by Jung et al.,⁷⁴ the specification of a two-stage trial following the Simon²⁵ design could consider minimising both the expected sample size and the maximal sample size simultaneously.
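The three ways of handling k and m discussed above (a weighted cost, a constraint on one measure, and a Pareto set) can be compared in a short Python sketch. It assumes a hypothetical two-arm design with k clusters per arm and m patients per cluster, a normal-approximation power formula in place of simulation, and illustrative costs; none of these values come from the OK-D study, and all function names are introduced here for illustration.

```python
import math
from statistics import NormalDist


def analytic_power(k, m, icc=0.05, effect=0.4, sd=1.0, alpha=0.05):
    """Normal-approximation power for k clusters per arm, m patients each."""
    z = NormalDist()
    se = sd * math.sqrt(2.0 * (1.0 + (m - 1) * icc) / (k * m))
    return z.cdf(effect / se - z.inv_cdf(1.0 - alpha))


def feasible_designs(k_max=30, m_max=40, target=0.8):
    """All (k, m) pairs in the search region whose power meets the target."""
    return [
        (k, m)
        for k in range(2, k_max + 1)
        for m in range(2, m_max + 1)
        if analytic_power(k, m) >= target
    ]


def cheapest_design(designs, cost_cluster=500.0, cost_patient=50.0):
    """Weighted combination: minimise an assumed total cost per arm."""
    return min(designs, key=lambda d: cost_cluster * d[0] + cost_patient * d[0] * d[1])


def smallest_m_given_k_limit(designs, k_limit):
    """Constrained approach: minimise m subject to k <= k_limit."""
    candidates = [m for k, m in designs if k <= k_limit]
    return min(candidates) if candidates else None


def pareto_front(designs):
    """Designs for which no other feasible design is at least as small in both k and m."""
    front = []
    for k, m in designs:
        dominated = any(
            k2 <= k and m2 <= m and (k2, m2) != (k, m) for k2, m2 in designs
        )
        if not dominated:
            front.append((k, m))
    return sorted(front)


if __name__ == "__main__":
    designs = feasible_designs()
    print("cheapest:", cheapest_design(designs))
    print("smallest m with k <= 8:", smallest_m_given_k_limit(designs, 8))
    print("Pareto front:", pareto_front(designs))
```

The first two functions each return a single design, whereas the Pareto front returns the whole set of trade-offs, which could then be presented to the decision maker(s).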

3.2.3 Design space

In section 3.1.2, additional complexity in the specification of decision rules was shown to lead to a more difficult optimisation problem due to an increased number of parameters. Similarly, increasing complexity in terms of multilevel structures due to clustering will also require further parameters or dimensions to be considered when searching for optimal trial specifications,⁷¹ and so again it may be beneficial to implement sophisticated optimisation routines rather than exhaustively searching through all possible options. Practically, the impact of increased numbers of design parameters may be limited by bounds on their values. For example, the number of therapists available to deliver an intervention may be fixed, and so when designing the trial one will not have to consider its variation. While such a feature will lead to a simpler optimisation problem, it may also lead to difficulties with regard to parameter estimation and inference.

4 Conclusions and further work

Currently, guidelines for the development and evaluation of complex interventions suggest that early phase experimental work focuses on assessing the feasibility and optimal design of a planned phase III definitive RCT. This contrasts with the drug development setting, where phase II trials are commonly used as a screening mechanism, designed to assess the efficacy of a new treatment and decide if a phase III trial will be worth conducting.

In this paper we have considered how the efficacy of complex interventions could be assessed in the context of current early phase feasibility or pilot studies. With reference to a range of phase II trial designs, challenges to their adaptation to the complex intervention setting have been discussed. The presence of multiple endpoints on which a decision must be based, and the clustering of outcomes in multilevel data structures, have been reviewed in detail. Two recurring themes have emerged. Firstly, the potential benefits of Bayesian methods have been highlighted in the context of decision theoretic approaches to trial design, incorporating uncertainty in trial design parameters and providing robust methods of estimation when only limited numbers of clusters are available. Secondly, we have emphasised the practical need for a sophisticated approach to defining and locating the ‘optimal’ trial specification for a given problem, in order that the best possible trial specification can be determined in a timely and robust manner.

In addition to difficulties arising from multiple endpoints and clustering, there remain several other features which could be explored in future work. One could consider widening the set of decisions of the study from the simple {stop, go} to encompass the refining of the intervention’s components or parameters,⁶ or to include the design specification of the planned phase III study in response to feasibility findings. Further details such as the impact of learning curves could be explored, and the appropriate place of efficacy assessment in the larger development and evaluation framework proposed by the MRC¹ should be considered.

Footnotes

Acknowledgements

The authors wish to thank the OK-Diabetes study team (NIHR HTA grant reference 10/102/03) for helpful discussions that shaped the scope of this paper.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Duncan Wilson is funded by a Research Methods Fellowship from the National Institute for Health Research.

References

1. Craig, Dieppe, Macintyre. Developing and evaluating complex interventions: The new Medical Research Council guidance. BMJ 2008; 337: a1655.

2. Walwyn RAE. Therapist variation within meta-analyses of psychotherapy trials. University of Manchester, UK, 2010.

3. Petticrew. When are complex interventions ‘complex’? When are simple interventions ‘simple’? Eur J Public Health 2011; 21: 397–398.

4. Campbell, Fitzpatrick, Haines. Framework for design and evaluation of complex interventions to improve health. BMJ 2000; 321: 694–696.

5. Barkun, Aronson, Feldman. Evaluation and stages of surgical innovations. Lancet 2009; 374: 1089–1096.

6. Collins, Murphy, Nair. A strategy for optimizing and evaluating behavioral interventions. Ann Behav Med 2005; 30: 65–73.

7. Collins, Chakraborty, Murphy. Comparison of a phased experimental approach and a single randomized clinical trial for developing multicomponent behavioral interventions. Clin Trials 2009; 6: 5–15.

8. Arain, Campbell, Cooper. What is a pilot or feasibility study? A review of current practice and editorial policy. BMC Med Res Methodol 2010; 10: 67.

9. Thabane, Chu. A tutorial on pilot studies: The what, why and how. BMC Med Res Methodol 2010; 10: 1.

10. Billingham, Whitehead, Julious. An audit of sample sizes for pilot and feasibility trials being undertaken in the United Kingdom registered in the United Kingdom Clinical Research Network database. BMC Med Res Methodol 2013; 13: 104.

11. Julious. Sample size of 12 per group rule of thumb for a pilot study. Pharmaceut Stat 2005; 4: 287–291.

12. Browne. On the use of a pilot sample for sample size determination. Stat Med 1995; 14: 1933–1940.

13. Lancaster, Dodd, Williamson. Design and analysis of pilot studies: Recommendations for good practice. J Eval Clin Pract 2004; 10: 307–312.

14. Arnold, Burns, Adhikari. The design and interpretation of pilot trials in clinical research in critical care. Crit Care Med 2009; 37: S69–S74.

15. Lee, Whitehead, Jacques. The statistical interpretation of pilot trials: Should significance thresholds be reconsidered? BMC Med Res Methodol 2014; 14: 41.

16. Bland. Cluster randomised trials in the medical literature: Two bibliometric surveys. BMC Med Res Methodol 2004; 4: 21.

17. Walwyn REA, Roberts. Therapist variation within randomised trials of psychotherapy: Implications for precision, internal and external validity. Stat Meth Med Res 2010; 19: 291–315.

18. Roberts. Design and analysis of clinical trials with clustering effects due to treatment. Clin Trials 2005; 2: 152–162.

19. Serlin, Wampold, Levin. Should providers of treatment be regarded as a random factor? If it ain’t broke, don’t “fix” it: A comment on Siemer and Joormann (2003). Psychol Meth 2003; 8: 524–534.

20. House A, Ajjan R, Bryant L, et al. Managing with learning disability and diabetes. HTA protocol, http://www.nets.nihr.ac.uk/projects/hta/1010203 (2014, accessed 1 June 2015).

21. Mariani, Marubini. Design and analysis of phase II cancer trials: A review of statistical methods and guidelines for medical researchers. Int Stat Rev 1996; 64: 61–88.

22. Fleming. One-sample multiple testing procedure for phase II clinical trials. Biometrics 1982; 38: 143–151.

23. A'Hern. Sample size tables for exact single-stage phase II designs. Stat Med 2001; 20: 859–866.

24. Brown, Gregory, Twelves. Designing phase II trials in cancer: A systematic review and guidance. Br J Cancer 2011; 105: 194–199.

25. Simon. Optimal two-stage designs for phase II clinical trials. Control Clin Trials 1989; 10: 1–10.

26. Simon, Wittes, Ellenberg. Randomized phase II clinical trials. Cancer Treat Rep 1985; 69: 1375–1381.

27. Jung S-H. Randomized phase II trials with a prospective control. Stat Med 2008; 27: 568–583.

28. Bryant, Day. Incorporating toxicity considerations into the design of two-stage phase II clinical trials. Biometrics 1995; 51: 1372–1383.

29. Stallard, Thall, Whitehead. Decision theoretic designs for phase II clinical trials with multiple outcomes. Biometrics 1999; 55: 971–977.

30. Sill, Rubinstein, Litwin. A method for utilizing co-primary efficacy outcome measures to screen regimens for activity in two-stage Phase II clinical trials. Clin Trials 2012; 9: 385–395.

31. Tan S-B, Machin. Bayesian two-stage designs for phase II clinical trials. Stat Med 2002; 21: 1991–2012.

32. Brunier, Whitehead. Sample sizes for phase II clinical trials derived from Bayesian decision theory. Stat Med 1994; 13: 2493–2502.

33. Stallard. Optimal sample sizes for phase II clinical trials and pilot studies. Stat Med 2012; 31: 1031–1042.

34. Conaway, Petroni. Bivariate sequential designs for phase II trials. Biometrics 1995; 51: 656–664.

35. Conaway, Petroni. Designs for phase II trials allowing for a trade-off between response and toxicity. Biometrics 1996; 52: 1375–1386.

36. Robertson, Wright, Dykstra. Order restricted statistical inference. New York: John Wiley and Sons, 1988.

37. Tournoux, De Rycke, Médioni. Methods of joint evaluation of efficacy and toxicity in phase II clinical trials. Contemp Clin Trials 2007; 28: 514–524.

38. Sargent, Chan, Goldberg. A three-outcome design for phase II clinical trials. Control Clin Trials 2001; 22: 117–125.

39. Lindley. The choice of sample size. J Roy Stat Soc Ser D 1997; 46: 129–138.

40. Spiegelhalter, Freedman, Parmar MKB. Bayesian approaches to randomized trials. J Roy Stat Soc Ser A 1994; 157: 357–416.

41. Berger. Statistical decision theory and Bayesian analysis, 2nd ed. New York: Springer-Verlag, 1985.

42. Thall, Nguyen, Braun. Using joint utilities of the times to response and toxicity to adaptively optimize schedule-dose regimes. Biometrics 2013; 69: 673–682.

43. Adcock. Sample size determination: A review. J Roy Stat Soc Ser D 1997; 46: 261–283.

44. Wolpert, Macready. No free lunch theorems for optimization. IEEE Trans Evolut Comput 1997; 1: 67–82.

45. Mebane W, Sekhon. Genetic optimization using derivatives: The rgenoud Package for R. J Stat Softw 2011; 42: 1–26.

46. Roberts. The implications of variation in outcome between health professionals for the design and analysis of randomized controlled trials. Stat Med 1999; 18: 2605–2615.

47. Schnurr, Friedman, Engel. Cognitive behavioral therapy for posttraumatic stress disorder in women: A randomized controlled trial. JAMA 2007; 297: 820–830.

48. Goldstein. Multilevel statistical models, 3rd ed. London: Arnold, 2003.

49. Cohen, Mannarino. A treatment outcome study for sexually abused preschool children: Initial findings. J Am Acad Child Adolesc Psychiatry 1996; 35: 42–50.

50. Wampold, Serlin. The consequence of ignoring a nested factor on measures of effect size in analysis of variance. Psychol Meth 2000; 5: 425–433.

51. Siemer, Joormann. Power and measures of effect size in analysis of variance with fixed versus random nested factors. Psychol Meth 2003; 8: 497–517.

52. Siemer M and Joormann J. Assumptions and consequences of treating providers in therapy studies as fixed versus random effects: Reply to Crits-Christoph, Tu, and Gallop (2003) and Serlin, Wampold, and Levin (2003). Psychol Meth 2003; 8: 535–544.

53. Lee, Thompson. The use of random effects models to allow for clustering in individually randomized trials. Clin Trials 2005; 2: 163–173.

54. McCulloch, Searle, Neuhaus. Generalized, linear, and mixed models, 2nd ed. New York: Wiley, 2008.

55. Landau, Stahl. Sample size and power calculations for medical studies by simulation when closed form expressions are not available. Stat Meth Med Res 2013; 22: 324–345.

56. Donner, Klar. Design and analysis of cluster randomization trials in health research. London: Arnold Publishers, 2000.

57. Efficient error determination in sequential clinical trial design. J Comput Graph Stat 2008; 17: 925–948.

58. Browne WJ, Lahi MG and Parker RM. A guide to sample size calculations for random effect models via simulation and the MLPowSim Software Package, 2009.

59. Hooper. Versatile sample-size calculation using simulation. Stata J 2013; 13: 21–38.

60. Smith, Marshall. Importance of protocols for simulation studies in clinical drug development. Stat Meth Med Res 2010; 20: 613–622.

61. Jones. A taxonomy of global optimization methods based on response surfaces. J Global Optim 2001; 21: 345–383.

62. Roustant, Ginsbourger, Deville. DiceKriging, DiceOptim: Two R Packages for the analysis of computer experiments by Kriging-based metamodeling and optimization. J Stat Softw 2012; 51: 1–55.

63. Martinez-Cantin R. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. CoRR 2014; abs/1405.7430.

64. Campbell, Elbourne, Altman. CONSORT statement: Extension to cluster randomised trials. BMJ 2004; 328: 702–708.

65. Spiegelhalter. Bayesian methods for cluster randomized trials with continuous responses. Stat Med 2001; 20: 435–452.

66. Turner, Omar, Thompson. Bayesian methods of analysis for cluster randomized trials with binary outcome data. Stat Med 2001; 20: 453–472.

67. Turner, Toby Prevost, Thompson. Allowing for imprecision of the intracluster correlation coefficient in the design of cluster randomized trials. Stat Med 2004; 23: 1195–1214.

68. Turner, Thompson, Spiegelhalter. Prior distributions for the intracluster correlation coefficient, based on multiple previous estimates, and their application in cluster randomized trials. Clin Trials 2005; 2: 108–118.

69. Hobbs, Carlin, Sargent. Adaptive adjustment of the randomization ratio using historical control data. Clin Trials 2013; 10: 430–440.

70. Hobbs, Carlin, Mandrekar. Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials. Biometrics 2011; 67: 1047–1056.

71. Hox. Multilevel analysis: Techniques and applications. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc., 2002.

72. Miettinen. Nonlinear multiobjective optimization. Dordrecht: Kluwer Academic Publishers, 1998.

73. Khan, Sarker S-J, Hackshaw. Smaller sample sizes for phase II trials based on exact tests with actual error rates by trading-off their nominal levels of significance and power. Br J Cancer 2012; 107: 1801–1809.

74. Jung S-H, Carey, Kim. Graphical search for two-stage designs for phase II clinical trials. Control Clin Trials 2001; 22: 367–372.