17th Annual University of Pennsylvania Conference on statistical issues in clinical trials – Covariate adjustment in randomized clinical trials: New methods and applications (Afternoon panel discussion)

ALISA STEPHENS: In addition to our three presenters, we are joined by Drs Frank Harrell and Dylan Small. Frank Harrell is a professor of Biostatistics at Vanderbilt University. He is an expert biostatistics advisor to the Food and Drug Administration (FDA) Center for Drug Evaluation and Research, a fellow of the American Statistical Association (ASA), and winner of the Association’s W.J. Dixon Award for Excellence in Statistical Consulting. Dylan Small is Professor and Chair of the Department of Statistics and Data Science and holds the named Universal Furniture Chair at the Wharton School. He is a senior fellow of the Leonard Davis Institute, an ASA fellow, and an IMS Medallion Lecturer. His research interests include causal inference, design and analysis of experiments, measurement error, medicine, and economics.

FRANK HARRELL: I really enjoyed the three talks, and it’s always a pleasure when you hear intelligent people speak about good research and putting research to really great uses in the world. I think there’s a lot of general issues to discuss. And I don’t know how deeply to get into these because everyone has opinions about how statistics should be practiced and where we should be spending our time. And I definitely have a lot of opinions about that. But one of my opinions is about the role of machine learning. We’re just now really understanding it. And a couple of the speakers used machine learning in their methods. We’re just now getting close to a sufficient number of comparative studies of machine learning versus statistical models. It’s pretty clear that a lot of the machine learning advantages were hyped. The more studies we see that compare regular statistical models with machine learning, the better statistical models look if you’re not doing imaging analysis or something that has a high signal-to-noise ratio. In biomedical research, we usually have very low signal-to-noise ratios, nothing like imaging analysis, nothing like a self-driving car, which is easier to implement than what we do in the presence of high patient-to-patient variability.

When statistical models are used in a smart way, they actually are far more flexible than some will admit. When you think about covariate adjustment using statistical models, you can actually be quite flexible in the ways that matter. There’s a lot of ways you can make mistakes, but in what really matters for getting precision and getting an answer that is reliable and getting good confidence intervals and so on, statistical models do a pretty amazing job. So another general comment about how statistics is practiced is that whenever you’re doing something new or complicated, I think you’re obligated to present comparative studies with something that’s more like my age-group methodology. And we really don’t see enough comparative studies of old and new statistical methods. I think these three speakers are not really bucking the trend enough for my taste in terms of doing the kind of comparative studies that show whether the new methods are worth all that trouble.

This relates to a third item, which is we’re at a crossroads right now because we’ve been preaching covariate adjustment to the clinical trial world for decades and decades. And we’re starting to make progress; the guidance document that came out from the FDA was a really big plus in this way. But now we’re giving mixed messages to the clinical trial world. We’re telling them covariate adjustment is fantastic. But now we’ve got causal calculus. We need to bring that into the equation. We need to bring in machine learning. We need to bring targeted maximum likelihood estimation into the equation. And it creates kind of uncertainty in my biased view because we’re telling people that covariate adjustment is underutilized by a huge amount. And by the way, to use it, you need to unlearn everything you learned, and you need to learn something totally new and very complex. And I don’t think that’s the case at all. So I would like to see a stronger case made for going to the amount of trouble that we’re seeing the new methodology go to.

I’m a bit biased by my age, but also by the fact that I’ve spent my career running simulations and trying to find out how things work. And I see the constant underestimation of model uncertainty as a problem in statistical inference. There was a paper that came out a couple of years ago that studied targeted maximum likelihood estimation in terms of whether it gives valid standard errors of the treatment effect, or of whatever your variable of interest is.¹ This paper found that it did not. It gave pretty grossly underestimated standard errors. And then, they had a fix for that, which was to enclose the targeted minimum loss-based estimator (TMLE) in a bootstrap loop. They had to do 100 repetitions. So when the TMLE took a week to run, now you’ve got it running 100 weeks unless you expand it to just lots of parallel processing, even more than you did before. One needed an outer bootstrap loop to fix the problem with the standard error due to the targeted maximum likelihood estimator. And this is something that is also related to studies of doubly robust estimators. Some researchers have studied doubly robust estimators in kind of real-world settings and found that the benefits are not quite worth the pain in many cases. I’m sure there are exceptions to that.
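The outer bootstrap loop mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the cited paper: `difference_in_means` is a hypothetical stand-in for the expensive TMLE fit, the data are simulated, and the 100 replicates simply mirror the figure quoted in the discussion.

```python
import numpy as np

def difference_in_means(y, a):
    """Hypothetical stand-in estimator; in practice this would be the costly TMLE fit."""
    return y[a == 1].mean() - y[a == 0].mean()

def bootstrap_se(y, a, estimator, n_boot=100, seed=0):
    """Refit the estimator on n_boot resampled data sets and return the SD of the estimates."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        estimates[b] = estimator(y[idx], a[idx])
    return estimates.std(ddof=1)

# Simulated two-arm trial (assumed data, for illustration only).
rng = np.random.default_rng(1)
n = 400
a = rng.integers(0, 2, size=n)
y = 0.5 * a + rng.normal(size=n)

se_boot = bootstrap_se(y, a, difference_in_means)
# Analytic standard error of the difference in means, for comparison.
se_analytic = np.sqrt(
    y[a == 1].var(ddof=1) / (a == 1).sum() + y[a == 0].var(ddof=1) / (a == 0).sum()
)
print(f"bootstrap SE: {se_boot:.3f}, analytic SE: {se_analytic:.3f}")
```

Each bootstrap replicate reruns the whole estimator, which is why the computing cost scales with the number of replicates unless the loop is parallelized.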

And so that’s the high-level comments I wanted to make, and now I have some comments about the specific papers. I’m always a stickler about nomenclature/terminology. One thing that a couple of the speakers talked about was Type I error. I’m trying to convince the world that the word error is an inappropriate word in that context because it has nothing to do with an error. It’s just the probability of triggering an assertion of effect when any assertion would be, by definition, wrong. So that’s all that alpha is. I call it Type I assertion probability because when you call it an error probability, you’re endowing it with value that it doesn’t have. It doesn’t have anything to do whatsoever with the probability of making the wrong decision, which is what we’re all about as decision-makers. Bingkai used the term re-randomization. This might be something I’ve missed in the literature, but what I’ve heard the word re-randomization used for is in the area of SMART trials where you’re not changing your mind about the initial randomization. You’re actually randomizing people, finishing a phase of study, and then, you re-randomize them. So I’m not sure that re-randomization is exactly the right term. I think everything you’re doing is really valuable. But I think there’s probably a better term for that.

I think the other really big picture item is what the estimand should be for a randomized clinical trial. I follow a lot of arguments on social media about causal calculus, and I’m involved in arguments with Judea Pearl and several other people who really know what they’re talking about with regard to causal calculus. There seems to be two camps. There are people that understand experimental design, and there are people that understand causal calculus, and there’s not very many people who understand both of those simultaneously. And if there were people that understood both of those, we would be much farther along because a lot of what’s being put forward into the causal calculus world is based on a misunderstanding of how randomized trials are designed. Now, an exception to this is if you have a linear model and you have no interactions with treatment. In that very specific case, the unconditional treatment effect equals the conditional treatment effect, and that equals a population average treatment effect. So in the special case of linear models without interaction, there’s really not an argument like there is with non-linear models. Once you get into absolute risk reduction models, hazard ratios, and all of these things, it’s a different ballgame.
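The linear-model exception can be written out briefly. The notation is assumed here for illustration: $A$ is the randomized treatment indicator, $X$ the baseline covariates, and $Y$ the outcome.

$$
E[Y \mid A, X] = \beta_0 + \beta_1 A + \beta_2^\top X
\quad\Longrightarrow\quad
E[Y \mid A=1, X=x] - E[Y \mid A=0, X=x] = \beta_1 \ \text{for every } x ,
$$

$$
E[Y \mid A=1] - E[Y \mid A=0] = \beta_1 + \beta_2^\top \{ E[X \mid A=1] - E[X \mid A=0] \} = \beta_1 ,
$$

where the last equality uses the fact that randomization makes $X$ independent of $A$. The conditional effect, the unconditional effect, and the average of the conditional effects over any covariate distribution therefore coincide. With a non-linear link, for example $\operatorname{logit} P(Y=1 \mid A, X) = \beta_0 + \beta_1 A + \beta_2^\top X$, the conditional log odds ratio is $\beta_1$ for every $x$, but the marginal log odds ratio generally differs from $\beta_1$ when $\beta_2 \neq 0$.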

The idea of being interested in some sort of population risk difference is really at odds with the way clinical trials are designed. An average treatment effect would be a relevant thing to ask for if you’re interested in group decision-making. If there were no risk factors, so the absolute risk reduction was constant for every type of person, it would also be a relevant thing to estimate. It would also be relevant to estimate if the randomized trial sample was a random sample from the population. There have been several papers that say, “Here is the causal calculus we need for this randomized trial setting. We’re going to assume we have a random sample from the population.” Well, what does that mean? That means you’re going to select people from the population and you’re going to strongarm them to be in your clinical trial and take some drug they don’t want to take or get some surgery they’re not interested in. If you could coerce people to get into a clinical trial, you could select random samples from the population, and then, you’re actually satisfying what causal calculus needs to happen to be interested in the average treatment effect. That’s not the way clinical trials are done at all.

The only time you get samples from a population is when you’re doing very simple clinical trials that involve the Internet, certain behavioral interventions, where you can really – without risk – just grab people, and they’ll agree because it’s a 5-min participation. In clinical trials of drugs and devices, that never applies. It’s always a volunteer effort. You cannot coerce somebody to enroll in a clinical trial. So you’re never going to get a sample in a clinical trial that’s representative of a population. When you look at how people are stating estimands that involve average treatment effect, it’s completely at odds with the way clinical trials are designed. And so the estimand we need to know about is more related to specific patients or specific patient types. I think it was Bingkai who had what he called ANCOVA 2, a model that allows all the interactions between covariates and treatments to be present. If you look at the estimand that we really need, would it be the treatment effect for a specific type of patient?

That specific estimand, the estimand for a specific type of patient, would be constant if there’s no interaction, and it would be non-constant if there are interactions. So that is the estimand we need, but the ANCOVA 2 is going to be a method that for that estimand maximizes the mean squared error. It’s going to give you the worst mean squared error you could possibly have because it’s got too many parameters for that conditional estimand. And so a strategy where you might say, “Here’s a bunch of interacting factors. We’re going to use Bayesian priors or we’re going to use penalized maximum likelihood estimation to shrink the interactions to what will optimally cross-validate or give optimal AIC, whatever criterion you want,” that’s going to provide superior mean squared error. That’s not going to assume that your clinical trial is a random sample from the population. It’s going to give you the conditional estimand you need. This harks back to the morning discussion about that odds ratio of 8 versus, I think it was 4.8. If you know one thing about patients, you get an odds ratio of 8, and if you knew the opposite, it’s also 8; then, 8 is the answer. So odds ratios are interesting quantities in statistics. They don’t logically lead to marginalization.
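The odds ratio point can be checked with a small deterministic calculation. The numbers below are hypothetical and are not the morning session's example: two equally common patient types with control-arm risks of 0.1 and 0.6, and a conditional odds ratio of 8 in each type.

```python
def odds(p):
    return p / (1 - p)

def risk_from_odds(o):
    return o / (1 + o)

conditional_or = 8.0          # same odds ratio within each patient type
control = [0.1, 0.6]          # hypothetical control-arm risks for the two types
weights = [0.5, 0.5]          # assumed share of each type in the trial

# Treated-arm risk within each type implied by the conditional odds ratio.
treated = [risk_from_odds(conditional_or * odds(p)) for p in control]

# Marginal (pooled) risks in each arm, then the marginal odds ratio.
marginal_control = sum(w * p for w, p in zip(weights, control))
marginal_treated = sum(w * p for w, p in zip(weights, treated))
marginal_or = odds(marginal_treated) / odds(marginal_control)

print(f"conditional OR in each type: {conditional_or}")
print(f"marginal OR after pooling:   {marginal_or:.2f}")
```

Under these assumed inputs the pooled odds ratio is well below 8 even though the odds ratio is exactly 8 for every patient type, which is the non-collapsibility the speaker is pointing to.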

This is another vote to me for conditional estimates. A general point I’d like to make also is that any method that’s proposed that doesn’t come with a sample size calculator – a sample size estimation procedure, I don’t trust that method very much. There’s a lot of methods that are really cool, but when you calculate the sample size needed for them to be stable and accurate, the sample size is bigger than what we encounter in a lot of our clinical trials. So I’ll put a challenge to everyone developing methods that they need to accompany that with a sample size calculator or simulator, which means you need to have a criterion to judge success against. And if that criterion is certain mean squared error of a treatment effect or whatever it is, what’s the sample size needed to achieve that? I attended a talk a few weeks ago in which the speaker acknowledged that the sample sizes they had in their example data were not adequate for the method they were proposing people would use. So we have to bring reality back into the picture.
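A sample size simulator of the kind called for here can be very small. The sketch below assumes a two-arm trial with equal allocation, a normally distributed outcome, a difference-in-means estimator, and a target root mean squared error of 0.25; all of these values and names are illustrative choices, not anything proposed by the panel.

```python
import numpy as np

def simulated_rmse(n, effect=0.5, sd=1.0, n_sims=2000, seed=0):
    """Root mean squared error of the difference in means at total sample size n."""
    rng = np.random.default_rng(seed)
    half = n // 2
    y = rng.normal(scale=sd, size=(n_sims, n))
    y[:, :half] += effect  # first half of each simulated trial is the treated arm
    estimates = y[:, :half].mean(axis=1) - y[:, half:].mean(axis=1)
    return float(np.sqrt(np.mean((estimates - effect) ** 2)))

target_rmse = 0.25                        # assumed success criterion
candidate_sizes = [20, 50, 100, 200, 400]
rmse_by_n = {n: simulated_rmse(n) for n in candidate_sizes}

# Smallest candidate size that meets the criterion (None if none does).
required_n = next((n for n in candidate_sizes if rmse_by_n[n] <= target_rmse), None)
print(rmse_by_n, required_n)
```

For a new method, `simulated_rmse` would wrap the proposed estimator and a data-generating process that reflects the intended application, so that the criterion and the sample size needed to meet it are stated explicitly.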

I have a chapter on analysis of covariance in my online book, Biostatistics for Biomedical Research. And it has a Bayesian SAP for covariate adjustment that uses flexible statistical models. I submit that if you looked at that particular plan, which handles certain complexities, like nonlinearities and so on, that that plan is really very simple. It solves most of the problems that we’ve set out to solve today, except it doesn’t deal with missing data. And so many things that we develop in statistics, to me, are overly complicated just because we work so hard not to be Bayesian. So I think I’ll stop here. I think I’ve caused enough trouble already.

DYLAN SMALL: Thanks, Frank. A lot of insightful comments, that’s a hard act to follow. I’ll just add a few things. Yes, first, congratulations to all the speakers. I thought they were all really interesting talks. I’ll just pose a few questions. One thing I wondered in Ting’s talk, and that Laura mentioned as well. I was intrigued that you said that the cross-fitting variance estimator was always conservative. And the standard seemed to work well in your simulations.

I saw, Ting, you mentioned something about if you use machine learning methods in your covariate adjustment that you would have to use cross-fitting. I wondered if you had any experience with whether it was conservative or not. I wondered in terms of, Ting, your covariate adjustment, kind of, I guess, what if you do get – if you have a situation – I mean, people often want to see the unadjusted estimate. If it is different than the adjusted estimate, then you’d like to be able to transparently show people why. And I wondered, in terms of the methods you’ve developed, is there a transparent diagnostic that you can show people where the covariate adjustment changed the estimate and why it’s better in some sense? So then moving on to Bingkai’s talk. I know there’s an older term for this type of randomization – Frank, do you know what it is called?

FRANK HARRELL: Constrained.

DYLAN SMALL: Yes, covariate-constrained randomization. You can think of it as you generate initially all your possible randomizations, and then, you can think about which ones you would consider acceptable. You decide that set in advance, that was, I think the original idea. Re-randomization is just a different way to implement this. It’s really the same thing. But I think the advantage of thinking about first generating the randomizations in advance is you can – I mean, you could discover if there’s any – I mean, you could look at your set of acceptable randomizations. It’s sort of a more transparent way to do it.
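The idea of generating the randomizations in advance and inspecting the acceptable set can be sketched as below. Everything specific is an assumption made for illustration: eight clusters, one cluster-level covariate, an absolute difference in arm means as the balance score, and keeping the best-balanced 20% of allocations.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n_units = 8
x = rng.normal(size=n_units)  # hypothetical cluster-level covariate

# Enumerate every allocation that puts half of the clusters in the treated arm.
allocations = [frozenset(c) for c in combinations(range(n_units), n_units // 2)]

def imbalance(treated):
    """Absolute difference in covariate means between the two arms."""
    mask = np.zeros(n_units, dtype=bool)
    mask[list(treated)] = True
    return abs(x[mask].mean() - x[~mask].mean())

scores = np.array([imbalance(t) for t in allocations])
threshold = np.quantile(scores, 0.2)  # arbitrary cut-off chosen in advance
acceptable = [t for t, s in zip(allocations, scores) if s <= threshold]

# Diagnostic: share of acceptable allocations placing each pair in the same arm.
same_arm = {
    (i, j): float(np.mean([(i in t) == (j in t) for t in acceptable]))
    for i, j in combinations(range(n_units), 2)
}
always_together = [pair for pair, share in same_arm.items() if share == 1.0]
never_together = [pair for pair, share in same_arm.items() if share == 0.0]

# The actual randomization draws one allocation from the acceptable set.
chosen = acceptable[rng.integers(len(acceptable))]
print(len(allocations), len(acceptable), always_together, never_together)
```

Pairs that are always or never in the same arm across the acceptable set are the kind of red flag that is visible before the trial starts when the set is enumerated up front.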

But it also can include some constraints, for example, especially cluster randomized trials, you don’t want the travel time to be too large to go to different sites and stuff. I mean, another issue I wondered about is – and maybe this doesn’t come up in the things you were talking about because they’re not really clustered. But I mean, if you did come up with the acceptable randomizations being a sort of small set, I mean, then there’s issues of, for example, what if a pair of units is always going to end up in the same arm? You’d want to know that in advance. And then, moving on to Laura’s talk. So probably could guess this, Laura. A lot of the work I did in the context of these kinds of things is randomization inference, particularly in the context of cluster randomized trials. I saw it in your paper. You cited our work, and you said that was just for the sharp null. But I wondered if you thought about it – there have been some recent developments where methods are exact for the sharp null that are asymptotically correct for the weak null.

Just wondered your thoughts about those methods. Laura, I wonder if you had any experience with kind of – you talked about doing one covariate in the trial with 32 units. And you mentioned you can use simulations to figure out what to do. Can you give any guidance in terms of how many covariates to look at? Have you run into situations where they can break down or be misleading? Is there any guidance you can give about sort of diagnostics to figure out when do I need to do more simulations, when might there be a problem?

TING YE: First, for Dylan’s question, the first is about cross-fitting the variance estimator. So in our work, we did try binary outcomes and count outcomes; then, we compared machine learning methods like random forest or ensemble methods with standard generalized linear models (GLMs), with logistic regression and Poisson regression. So first, echoing Frank’s point that, yes, in a lot of scenarios, we don’t see big improvement from machine learning methods, especially for binary outcomes. For count outcomes, there is some more improvement. So yes, we need to define pretty severe violations of logistic regression or Poisson regression in order for machine learning to work well. But when we do that, we need to do cross-fitting for the variance estimation and for the point estimate. In our simulation, the coverage is fine. The estimators are standard cross-fitted augmented inverse probability weighting (AIPW) estimators. The coverage is close to 95%, so that works well. But one thing I want to add here is that, for cross-fitting, because there is randomness in the cross-fitting, like dividing data into five folds, it’s random. So what we did – also recommended in the Chernozhukov double/debiased machine learning (DML) paper – is to repeat this several times and take the median of the cross-fitted estimates and the median of the variance estimates, and that performs better.
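The repeated cross-fitting described above can be sketched as follows. This is an illustration under stated assumptions rather than the speaker's implementation: arm-specific ordinary least squares stands in for the machine learning outcome model, the randomization probability is taken as known (0.5), the data are simulated, and the number of repeats is arbitrary. The plain median of the variance estimates is used, as described in the discussion; a variant that also adds the squared deviation of each split's estimate from the median is omitted here.

```python
import numpy as np

def crossfit_aipw(y, a, x, rng, n_folds=5, pi=0.5):
    """One random K-fold split; returns (AIPW estimate, variance estimate)."""
    n = len(y)
    folds = rng.permutation(n) % n_folds      # random fold labels
    design = np.column_stack([np.ones(n), x])
    mu1, mu0 = np.empty(n), np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        for arm, mu in ((1, mu1), (0, mu0)):
            rows = train & (a == arm)         # fit on the other folds, one arm at a time
            beta, *_ = np.linalg.lstsq(design[rows], y[rows], rcond=None)
            mu[test] = design[test] @ beta    # predict on the held-out fold
    phi = mu1 - mu0 + a * (y - mu1) / pi - (1 - a) * (y - mu0) / (1 - pi)
    return phi.mean(), phi.var(ddof=1) / n

def median_crossfit(y, a, x, n_repeats=11, seed=0):
    """Repeat the random split and aggregate by medians."""
    rng = np.random.default_rng(seed)
    results = np.array([crossfit_aipw(y, a, x, rng) for _ in range(n_repeats)])
    return float(np.median(results[:, 0])), float(np.median(results[:, 1]))

# Simulated trial with a true average treatment effect of 1.0 (assumed data).
rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=(n, 2))
a = rng.integers(0, 2, size=n)
y = 1.0 * a + x @ np.array([1.0, -1.0]) + rng.normal(size=n)

estimate, variance = median_crossfit(y, a, x)
print(f"estimate: {estimate:.3f}, standard error: {variance ** 0.5:.3f}")
```

Taking medians across repeats reduces the dependence of the reported result on any single random split of the data.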

The second question from Dylan is the unadjusted and adjusted methods. So basically, in our work, covariate adjustment does not change the estimand. They are always the same estimand as the unadjusted method, but they are more precise. So I think that’s exactly one advantage of this covariate adjustment method. It doesn’t change the estimand. So the estimand is predefined to be clinically interesting, then adding covariate adjustment just improves efficiency. I definitely agree that – so it would be beneficial to include, for example, the covariate-adjusted method as the primary analysis, and then also do the unadjusted analysis to see whether the results are sensitive. So that would be beneficial to do. These are my responses to Dylan.

ALISA STEPHENS: Frank as well, please.

TING YE: So Frank has many questions. First, probably about the estimand question. It’s a big question, quite controversial. And I just want to add perhaps one perspective to support the unconditional estimand. I agree that the trial sample is never a random sample from the clinical population. But I think of the trial sample as a random sample from the target trial population, the people who are potentially able to be recruited into the trial. We start from there. And one thing that I want to say perhaps is that everyone thinks of some trial population, because the conditional treatment effect is good for contrasts. But in a trial, we also report specific summaries like response rates or survival probabilities. So those are population summaries for some population. And I think of it as a trial population. And the other thing is, for example, in the estimand framework set up by the ICH E9(R1) (European Medicines Agency Addendum on Estimands and Sensitivity Analysis), one of the attributes of the estimand is also the population, and then getting a population summary. So that’s another view of mine.

The last point I want to make is I felt like the unconditional estimand makes results more coherent with population summaries. For example, in drug label with binary endpoint, I see those summaries of response rate for the treatment arm 30%, response rate for control arm 20%, and then get a conditional odds ratio and having a p value from logistic regression, which is not really coherent with the population summary that’s provided there. So yes, those are just a few points I want to add for the estimand. And finally, I already mentioned a little about my take for the machine learning approaches. They do not work that much better in standard clinical variables. But thinking forward, maybe in the future in clinical trials, we might include those image data to use them to correlate them with outcomes, or use very complex biomarker data, then potentially the machine learning methods will have a place there. Thank you.

ALISA STEPHENS: Thank you. Dr Balzer?

LAURA BALZER: Thank you for the comments. First, I think I want to push back on one comment made by Dr Harrell, that causal inference people have a fundamental misunderstanding of how randomized trials are designed. As I hopefully showcased by my talk, I have been embedded since 2010 in the design and the analysis for randomized trials in a variety of settings with a variety of endpoints, and mostly in resource-limited settings. But the methods developed here are general. I think it also relates to defining what is our estimand of interest. I don’t sit at home behind my computer and come up with this on my own. It is truly multidisciplinary, collaborative, multinational work to figure out: What do we want to do? What is the health outcome we want to make an impact on? We want to end the HIV epidemic and improve community health. How do we get there with the tools that are existing? How do we deliver it to people that need it and respond to their unique needs?

And our goals change over time as new methods come online. By “methods”, here, I mean both biomedical HIV prevention methods and antiretroviral therapy (ART) as well as statistical tools developed as we figure out how to actually design better trials with real-world constraints. Bingkai talked about some of the very cool designs that we can do with re-randomization. Cluster randomized trials are a different ballgame than individually randomized, tightly controlled trials. They have lots of fun quirks that I’ve, again, spent 15 years or so working on. I do want to push back on the idea that causal inference people don’t have an understanding of how trials actually work.

FRANK HARRELL: Could I just add one thing? I’m sorry to interrupt, but I was speaking to the people I interact with in social media and I haven’t had the good luck to interact with you. [laughter]

LAURA BALZER: Next, I’d like to say, again, I don’t come up with the estimand in isolation. I teach multiple courses on causal inference anchored on the Causal Roadmap, which is just the scientific method.²,³ The first step is asking: What do I want to learn? And that’s the hardest thing. At the end of the year, my students do a final project with real data. And they’re like, “Laura, I have this awesome data set. I really want to figure out the effect of this on that. But how do I come up with the actual causal question? What is my estimand of interest?” Right? That’s a hard question. I have the privilege of being a biostatistician. I don’t come up with the question on my own. My colleagues, collaborators, and I come up with the question and the design together and under real-world constraints. I sometimes have a very, very fancy proposal for the design, “We should do this stratified randomization because we can guarantee balance on this and that.” And they’re like, “No, that’s just not possible to do in our setting for these real-world reasons.” And that’s great. It pushes forward both the statistics and the on-the-ground implementation.

Now, I’m looking at Dr Tchetgen Tchetgen, because he worked on one of the large Universal HIV Test and Treat (UTT) trials, where we actually did have population-level estimates.⁴ In SEARCH at least – I think in Ya Tsie – we were able to take a rapid census of our study communities. We knew who was there, and we knew things about them, and we knew who was missing. So we could actually estimate population-level effects, accounting for the fact that not everyone enrolled in our trial and that people who stayed in for the whole trial can be different than the ones who dropped out. I’m not sure if I answered the questions I was supposed to. I will get off my soapbox and just say it’s been such a privilege and honor to work on developing these methods that go hand-in-hand with applied practical papers, again, all motivated by real-world data examples where the existing methods are insufficient or don’t exactly match what was done on the ground.

ALISA STEPHENS: Thank you.

LAURA BALZER: Dylan, do you want me to answer your questions? I’m really excited about randomization inference. I was hoping to talk with Michael Rosenblum because he’s working on this too. It can be incorporated in the same framework and is super cool. So let’s do it! Regarding guidance on the number of covariates and aggressiveness, I think this is where the simulations really come into play. We want a general rule that’s like “for logistic regression, one coefficient for every 10 outcomes,” right? Something that you can just grab onto. But again, the real world is complicated, and things are messy. You might have a rare outcome or it’s imbalanced. This is where plasmode simulations that are more realistic can help inform these decisions and, as you said, give you red flags. In SEARCH, we had 32 communities. So I wasn’t running the LASSO. It was an adjustment for a single covariate, just one. Asking our procedure “Can I just get a little bit more power, please, please, please?” while making sure that we are maintaining the nice statistical properties. Altogether, I advocate for simulations in both observational studies and randomized trials. The real world’s messy. Things are going to come up, and they’re going to inspire new methods in doing so.

ALISA STEPHENS: Thank you. Dr Wang?

BINGKAI WANG: I really regret I didn’t have a pen and paper in my hand. So Laura wrote two pages, but I only have my memory to answer all these questions. I will keep it short. The first thing I want to say is why I chose to talk about re-randomization. I know many of the audience may not be familiar with this randomization scheme. Previously, I did work on stratified randomization, which is well cited and also referred to in the FDA guidance. The reason I want to talk about something new here is because when I work on cluster-randomized trials, I find that this design is commonly used in cluster randomized trials and also in the social sciences. It may be a good opportunity to bring it to an audience of clinical researchers not doing this kind of study. Maybe something new for you to learn and to consider in your future practice. And that’s what we should do, right, in a conference: bring in new ideas, not only educate people about what they should do in their practice. That’s why I’m very excited to talk about this new idea.

And of course, I fully understand there are many issues. Like Dylan said, when the set of acceptable randomizations is too small, the allocation may not be truly random. That’s obviously an issue. And people are trying to address that. And this is my first point. And I really appreciate your patience to learn this new stuff. And the second point is about machine learning. Again, this is related to my first point. We’re here to talk about new things. And my analogy is – we are currently excited about the self-driving car. There are so many questions, so many problems with this new technology. And of course, I fully agree with Frank that before it is on the market, we need to do a lot of empirical study to make sure it is safe in all kinds of situations. But I’m still excited to introduce this new technology, these new techniques, to this group of people who are on the front line of developing new technology. So that’s why I think machine learning still represents a very promising future in randomized trials. Currently, we are probably only on a very small scale in the traditional tabular data era of analyzing clinical trials. But what if in the future, in 10 years, 15 years, we are going to incorporate multimodal data in randomized clinical trials (RCTs)? Is that possible? When that happens, are we prepared to do that? I think this discussion today will be very useful, although it may not be very thorough in discussing all these techniques, but it’s a great discussion to bring this up, to really have us think through it and see what its benefit is, what are the current issues, so that we can be better prepared in the future. Thank you.

ALISA STEPHENS: Well, with that, we’ll open up comments from the floor.

DEVAN MEHROTRA: Devan Mehrotra from Merck. Ting, I had a question for you. And I think Bingkai, you may also add if you want. So in what you call the Analysis of Heterogeneous Covariance, and I think Bingkai, you call the ANCOVA 2, just connecting to the morning session, if you have 1:1 randomization and you have a no-interaction model, so in which case it’s the ANCOVA or the ANCOVA 1, I think. So there, the usual standard errors are just fine. Now, what happens if you have 1:1 randomization? You now have an interaction term in the model. Can you still use the usual standard errors, or do you have to do something else? Because I know when you have the interaction model, you have to center the covariates. Does that cause any issues? And just intuitively, if you cannot use the usual standard errors, what is the intuition behind why it doesn’t work? And will it inflate Type I error under the null? Or not Type I error. Frank doesn’t like that. Type I probability.

TING YE: Thanks for the great question. First, I want to be very precise. The model-based ANCOVA works with 1:1 randomization, two arms. With multiple arms, it doesn’t work as well. And then, the second question for our Analysis of Heterogeneous Covariance, if we center the covariates, then we can directly take the coefficient estimate in front of the treatment indicator to be the ANCOVA estimator. But we cannot directly use the standard error because we center the covariate by the population estimate. So that’s random. So that randomness needs to be appropriately accounted for. So essentially, the variance estimator we derive is from influence function, but it’s essentially the same as you get from a sandwich variance estimator and then appropriately account for that overall covariate average randomness into that.

DEVAN MEHROTRA: If I understood your response correctly, the reason you have to make an adjustment is because when you center the covariates, you’re using the X bars ($\bar{X}$’s) in the sample, which may not be the same as the true mean of X ($\mu_X$) in the population.

TING YE: Exactly.

DEVAN MEHROTRA: So in large samples, is that still an issue?

TING YE: Yes. That’s still an issue because there is still estimation error in $\bar{X}$.
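The exchange above can be illustrated numerically. The sketch below is one reading of the point about centering and not the speakers' derivation: it fits the interaction model with covariates centered at the sample mean, recovers the same estimate from arm-specific regressions, and then forms a variance as the sum of an arm-wise residual part and an extra term $(b_1 - b_0)^\top \hat\Sigma_X (b_1 - b_0)/n$ that reflects the randomness of $\bar{X}$. The simulated data, the variable names, and the exact layout of the variance formula are assumptions made for illustration.

```python
import numpy as np

def ols(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Simulated 1:1 trial with a treatment-by-covariate interaction (assumed data).
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=(n, 2))
a = rng.integers(0, 2, size=n)
y = (1.0 * a + x @ np.array([1.0, 0.5])
     + a * (x @ np.array([1.0, -1.0])) + rng.normal(size=n))

xc = x - x.mean(axis=0)  # centered at the SAMPLE mean, which is itself random
design = np.column_stack([np.ones(n), a, xc, a[:, None] * xc])
est_interacted = ols(design, y)[1]  # coefficient on the treatment indicator

# The same estimate from separate regressions within each arm.
fits = {}
for arm in (0, 1):
    rows = a == arm
    arm_design = np.column_stack([np.ones(rows.sum()), xc[rows]])
    beta = ols(arm_design, y[rows])
    resid = y[rows] - arm_design @ beta
    fits[arm] = (beta, resid.var(ddof=1), rows.sum())
est_arms = fits[1][0][0] - fits[0][0][0]  # difference of fitted values at the sample mean

# Variance pieces: arm-wise residual part, plus the term for estimating the covariate mean.
slope_gap = fits[1][0][1:] - fits[0][0][1:]
var_residual_part = fits[1][1] / fits[1][2] + fits[0][1] / fits[0][2]
centering_term = float(slope_gap @ np.cov(x, rowvar=False) @ slope_gap) / n
var_total = var_residual_part + centering_term
print(est_interacted, est_arms, var_residual_part, centering_term)
```

In this sketch both variance pieces shrink at the rate 1/n, so the share contributed by the centering term does not vanish as the sample grows; it is zero only when the covariate slopes are the same in both arms.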

DEVAN MEHROTRA: Bingkai, did you want to add anything?

BINGKAI WANG: Oh, I just agree with Ting’s comment.

DEVAN MEHROTRA: Thank you.

ALISA STEPHENS: I’ll take a comment from our online audience.

JENNIFER BOBB: This is a question from Jennifer Bobb (Kaiser Permanente Washington Health Research Institute) directed to Drs Balzer and Wang, about cluster randomized trials. The question is: when you have a hierarchical data structure, do you do your analysis at the cluster level or the person level? And then, how do you think about using cluster-level versus individual-level covariates when thinking about the covariate adjustment?

LAURA BALZER: Great. Thank you. Hi, Dr Bobb, wherever you are. This echoes back to what we’ve been saying before: you have to start by figuring out what you want to learn. It is perfectly okay to learn an effect that’s at the cluster level, such as the contrast in HIV incidence, which is defined as the proportion of people in each cluster who acquired HIV. You could also define it as an individual-level risk of acquiring HIV where you pool across all people in the study communities. Both are equally valid causal parameters. You just have to be very clear about what’s the most interesting and relevant. And there are plenty of methods out there that will let you go after either causal parameter. I know Bingkai has worked on them. I worked on them as well, either at the individual-level effect or a cluster-level effect.

Focusing to start on covariate adjustment for efficiency gains (not dealing with missing data), you could use the methods I proposed here and have done before. There’s a paper by Benitez et al. in Stat Med⁵ that shows, again, for a given causal estimand, we can still use the same adaptive pre-specification procedure to choose the combination of covariates – they might be at the individual level; they might be at the cluster level – that gives us the most efficiency for the target parameter. Sometimes you’ll get meaningfully different results; sometimes, you’ll get the same efficiency result, but you won’t harm precision, compared to the unadjusted estimator. Missing data are a whole other fun thing, but I won’t go there.⁵

BINGKAI WANG: I agree with Laura. So the short answer is, yes, you can do either cluster-level or individual-level analysis. Both are valid. The only question is, what do you want about efficiency? There are many different recipes about different models. They have different assumptions. Some are robust. Some are more advanced, like using machine learning. It also depends on the sample size you have. But I would say, please read our paper and also Laura’s paper. You will find many good reviews about what to use and when to use.
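The distinction between a cluster-level and an individual-level parameter can be made concrete with a toy calculation. The sketch below is only an illustration with made-up numbers; the function names and the `(events, size)` pairs are hypothetical, and no covariate adjustment is shown.

```python
def cluster_level_risk(clusters):
    """Average of cluster-specific proportions: every cluster counts equally."""
    return sum(events / size for events, size in clusters) / len(clusters)

def individual_level_risk(clusters):
    """Pooled proportion: every person counts equally."""
    return sum(e for e, _ in clusters) / sum(s for _, s in clusters)

# Hypothetical (events, cluster size) pairs for one trial arm
arm = [(2, 100), (30, 1000), (1, 50)]

print(cluster_level_risk(arm))     # weights the three clusters equally
print(individual_level_risk(arm))  # dominated by the cluster of 1000 people
```

The two summaries coincide when all clusters have the same size and can differ when cluster size is related to risk, which is why the target parameter has to be chosen before the analysis.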

ALISA STEPHENS: Yes. Jeff?

JEFFREY MORRIS: I have a simple question. And I don’t work in this area, so forgive me if it’s naive. In general, we’re talking about covariate-adjusted trials. Is there an assumption that these covariates should all be baseline covariates and not covariates that are observed after baseline? That’s my first question. Then, the follow-up is, if that distinction is not known and kind of enforced by people applying these methods, how do we make sure that we’re not conditioning on mediators or colliders or other things and really messing things up by conditioning on the covariates?

TING YE: I can start first. Yeah. So all the results I talked about, and I believe all the work on covariate adjustment for improved efficiency, require baseline covariates. I think there’s general consensus, in practice, that people should only use baseline covariates. And they are specified in the statistical analysis plan (SAP). And I feel like the regulatory agencies, for example, could review those.

JEFFREY MORRIS: So that’s generally understood, and there’s not a problem with that, in practice, that you see?

TING YE: I haven’t seen any problem, but I don’t see so many trials.

BINGKAI WANG: I’d like to add that – yes. Go ahead.

ALETTA NONYANE: This is Aletta from Johns Hopkins School of Public Health.

ALISA STEPHENS: And please use the microphone.

ALETTA NONYANE: Aletta Nonyane, Johns Hopkins School of Public Health. I just wanted to add to what the gentleman just said now. I think, sometimes, we have to be careful, when we work with our collaborators, to look at and point out if things could possibly be mediators. I think I’ve seen that in practice.

ALISA STEPHENS: Thank you.

BINGKAI WANG: I think Eric talked about this in the morning. Basically, when we just do covariate adjustment in our context, we refer to adjusting for baseline covariates. But in the other context like missing data, it’s probably a good way to account for intermediate outcomes. And for longitudinal studies, we can also model all the outcomes, like intermediate outcomes, and the final outcome together, and regress them on baseline covariates. In this way, you kind of account for, a little bit, intermediate outcomes, but you still account for covariate adjustment.

LAURA BALZER: I think your question was as follows: we all agree on adjusting for baseline covariates, but were they actually measured pre-randomization or not? The gold standard, of course, is to measure everything in a baseline snapshot and then randomize. Again, the reality might be different for a lot of real-world logistical reasons. In cluster randomized trials, we do public lotteries, essentially. We have representatives in both communities come forward. They draw out of a hat. One’s randomized intervention; one’s the control. That happens before we engage in study procedures for individual enrollment and measurement. So we actually do measure some of our baseline covariates after randomization. And that requires a key assumption that the randomization has not impacted the value of those covariates, like baseline HIV prevalence or status. I also think, to speak to your point, this really shows the value of causal graphs and directed acyclic graphs (DAGs) to sit down with your collaborators and think through, “Okay, what was the actual data-generating process? Am I worried that this is a mediator? Should I actually consider this to be a baseline covariate, or is it something that could be impacted by the exposure or the intervention, and I need to think about it in a different way?” Thanks.

STUART POCOCK: I’m just building on the point you just made that people think it’s a baseline, and it’s meant to be a baseline, but it can be recorded later. And that’s very rarely mentioned, so thank you for mentioning it. The other thing is if we go for variables collected post-randomization, it’s a whole new game. And a whole new day could be devoted to that with very rewarding insights, I think, but it’s not what we’re doing. And it often goes wrong, actually, in subgroup analyses, where people do improper subgroup analyses, where the subgroup is based on something that happened after randomization. And that’s, essentially, what you’d naively call a mug’s game,† yes, just not doing it right.

ALISA STEPHENS: Any other comments? Yes. Right here.

DAVID WRIGHT: David Wright, AstraZeneca. Final question. Don’t worry. And then, I’ll go back home. But in terms of – I thought I was having a day off from ICH E9(R1) and the definition of estimands, so thank you, Ting, for raising that very important point. And then just a reaction to, Laura, when you were talking about setting a project to students, and they’re saying how difficult it is, but this is what statisticians do all the time in terms of consultation, and it’s an amazing skill to develop. But then that is – so challenge you to say – when you said, “Well, somebody else told me what that was,” that’s not true. We work in a team environment. And if you don’t understand why that’s the clinical question of interest, something’s gone wrong. So it has to be done together with lots of people. And we all have a role to play. And then, obviously, you can pitch in and say, “In this case, we can’t do that trial. We can do this,” or, “We can’t measure that. We can do that.” But that’s all part of our skillset, and that’s why the estimand framework helps with all of that, which you’ve just said – talking about using DAGs to do that. But that’s sort of related to thinking about things that happen after randomization.

LAURA BALZER: Great. I agree. It’s all about team science. We each bring something into the framework. I, alone as the statistician, don’t come up with the big question. Instead, as a team, we work together for hours and days to try to figure out study design, the estimand, and the potential sources of bias. I think the estimand framework and causal models are really helpful to figure out, “Here’s what we want to do. How do we get there balancing all these constraints?”– while incorporating different perspectives from the statisticians, epidemiologists, MDs, implementing partners, and most importantly, our patients and our communities.

DAVID WRIGHT: Thank you.

UNIDENTIFIED AUDIENCE PARTICIPANT: Thank you. I have a question about calculating the variance for the AIPW, that’s connected to the G-computation. So my question is whether the estimators and variance calculation should consider the different randomization scheme.

TING YE: Yes, it depends on the randomization scheme. And so the variance should appropriately account for that to take advantage of that. Otherwise, you can be conservative and lose power.

UNIDENTIFIED AUDIENCE PARTICIPANT: Say you have a stratified randomization, but we don’t account for that. We just treat it as simple randomization. So how does it affect the variance?

TING YE: So if the data are generated from stratified randomization, but the variance is calculated under simple randomization, then it can be too large and can lead to too wide confidence intervals and conservative inference.

ALISA STEPHENS: Further questions or comments?

DEVAN MEHROTRA: On the last point, let’s say you stratify by region, and when the data come in, you find that the responses were really not influenced in any meaningful way by region. So I can understand if you stratify by a truly prognostic factor and you then ignore that in the analysis, you will suffer the consequence. But if you stratify with something that people want to stratify because they’ve done it before or because my clinical colleague wants to see that, but if it doesn’t turn out to be prognostic, then presumably that’s not going to – ignoring that in the analysis is not going to hurt you.

TING YE: That’s a very good point. Thanks for raising that. So if a variable is not prognostic, it’s not related to the outcome at all, then ignoring that in the analysis will not lead to conservative variance. For the region variable, sometimes it might be a surrogate of just differences between regions or sites. In those cases, then perhaps we can also include region.

BINGKAI WANG: I’d like to add that many times this needs to be pre-specified. So you don’t really know whether this is going to be prognostic or not. Right?
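The points above about stratified randomization can be checked with a small Monte Carlo sketch. It is an illustration under simplifying assumptions (two equal-sized strata, 1:1 randomization within stratum, no treatment effect, unadjusted difference in means), and the function name `simulate` and its parameters are hypothetical.

```python
import random
import statistics as st

def simulate(stratum_effect, n=200, reps=1000, seed=0):
    """Return (empirical SD of the difference in means, average naive SE)."""
    rng = random.Random(seed)
    estimates, naive_ses = [], []
    for _ in range(reps):
        z = [i % 2 for i in range(n)]  # two equal-sized strata
        a = [0] * n
        for s in (0, 1):  # randomize 1:1 within each stratum
            idx = [i for i in range(n) if z[i] == s]
            rng.shuffle(idx)
            for i in idx[: len(idx) // 2]:
                a[i] = 1
        # Outcome depends on the stratum only through stratum_effect
        y = [stratum_effect * z[i] + rng.gauss(0, 1) for i in range(n)]
        y1 = [y[i] for i in range(n) if a[i] == 1]
        y0 = [y[i] for i in range(n) if a[i] == 0]
        estimates.append(st.fmean(y1) - st.fmean(y0))
        # Naive SE: the usual two-sample formula that ignores stratification
        naive_ses.append((st.variance(y1) / len(y1) + st.variance(y0) / len(y0)) ** 0.5)
    return st.stdev(estimates), st.fmean(naive_ses)

sd_prog, se_prog = simulate(stratum_effect=2.0)  # prognostic stratification factor
sd_null, se_null = simulate(stratum_effect=0.0)  # non-prognostic stratification factor
print(se_prog / sd_prog, se_null / sd_null)
```

When the stratification factor is prognostic, the naive standard error in this sketch overstates the true variability, giving conservative inference; when the factor is unrelated to the outcome, the two agree closely.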

NICHOLAS SEEWALD: I have a comment from the online audience. John Tamaresis from Stanford says, “Here’s a real-world set of trials that can be used for comparisons that Dr Harrell wants to see. CAR T-cell therapies for diffuse large B-cell lymphoma because PET scans are the standard imaging used to assess disease burden, plus other omic-based methods that monitor circulating cancerous DNA, plus a large number of exploratory measurements.” Got a thumbs up.

ALISA STEPHENS: Thank you. Next. Any other comments from the floor?

PAMELA SHAW: I think I might share this question with someone online. We were thinking about a comment earlier about what I think is constrained randomization, because I do come from the world where re-randomization is you fail therapy and you get re-randomized to two other therapies. So I was really confused. But for this constrained optimization – or randomization, excuse me – perhaps, if there is a small number of communities or an unusual alignment of the covariates, you could wind up in some strange situation where it turns out the best randomizations are really actually setting up some kind of confounding, where certain people are always going to be in the same groups because of how their covariates affect the balance. And I just wondered if you had encountered any side effects of doing this kind of procedure under extreme situations of extreme imbalance, small numbers of communities, and if you had any advice to give for that situation. And the question online was, when that happens, are you worried about bias, or more variance, when these things break down, if they break down?

BINGKAI WANG: Thanks for the question. The question is related to the previous question, and also to Dylan’s comment. When we have really strong control on the covariates, it’s possible that we can end up with a very weird treatment allocation where some outlier is always in one group and others in the other group. This is also possible. So that’s why we never want to push the control too hard, but still leave some space so that there is a reasonably large randomization set from which we can choose the randomized allocation. And I don’t think it will impact the validity of this design because overall, it’s still randomized. It still has the desired causal interpretation of treatment effect. So I think that part is fine.

UNIDENTIFIED AUDIENCE PARTICIPANT: Just to mention that my mentor, Professor Moulton, has a paper that demonstrated the effect depending on how closely correlated the covariates are, where if you have a design that’s not valid in that scenario, there is an effect on Type I error, but it’s not that huge.

ALISA STEPHENS: I actually have a question regarding the properties that you showed of covariate-constrained randomization and the specific way that you described implementing it. I’ve seen covariate-constrained randomization implemented in different ways. I think the way that you described was you generate an allocation, you check if it meets your thresholds, and if it does, you stop.

BINGKAI WANG: Yes.

ALISA STEPHENS: Right? Another way I’ve seen it implemented is you generate a large number of possible allocations, and you rank them, and then, you randomly choose among some top candidates – and so could you talk about whether your results would apply to the second way of implementing?

BINGKAI WANG: They are the same.

ALISA STEPHENS: Okay.

BINGKAI WANG: You can see the way you did this sequentially doing the way you do. So they are, asymptotically, the same.

ALISA STEPHENS: Thank you.

UNIDENTIFIED AUDIENCE MEMBER: But if you stop as soon as you find the one that’s satisfactory, how do you know what else you could have found? Because, I mean, it could go in all kinds of ways.

BINGKAI WANG: Yeah. But it’s still part of the acceptable randomization allocation, right?

UNIDENTIFIED AUDIENCE MEMBER: Within that constraint. Okay.

BINGKAI WANG: Within that constraint. You can think of it as rejection sampling, actually. You still sample from the distribution.

ALISA STEPHENS: Your ordering

UNIDENTIFIED AUDIENCE MEMBER: Is it not less biased

ALISA STEPHENS: Right. So there’s another piece of information in the second implementation in the sense that you

BINGKAI WANG: After you order this, you still randomly pick one from

ALISA STEPHENS: Yes. That’s true.

BINGKAI WANG: So then that’s the same.

UNIDENTIFIED AUDIENCE MEMBER: Oh, you don’t stop. You’re saying you do still continue to generate a number of them and then select among the ones that are below the threshold?

BINGKAI WANG: So there are two ways. Basically, one way is you do it sequentially. You do it once and generate another one if else, right? You can do it sequentially, or you can do it all at once. You generate 10,000 and randomly pick one, right? So there are two ways. And my argument is they are actually equivalent. Yeah. So we can discuss offline why they are actually equivalent. Yeah.
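The two implementations described in this exchange can be sketched as follows. This is an illustration, not the speaker's code: the balance metric (absolute difference in covariate means), the threshold, and all function names are assumptions. The batch version keeps every candidate that meets the threshold and picks one at random; it does not implement the "rank and choose among the top candidates" variant.

```python
import random

def imbalance(alloc, x):
    """Absolute difference in covariate means between arms (hypothetical metric)."""
    treated = [xi for ai, xi in zip(alloc, x) if ai == 1]
    control = [xi for ai, xi in zip(alloc, x) if ai == 0]
    return abs(sum(treated) / len(treated) - sum(control) / len(control))

def random_allocation(n, rng):
    """One unconstrained 1:1 allocation of n units."""
    alloc = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(alloc)
    return tuple(alloc)

def sequential_draw(x, threshold, rng):
    """Redraw until the allocation meets the balance threshold (rejection sampling)."""
    while True:
        alloc = random_allocation(len(x), rng)
        if imbalance(alloc, x) <= threshold:
            return alloc

def batch_draw(x, threshold, rng, batch=2000):
    """Generate many allocations at once, keep the acceptable ones, pick one at random."""
    candidates = [random_allocation(len(x), rng) for _ in range(batch)]
    acceptable = [c for c in candidates if imbalance(c, x) <= threshold]
    return rng.choice(acceptable)  # raises IndexError if no candidate is acceptable

# Hypothetical cluster-level covariate for eight communities
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rng = random.Random(0)
print(sequential_draw(x, 1.0, rng), batch_draw(x, 1.0, rng))
```

In both functions every unconstrained allocation is equally likely before the check, so each returns a uniform draw from the set of allocations that satisfy the threshold, which is the sense in which the two procedures can be regarded as equivalent.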

ALISA STEPHENS: Thank you. I think we have time for one or two more comments if there are. Courtney Schiffman from Genentech.

COURTNEY SCHIFFMAN: Ting, but anyone else on the panel as well who knows the answer to this, we know that for continuous outcomes – let’s say you’re estimating the average treatment effect of the difference in treatment arm means. For continuous outcomes, if there is treatment effect heterogeneity, meaning specifically, if there is a – if there is not a constant absolute delta across the baseline covariate, we know this can dampen the precision gains you get from adjusting for that baseline covariate. Do you have any such intuition if this principle holds for binary outcomes and how we can be thinking about this when we’re not in a continuous outcome setting?

TING YE: That’s a very interesting question. So just to say it back: is it that when the delta is heterogeneous, considering the average treatment effect (ATE) using ANCOVA could hurt the precision?

COURTNEY SCHIFFMAN: Not hurt the precision. You just don’t get as much precision gains as if you had an absolute treatment delta, for example – a constant absolute treatment delta, sorry.

TING YE: I see. I don’t know if there are results for binary endpoints – I guess, for binary endpoints, it’s dependent on the scale of the contrast of treatment effect you’re interested in. But that would be an interesting question to look at either empirically or analytically.

COURTNEY SCHIFFMAN: Thank you.

LAURA BALZER: I would just add, echoing a recurrent theme: we want to be very clear about what we’re estimating. I think you’re saying we’re going after the ATE – not a conditional effect. We’re going to use an adjusted estimator that returns a marginal estimate of that, so G-computation, AIPW, double debiased machine learning, TMLE, whatever you want. Am I rephrasing your question correctly: there’s a theoretical result that said if it’s a constant treatment effect, we should always gain efficiency, but if there’s a heterogeneous one, then we need to include the interaction term in that working regression to gain efficiency?

COURTNEY SCHIFFMAN: Not that you need to necessarily include the interaction term in the working regression, but just that we know that if such treatment-effect heterogeneity exists in a continuous setting, we know to expect not as much precision gains from adjusting for that covariate, regardless of the form.

LAURA BALZER: I see what you’re saying. Again, I think this is the beauty of the adaptive pre-specification framework: we can really let the data speak and figure out which approach is going to give you the most precision because usually we don’t know if there’s going to be a treatment effect that’s constant or heterogeneous. Sometimes, it’s going to be something that we cannot write down in a pretty little GLM. So I would advocate for pre-specifying what you want to do (what your target estimand is) and then getting the most precision for it by using the theory of empirical efficiency maximization.

COURTNEY SCHIFFMAN: Thank you very much.

ALISA STEPHENS: And we have a comment here from –

DEVAN MEHROTRA: I know that in the presentations in the morning and in the afternoon, there was mention of G-computation and, if you’re interested in a difference in proportions, PA minus PB. I’ve never quite understood the following, so maybe the experts could please educate me. So if you have a large enough sample, why do I need to go through G-computation and go through a logistic regression model and then somehow back-transform to the probability scale and get PA minus PB? What if I just treat the zero-one data as being continuous and just apply linear-model thinking and do everything? And my experience over the years has been it worked beautifully. I don’t have to go through G-computation, use bootstrapping, use one scale, and come back to the other. So I’m wondering, in all the theoretical work that you have done and in your practice, have you considered just keeping it simple and forgetting about – of course, that will eliminate 99% of the papers on this topic, but I’m just wondering if you’ve thought about it and what your experience has been.

BINGKAI WANG: I can comment on it. I totally agree with you, sir. And the consideration is if you fit a linear model for binary outcome, the model is misspecified. It is okay. It’s valid. But it’s misspecified. But if we fit a logistic regression, it can be correctly specified. If it is correctly specified, then it’s going to be efficient. So that’s my thinking why we may talk about logistic regression for binary outcome, because it fits the data better.

LAURA BALZER: I would like to make a more general comment on why we want to use substitution or plug-in estimators, like G-computation, especially for binary outcomes. You’re guaranteed to respect the parameter space. I’ve done a lot of work with rare outcomes, and if you just use estimating equation-based approach or linear regression, you can get really weird answers. Even with augmented IPW, I’ve gotten negative probabilities. So I think that there’s a theoretical benefit as well as a real-world one. We want to respect the known bounds; a probability should be between zero and one.

TING YE: One thing I want to add is – so they are all in the framework of G-computation or AIPW, either linear or logistic regression. They basically all fit a model and then do prediction for everyone under treatment or under control. So, echoing all previous points, we essentially want to find the best model. That’s also the beauty of disentangling the estimand and the model. Find the best fitting model to do the G-computation and then get the PA minus PB.
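The "fit a model, predict for everyone under each arm, then average" recipe can be sketched as below for a binary outcome. This is a minimal illustration, not the panelists' implementation: it assumes a main-effects logistic working model with one covariate, fitted by a hand-written Newton-Raphson routine, and all function names and the simulated data are hypothetical. No variance estimate is computed.

```python
import math
import random

def expit(z):
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def solve(a, b):
    """Solve a small linear system a @ x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logistic(rows, y, iters=25):
    """Newton-Raphson for logistic regression; each row includes the intercept term."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for r, yi in zip(rows, y):
            p = expit(sum(b * v for b, v in zip(beta, r)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * r[i]
                for j in range(k):
                    hess[i][j] += w * r[i] * r[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def g_computation(a, x, y):
    """Return the model-based marginal risks (treated, control) averaged over everyone."""
    beta = fit_logistic([[1.0, ai, xi] for ai, xi in zip(a, x)], y)
    p_treated = [expit(beta[0] + beta[1] + beta[2] * xi) for xi in x]
    p_control = [expit(beta[0] + beta[2] * xi) for xi in x]
    return sum(p_treated) / len(x), sum(p_control) / len(x)

# Hypothetical simulated trial
rng = random.Random(0)
n = 1000
a = [i % 2 for i in range(n)]
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1 if rng.random() < expit(-1 + 1.5 * ai + xi) else 0 for ai, xi in zip(a, x)]
p_a, p_b = g_computation(a, x, y)
print(p_a - p_b)
```

Because each prediction passes through the logistic link, both averaged risks lie between zero and one by construction, which is the "respect the parameter space" property mentioned above; a linear working model plugged into the same recipe carries no such guarantee.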

DEVAN MEHROTRA: Thank you.

ALISA STEPHENS: Thank you. With that, we have reached time. I’d like to thank all of our afternoon presenters and discussants for a wonderful afternoon of very enriching and just very engaged discussion. Thank you all.

JEFFREY MORRIS: Thank you, everybody, for coming and making this a really great conference. So despite the growing availability and use of observational studies and emulated trials and such, clinical trials remain the cornerstone of modern evidence-based medicine. In today’s climate, in which science faces increasing scrutiny in society, maintaining scientific rigor in biomedical research is more crucial than ever. Our conference has convened statisticians from a wide range of sectors (pharmaceutical and biotech industries, academia, and government) to engage with both the theoretical and the practical aspects of clinical trials across diverse contexts. The focus of this year’s conference, covariate adjustment, has long been recognized as a means to improve trial efficiency, although it’s not always been optimally applied in practice. Recent innovations have expanded the methodological toolkit, offering improved efficiency and robustness. These advances increasingly draw on new developments in machine learning, artificial intelligence, causal inference, and other areas of statistics. Theory and practice are advancing hand-in-hand, where real-world trial complexities drive the development of new methods, while theoretical insights offer the structure and justification needed for their effective application.

This conference has offered a comprehensive look at the current landscape of covariate-adjusted design and analysis in clinical trials. I will briefly summarize some of the discussion. In the morning, Stuart Pocock gave a talk that provided a broad overview of covariate adjustment in clinical trials. He discussed some commonly used methods, explained the rationale behind their use, and presented case studies where covariate adjustment had a tangible impact on study conclusions by enhancing efficiency. He also touched on regulatory considerations, which were described in more depth in a later talk by Daniel Rubin. Kelly Van Lancker then demonstrated how covariate adjustment can improve efficiency in two specialized trial designs: group sequential and information adaptive designs. In group sequential trials, maintaining the independent increments property requires adaptation, while in information adaptive designs, the timing of the interim and the final analyses can be based on fixed information levels, which allows the design to benefit from efficiency gains.

For our third talk, Anqi Zhao addressed covariate adjustment in randomized experiments with missing outcomes and covariates. She demonstrated that even in randomized settings, covariate adjustment can yield efficiency gains, but that missing data often hinders its use in practice. She compared approaches for the partially missing observation setting and showed that the regression approach has advantages when the missingness is completely at random and the outcome model is truly linear in nature. More generally, propensity score weighting offers advantages over covariate adjustment when dealing with the missingness. Then, Daniel Rubin presented current FDA guidance on covariate adjustment in RCTs for drugs and biologics. He talked about key guidance, including the importance of pre-specifying covariates, number of covariates relative to sample size, and settings in which adapted standard errors should be computed, with a discussion of both linear and non-linear models. He also demonstrated results on the efficiency benefit and robustness of ANCOVA and mentioned areas requiring further research.

Then, we had an interesting panel discussion hitting on a number of topics, including interaction terms, prognostic indices, missing data concerns, and covariate selection, including the issue of using the study data to select which covariates to include. In our afternoon session, Ting Ye discussed various methods for covariate adjustment and covariate-adaptive designs, focusing on the doubly robust, augmented inverse propensity weighted estimator, and mentioned that these methods are available in the RobinCar package on CRAN. Then Bingkai Wang talked about re-randomization, a flexible approach for covariate-constrained randomization to ensure balance, and discussed asymptotic properties of various estimators, showing that re-randomization can improve precision and that covariate adjustment can provide further efficiency gains. Our final speaker, Laura Balzer, discussed the use of adaptive, pre-specified TMLE approaches using machine learning for data-adaptive adjustment for covariates to maximize the efficiency gain. This showed how machine learning can be incorporated in a valid fashion in these frameworks. Importantly, she applied these approaches to a large community-level cluster randomized HIV trial in Uganda and Kenya to demonstrate the practical utility.

We concluded with a nice discussion on various topics, hitting on the incorporation of machine learning in clinical trials, questions about re-randomization, and the non-representativeness of clinical trial participants relative to the general population. Collectively, these sessions offered a comprehensive view of covariate adjustment covering theoretical foundations, methodological approaches, practical applications, and regulatory perspectives. I want to give thanks to the program committee for putting together such an outstanding program. I was amazed at how tight the topic was, and yet how diverse the talks were and how well they pulled together into a coherent whole. The program committee included Yimei Li, Wei-Ting Hwang, Nick Seewald, Alisa Stephens Shields, Susan Ellenberg, Devan Mehrotra, Pam Shaw, and Michael Proschan. And special thanks to Mary Putt for her leadership as workshop chair.

We want to recognize the many staff members without whose contributions this workshop couldn’t have been possible. Thanks to the PSOM Media Technology & Production team, including Dennis Contini, Joe Lavin, and Ray Rollins. And to our Division of Biostatistics staff, including Joyce Jones and Cathy Vallejo for a great deal of behind-the-scenes work in planning, organizing, and running the conference. And special thanks to Marissa Fox, who, as Associate Director of Biostatistics, is my right hand, and Mary’s also in planning this workshop. Thanks, everybody, for coming and for the great discussion that made for an outstanding workshop!

Footnotes

ORCID iDs

Frank Harrell

Dylan S Small

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Notes

References

Tran, Petersen, Schwab, et al. Robust variance estimation and inference for causal effect estimation. J Causal Inference 2023, https://www.degruyterbrill.com/document/doi/10.1515/jci-2021-0067/html (2023, accessed 1 January 2026).

Petersen, van der Laan. Causal models and learning from data: integrating causal modeling and statistical estimation. Epidemiology 2014; 25(3): 418–426.

Makhema, Wirth, Holme, et al. Universal testing, expanded treatment, and incidence of HIV infection in Botswana. N Engl J Med 2019; 381(3): 230–242.

Havlir, Balzer, Charlebois, et al. HIV testing and treatment with the use of a community health approach in rural Africa. N Engl J Med 2019; 381(3): 219–229.

Balzer, van der Laan, Ayieko, et al. Two-stage TMLE to reduce bias and improve efficiency in cluster randomized trials. Biostatistics 2023; 24(2): 502–517.