Subgroup balancing propensity score

Abstract

This paper concerns estimation of subgroup treatment effects with observational data. Existing propensity score methods are mostly developed for estimating the overall treatment effect. Although the true propensity scores balance covariates in any subpopulation, the estimated propensity scores may result in severe imbalance in subgroup samples. Indeed, subgroup analysis amplifies a bias-variance tradeoff, whereby increasing complexity of the propensity score model may help to achieve covariate balance within subgroups, but it also increases variance. We propose a new method, the subgroup balancing propensity score, to ensure good subgroup balance as well as to control the variance inflation. For each subgroup, the subgroup balancing propensity score chooses to use either the overall sample or the subgroup (sub)sample to estimate the propensity scores for the units within that subgroup, in order to optimize a criterion accounting for a set of covariate-balancing moment conditions for both the overall sample and the subgroup samples. We develop two versions of the subgroup balancing propensity score, corresponding to matching and weighting, respectively. We devise a stochastic search algorithm to estimate the subgroup balancing propensity score when the number of subgroups is large. We demonstrate through simulations that the subgroup balancing propensity score improves the performance of propensity score methods in estimating subgroup treatment effects. We apply the subgroup balancing propensity score method to the Italy Survey of Household Income and Wealth (SHIW) to estimate the causal effects of having a debit card on household consumption for different income groups.

Keywords

Covariate balance, bias-variance tradeoff, causal inference, matching, weighting, stochastic search, subgroup analysis

1 Introduction

A central goal in comparative effectiveness research is to estimate the causal effect of a treatment unconfounded by differences between characteristics of subjects assigned to alternative treatment conditions. Comparisons between groups can be biased when the groups are unbalanced with respect to confounders. Propensity score methods¹ have been widely used as a robust approach to achieve covariate balance and draw causal inferences in observational studies.²,³ The propensity score is the probability that a unit is assigned to a given treatment condition given its observed covariates; balance of the propensity score leads to balance of the multivariate covariates.

Causal inference has traditionally focused on average treatment effects for the overall population. However, heterogeneity in treatment effects across different subpopulations is the norm rather than the exception in medical research. Patients with the same medical condition but different age, gender, race and risk profile usually respond to the same treatment differently.⁴ Research on heterogeneous treatment effects (HTE) has received increasing attention in recent years, where the goal is to identify the subpopulations who are the most or the least responsive to a treatment. In this paper, we focus on a specific type of HTE analysis, namely propensity-score-based subgroup analysis (SGA), which aims at estimating the causal effects within subgroups that are pre-defined by one or several observed covariates. This is in contrast to HTE studies that aim to identify the subpopulations in a data-driven fashion, as do many of the recently developed machine learning HTE methods.⁵,⁶ The routine procedure of such subgroup analysis is:

1. Estimate the propensity scores based on the full study sample.

2. Calculate the treatment effects in each of the subgroups defined by one variable (e.g. the male and female subgroups defined by “sex”), via either matching or weighting, based on the propensity scores estimated in Step 1.

3. Repeat Step 2 for each variable, one at a time, in a set of covariates (e.g. sex, age) pre-selected by the investigators.

Though simple, this type of subgroup analysis is prevalent in medical research. For example, a literature review found that, among the 16 propensity-score-based observational comparative effectiveness analyses published in the Journal of the American Medical Association (JAMA) in 2017, half reported this type of “one-variable-at-a-time” subgroup analysis.

Theoretically, we can show that in addition to balancing covariates for the overall population, the true propensity score also balances covariates for any covariate-defined subpopulation (see Section 2), providing the basis for the above subgroup analysis. However, in practice, the propensity scores are usually unknown and must be estimated from the study sample, e.g. through a logistic model or other flexible models fitted to the overall sample. Because the estimated propensity scores rarely match the true propensity scores exactly, severe covariate imbalance in subgroups is not uncommon, leading to potential bias in estimating the subgroup causal effects. Subgroup analysis amplifies a bias-variance tradeoff, whereby increasing complexity of the propensity score model may help to achieve covariate balance within subgroups, but it also increases variance. More specifically, if we fit a propensity score model to the whole sample, we may obtain good overall covariate balance but severe imbalance in some subgroups; conversely, if we fit a separate propensity score model within each subgroup of interest, we would obtain good subgroup balance, but the subsequent treatment effect estimates will have larger variance because of the reduced subgroup sample size. Somewhat surprisingly, despite the prevalent use of subgroup analysis in comparative effectiveness research, to our knowledge, there has been little if any research on adapting propensity score methods to subgroup analysis in observational studies. For example, none of the aforementioned JAMA articles that reported subgroup analysis checked subgroup covariate balance or acknowledged the issue.

In this article, we propose the subgroup balancing propensity score (SBPS) method to ensure good subgroup balance as well as to control the variance inflation. The SBPS is a hybrid approach that combines propensity score estimation from the overall sample and the subgroup samples. For each subgroup, we choose to fit a propensity score model using either the overall sample or only the subgroup sample. The combination of estimation samples is chosen by optimizing a criterion that accounts for a set of covariate-balancing moment conditions for both the overall sample and the subgroup samples. Conceptually, the proposed SBPS method extends the covariate balancing propensity score (CBPS) method by Imai and Ratkovic,⁷ which was developed in the context of estimating overall treatment effects. Directly extending CBPS to the subgroup analysis faces two challenges when the number of covariate-balancing moment constraints grows as the number of subgroups grows. First, the computational cost is higher since solving for the optimum given the constraints becomes increasingly difficult. Second, the propensity score model becomes more complex, leading to increasing variance. In SBPS, we devise a stochastic search algorithm that can effectively handle the computational demand when the number of subgroups is large. It also effectively controls variance inflation because the resulting propensity score model is less complex than a direct extension of CBPS.

Section 2 defines the subgroup causal estimands and presents the estimators based on propensity score matching and weighting. Section 3 presents the SBPS method and the stochastic search algorithm for estimating the scores. Section 4 conducts simulation studies to compare the performance of SBPS with several existing methods. In Section 5, we apply the SBPS method to the Italy Survey of Household Income and Wealth (SHIW) data to evaluate the effects of debit card possession on monthly household consumption for different income groups. We also conduct a simulation based on the SHIW data to gain more insight into the performance of the proposed method. Section 6 concludes.

2 Estimating subgroup causal effects

Consider a sample of N units, consisting of N_t treated and N_c control units. Suppose that the population can be partitioned into subpopulations based on certain covariates and the interest is to estimate the treatment effects for these subpopulations. Specifically, in this paper we focus on subpopulations defined by a categorical covariate G. For each unit i, denote G_i the indicator of its subgroup, with $G_{i} \in \{1, \dots, R\}$. Also denote Z_i the treatment indicator, $X_{i} = (X_{i 1}, \dots, X_{iK})'$ the covariates excluding G, and Y_i(z) the potential outcome corresponding to treatment z for $z \in \{0, 1\}$. For each unit, only the potential outcome under the assigned treatment, $Y_{i} = Y_{i} (Z_{i})$, is observed and the other potential outcome is missing. Let N_r denote the number of units in subgroup r, with $N_{r, t}$ treated units and $N_{r, c}$ control units for $r = 1, \dots, R$. We are interested in estimating the subgroup ATTs (average treatment effects on the treated)

τ_{r} = E [Y (1) - Y (0) | G = r, Z = 1], r = 1, \dots, R

(1)

The propensity score is the probability of receiving the treatment given the covariates, which include X and the subgroup label G: $e (X, G) = \Pr (Z = 1 | X, G)$. Rosenbaum and Rubin¹ showed that the propensity score is a balancing score in that the covariates and the subgroup label are independent of the treatment variable conditional on the propensity score: $Z ⊥ \{X, G\} | e (X, G)$.

We also assume the standard strong ignorability assumption: (i) there is overlap in the propensity score distribution between the treated and control groups, that is, $0 < e (X, G) < 1$ for any X and G; (ii) the treatment assignment is unconfounded given the covariates: $Z ⊥ \{Y (1), Y (0)\} | \{X, G\}$. Then we have the following proposition regarding the balancing properties of propensity scores in subgroups (the proof of Proposition 1 is given in Appendix 1).

Proposition 1

The propensity score balances the distribution of X in each subgroup defined by the categories of a covariate G

X ⊥ Z | \{G = r, e (X, G)\}, for r = 1, \dots, R

(2)

Matching and weighting are the two most common strategies in estimating causal effects based on propensity scores. Below we discuss propensity score matching and weighting in estimating subgroup causal effects, particularly the subgroup ATTs. In matching, for each treated unit, one finds a matched control unit from the same subgroup using propensity score as the distance metric. Within each subgroup in the matched population, the distributions of the propensity score are expected to be the same between the treated and control groups, that is, for $r = 1, \dots, R$ (we use $f (\cdot)$ to denote a generic probability distribution)

f (e (X, G) | G = r, Z = 1) = f (e (X, G) | G = r, Z = 0)

(3)

Combining equations (2) and (3), for $r = 1, \dots, R$ , we have

\begin{matrix} f (X | G = r, Z = 1) = \int f (X | G = r, Z = 1, e (X, G)) f (e (X, G) | G = r, Z = 1) de (X, G) \\ = \int f (X | G = r, Z = 0, e (X, G)) f (e (X, G) | G = r, Z = 0) de (X, G) = f (X | G = r, Z = 0) \end{matrix}

(4)

Hence the distribution of X is balanced within each subgroup in the matched population. Since the distribution of G is automatically balanced in the overall matched population, the distribution of X is also balanced in the overall matched population. It is straightforward to show that

τ_{r} = E^{(m)} [Y | G = r, Z = 1] - E^{(m)} [Y | G = r, Z = 0]

(5)

where $E^{(m)}$ denotes taking expectation over the matched population.

We now consider propensity score weighting. In estimating the ATT, each treated unit is weighted by 1 and each control unit is weighted by $w (X, G) = e (X, G) / (1 - e (X, G))$. It is easy to show that the following moment conditions hold

\begin{matrix} M_{k}^{W} \equiv E [Z X_{k} - (1 - Z) \frac{e (X, G)}{1 - e (X, G)} X_{k}] = 0, k = 1, \dots, K; \\ M_{(r)}^{W} \equiv E [Z 1 \{G = r\} - (1 - Z) \frac{e (X, G)}{1 - e (X, G)} 1 \{G = r\}] = 0, r = 1, \dots, R; \\ M_{r, k}^{W} \equiv E [1 \{G = r\} (Z X_{k} - (1 - Z) \frac{e (X, G)}{1 - e (X, G)} X_{k})] = 0, r = 1, \dots, R, k = 1, \dots, K \end{matrix}

(6)

Therefore, the ATT weights balance X for the overall population (reflected by $M_{k}^{W} = 0$), balance G for the overall population (reflected by $M_{(r)}^{W} = 0$), and balance X for the subgroup populations (reflected by $M_{r, k}^{W} = 0$). Note that the ATT weights are a special case of the class of balancing weights⁸,⁹; other members include the inverse probability weights and the overlap weights, each corresponding to a different estimand. We have the following proposition about the identification of the subgroup ATTs based on weighting (the proof of Proposition 2 is given in Appendix 1).

Proposition 2

The subgroup ATTs can be written as

τ_{r} = \frac{E [ZY | G = r]}{\Pr (Z = 1 | G = r)} - \frac{E [(1 - Z) \frac{e (X, G)}{1 - e (X, G)} Y | G = r]}{\Pr (Z = 1 | G = r)}

(7)

where the denominator can also be written as

\Pr (Z = 1 | G = r) = E [(1 - Z) \frac{e (X, G)}{1 - e (X, G)} | G = r]

(8)

The above discussions are in the setting of true propensity scores and population distributions. In observational studies, the propensity scores are usually unknown and have to be estimated from the study sample. Let ${\overset{\land}{e}}_{i}$ denote the estimated propensity score for unit i. With propensity score matching, τ_r can be estimated by a direct comparison

{\overset{\land}{τ}}_{r}^{M} = {\bar{Y}}_{r, t}^{(m)} - {\bar{Y}}_{r, c}^{(m)}

(9)

where ${\bar{Y}}_{r, t}^{(m)}$ and ${\bar{Y}}_{r, c}^{(m)}$ denote the mean observed outcomes for the treated and control units in the matched sample for subgroup r, respectively. We allow replacement and ties in matching to avoid dependence on the order of picking matching candidates. As the bootstrap fails to quantify uncertainty of the matching estimator with replacement,¹⁰ we use the asymptotic variance formula by Abadie and Imbens.¹¹ This can be easily calculated using the Matching package in R. Although the asymptotic formula ignores the uncertainty in estimating the propensity score, it is a good approximation and is computationally efficient. With propensity score weighting, τ_r can be estimated as

{\overset{\land}{τ}}_{r}^{W} = {\bar{Y}}_{r, t} - \frac{\sum_{G_{i} = r, Z_{i} = 0} \frac{{\overset{\land}{e}}_{i}}{1 - {\overset{\land}{e}}_{i}} Y_{i}}{\sum_{G_{i} = r, Z_{i} = 0} \frac{{\overset{\land}{e}}_{i}}{1 - {\overset{\land}{e}}_{i}}}

(10)

where ${\bar{Y}}_{r, t}$ is the mean observed outcome for the treated units in the sample for subgroup r, which is used to estimate the first term in equation (7), and the second term in equation (10) is used to estimate the second term in equation (7). In order to achieve computational efficiency, we also ignore the uncertainty in estimating the propensity score, and derive the variance of the weighting estimator by treating it as an estimator of the coefficient of Z in a weighted linear regression of Y on Z with weights $w_{i} = Z_{i} + (1 - Z_{i}) \frac{{\overset{\land}{e}}_{i}}{1 - {\overset{\land}{e}}_{i}}$.¹² This can be easily calculated using the survey package in R.¹³
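As an illustration of the point estimate in equation (10), the following Python sketch computes the weighted subgroup ATT from observed outcomes, treatment indicators, subgroup labels and already-estimated propensity scores. It is a minimal sketch, not the authors' implementation: the function name `subgroup_att_weighting` and the plain-list inputs are assumptions made for this example, and no variance estimate is produced.

```python
def subgroup_att_weighting(y, z, g, e_hat, r):
    """Point estimate of the ATT in subgroup r, following equation (10).

    y, z, g, e_hat are equal-length sequences of outcomes, 0/1 treatment
    indicators, subgroup labels and estimated propensity scores.
    (Hypothetical helper for illustration only.)
    """
    treated = [yi for yi, zi, gi in zip(y, z, g) if gi == r and zi == 1]
    # Each control unit carries the ATT weight e / (1 - e).
    controls = [(yi, ei / (1.0 - ei))
                for yi, zi, gi, ei in zip(y, z, g, e_hat)
                if gi == r and zi == 0]
    if not treated or not controls:
        raise ValueError("subgroup needs treated and control units")
    mean_treated = sum(treated) / len(treated)
    weighted_control_mean = (sum(w * yi for yi, w in controls)
                             / sum(w for _, w in controls))
    return mean_treated - weighted_control_mean
```

Normalizing by the sum of the control weights, rather than by the number of treated units, mirrors the denominator in equation (10) and keeps the weighted control mean within the range of the observed control outcomes.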

3 Subgroup balancing propensity score

Conventionally the propensity score is estimated by fitting a logistic regression model or other flexible models such as boosting¹⁴ to the overall (full) sample. Here we use the logistic model as an example

logit [e (X, G)] = \sum_{r = 1}^{R} δ_{r} 1 \{G = r\} + α^{⊤} X

(11)

The model is usually assessed based on some covariate balancing criterion in the overall study sample.¹⁵ However, the estimated propensity scores often do not give exact balance in the overall sample, and the imbalance can be particularly severe in subsamples. Consequently, estimates of subgroup effects based on these estimated propensity scores may be biased. An alternative approach is to estimate the propensity score separately within each subgroup, for example, by fitting a logistic model to each subgroup sample

logit [e (X, G)] = δ_{r} + α_{r}^{⊤} X, r = 1, \dots, R

(12)

Matching or weighting based on the subgroup-fitted propensity scores would lead to better balance of covariates within each subgroup and thus smaller biases in causal estimates. However, due to the smaller sample sizes of the subsample, the ensuing causal estimates usually have larger variances, embodying a bias-variance tradeoff.

To address this tradeoff, we propose the SBPS as a hybrid approach to adaptively choose between the overall sample fit and the subgroup sample fit by optimizing a criterion that accounts for covariate balance for both the overall sample and the subgroup samples. Below we introduce two such criteria, one based on matching and one based on weighting.

3.1 SBPS: Matching-based criterion

When propensity score matching within each subgroup is conducted, treatment assignment can be treated as random for the overall matched population and for the matched population for each subgroup r. Hence we have the following set of moment conditions based on the standardized mean difference (SMD) measure¹⁶

\begin{matrix} M_{k}^{M} \equiv E^{(m)} [\frac{Z X_{k} - (1 - Z) X_{k}}{σ_{k, t}}] = 0, k = 1, \dots, K; \\ M_{r, k}^{M} \equiv E^{(m)} \{\frac{1 \{G = r\} [Z X_{k} - (1 - Z) X_{k}]}{σ_{r, k, t}}\} = 0, r = 1, \dots, R, k = 1, \dots, K \end{matrix}

(13)

where $σ_{k, t}$ and $σ_{r, k, t}$ denote the standard deviation of X_k for the treated units in the overall population and in the population for subgroup r, respectively. The condition $M_{k}^{M} = 0$ reflects balancing of X_k for the overall matched population, and the condition $M_{r, k}^{M} = 0$ reflects balancing of X_k for the matched population for subgroup r.

For unit i in the matched sample, let $Z_{i}^{(m)}$ and $G_{i}^{(m)}$ denote the treatment indicator and the subgroup label, let $x_{ik}^{(m)}$ denote the value of X_k, and let $w_{i}^{(m)}$ denote the matching weight, with $w_{i}^{(m)} = 1$ for treated units. Denote $N_{t}^{(m)}$ and $N_{r, t}^{(m)}$ the number of treated units in the overall matched sample and in the matched sample for subgroup r, respectively. Because matching is performed within each subgroup, we have $\sum_{Z_{i}^{(m)} = 0} w_{i}^{(m)} = N_{t}^{(m)}$ and $\sum_{G_{i}^{(m)} = r, Z_{i}^{(m)} = 0} w_{i}^{(m)} = N_{r, t}^{(m)}$ . Let ${\bar{x}}_{k, t}^{(m)} = \sum_{Z_{i}^{(m)} = 1} x_{ik}^{(m)} / N_{t}^{(m)}$ and ${\bar{x}}_{k, c}^{(m)} = \sum_{Z_{i}^{(m)} = 0} w_{i}^{(m)} x_{ik}^{(m)} / N_{t}^{(m)}$ denote the means of $x_{ik}^{(m)}$ for the treated and weighted control units in the overall matched sample, respectively, and let ${\bar{x}}_{r, k, t}^{(m)} = \sum_{G_{i}^{(m)} = r, Z_{i}^{(m)} = 1} x_{ik}^{(m)} / N_{r, t}^{(m)}$ and ${\bar{x}}_{r, k, c}^{(m)} = \sum_{G_{i}^{(m)} = r, Z_{i}^{(m)} = 0} w_{i}^{(m)} x_{ik}^{(m)} / N_{r, t}^{(m)}$ denote the means of $x_{ik}^{(m)}$ for the treated and weighted control units in the matched sample for subgroup r, respectively. Let ${\overset{\land}{σ}}_{k, t}$ and ${\overset{\land}{σ}}_{r, k, t}$ denote the sample standard deviations of X_k for the treated units in the overall sample and in the sample for subgroup r, respectively. The estimates of the moments in equation (13) are

\begin{matrix} {\overset{\land}{M}}_{k}^{M} = \frac{1}{2 N_{t}^{(m)}} [\frac{N_{t}^{(m)} {\bar{x}}_{k, t}^{(m)} - N_{t}^{(m)} {\bar{x}}_{k, c}^{(m)}}{{\overset{\land}{σ}}_{k, t}}] = \frac{1}{2} [\frac{{\bar{x}}_{k, t}^{(m)} - {\bar{x}}_{k, c}^{(m)}}{{\overset{\land}{σ}}_{k, t}}], \\ {\overset{\land}{M}}_{r, k}^{M} = \frac{1}{2 N_{t}^{(m)}} [\frac{N_{r, t}^{(m)} {\bar{x}}_{r, k, t}^{(m)} - N_{r, t}^{(m)} {\bar{x}}_{r, k, c}^{(m)}}{{\overset{\land}{σ}}_{r, k, t}}] = \frac{1}{2} \frac{N_{r, t}^{(m)}}{N_{t}^{(m)}} [\frac{{\bar{x}}_{r, k, t}^{(m)} - {\bar{x}}_{r, k, c}^{(m)}}{{\overset{\land}{σ}}_{r, k, t}}] \end{matrix}

(14)

Here ${\overset{\land}{M}}_{k}^{M}$ reflects balancing of X_k for the overall matched sample, with $| {\bar{x}}_{k, t}^{(m)} - {\bar{x}}_{k, c}^{(m)} | / {\overset{\land}{σ}}_{k, t}$ being the standardized mean difference of X_k in the overall matched sample; ${\overset{\land}{M}}_{r, k}^{M}$ reflects balancing of X_k for the matched sample for subgroup r, with $| {\bar{x}}_{r, k, t}^{(m)} - {\bar{x}}_{r, k, c}^{(m)} | / {\overset{\land}{σ}}_{r, k, t}$ being the standardized mean difference of X_k in the matched sample for subgroup r. These estimates account for the different variances of the covariates and the different sample sizes in the overall matched sample and the matched samples for the subgroups. Therefore, they have comparable variances. The objective function is the sum of squares of these estimates, i.e.

F^{M} = \sum_{k = 1}^{K} ({\overset{\land}{M}}_{k}^{M})^{2} + \sum_{r = 1}^{R} \sum_{k = 1}^{K} ({\overset{\land}{M}}_{r, k}^{M})^{2}

(15)

We consider nearest neighbour matching with caliper¹⁷ to remove possible bias of matching. For each subgroup r, we set the caliper to one-fourth of the standard deviation of the logit of estimated propensity scores for units in this subgroup; each treated unit in this subgroup is matched with a control unit in this subgroup whose logit of estimated propensity score is closest to and within the caliper of that of the former, and units that cannot be matched are dropped from further analysis. We allow replacement and ties in matching; hence the weights for matched treated units are all equal to one, but the weights for matched control units may be smaller or larger than one.

3.2 SBPS: Weighting-based criterion

The second criterion is related to covariate balance for propensity score weighting. For unit i ($i = 1, \dots, N$), let x_ik denote the value of X_k. In the weighted sample, each treated unit is weighted by one and each control unit is weighted by ${\overset{\land}{e}}_{i} / (1 - {\overset{\land}{e}}_{i})$. Let ${\bar{x}}_{k, t}$ denote the mean of x_ik for the treated units in the overall sample, and let ${\bar{x}}_{r, k, t}$ denote the mean of x_ik for the treated units in the sample for subgroup r. The estimates of the moment conditions in equation (6), scaled by the sample standard deviations of the covariates for the treated units, are

\begin{matrix} {\overset{\land}{M}}_{k}^{W} = \frac{1}{N} [N_{t} {\bar{x}}_{k, t} - \sum_{Z_{i} = 0} \frac{{\overset{\land}{e}}_{i}}{1 - {\overset{\land}{e}}_{i}} x_{ik}] / {\overset{\land}{σ}}_{k, t}, \\ {\overset{\land}{M}}_{(r)}^{W} = \frac{1}{N} [N_{r, t} - \sum_{G_{i} = r, Z_{i} = 0} \frac{{\overset{\land}{e}}_{i}}{1 - {\overset{\land}{e}}_{i}}], \\ {\overset{\land}{M}}_{r, k}^{W} = \frac{1}{N} [N_{r, t} {\bar{x}}_{r, k, t} - \sum_{G_{i} = r, Z_{i} = 0} \frac{{\overset{\land}{e}}_{i}}{1 - {\overset{\land}{e}}_{i}} x_{ik}] / {\overset{\land}{σ}}_{r, k, t} \end{matrix}

(16)

These estimates have comparable variances in a similar way to those in the matching-based criterion. The objective function is the sum of squares of these estimates, i.e.

F^{W} = \sum_{k = 1}^{K} ({\overset{\land}{M}}_{k}^{W})^{2} + \sum_{r = 1}^{R} ({\overset{\land}{M}}_{(r)}^{W})^{2} + \sum_{r = 1}^{R} \sum_{k = 1}^{K} ({\overset{\land}{M}}_{r, k}^{W})^{2}

(17)
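To make the weighting-based criterion concrete, the sketch below evaluates the moment estimates in equation (16) and sums their squares as in equation (17). It is an illustrative sketch under stated assumptions rather than the authors' code: the function name `weighting_criterion` and the row-wise list layout of the covariates are hypothetical, the propensity scores are taken as already estimated, and every subgroup is assumed to contain at least two treated units with non-constant covariates so that the standard deviations are defined and non-zero.

```python
from statistics import stdev

def weighting_criterion(x, z, g, e_hat):
    """Value of F^W in equation (17) for given estimated propensity scores.

    x is a list of N covariate rows (each of length K); z, g, e_hat are
    treatment indicators, subgroup labels and estimated propensity scores.
    (Hypothetical helper for illustration only.)
    """
    n = len(z)
    k_dim = len(x[0])
    # ATT weights: 1 for treated units, e / (1 - e) for control units.
    w = [1.0 if zi == 1 else ei / (1.0 - ei) for zi, ei in zip(z, e_hat)]

    def moment(k, idx):
        # Scaled difference between treated and weighted control totals.
        treated_sum = sum(x[i][k] for i in idx if z[i] == 1)
        control_sum = sum(w[i] * x[i][k] for i in idx if z[i] == 0)
        sd_treated = stdev([x[i][k] for i in idx if z[i] == 1])
        return (treated_sum - control_sum) / (n * sd_treated)

    everyone = range(n)
    total = sum(moment(k, everyone) ** 2 for k in range(k_dim))
    for r in sorted(set(g)):
        idx = [i for i in everyone if g[i] == r]
        n_rt = sum(1 for i in idx if z[i] == 1)
        control_weight = sum(w[i] for i in idx if z[i] == 0)
        total += ((n_rt - control_weight) / n) ** 2      # M_(r)^W
        total += sum(moment(k, idx) ** 2 for k in range(k_dim))  # M_{r,k}^W
    return total
```

All moments are divided by the overall N, so a subgroup's contribution shrinks with its share of the sample, as in equation (16).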

The moment conditions for $M_{k}^{W}$ and $M_{(r)}^{W}$ have been used previously by the CBPS method.⁷ The over-identified version of CBPS uses the generalized method of moments (GMM) to incorporate these moment conditions and the score conditions for the maximum likelihood in estimating the propensity score, where the total number of conditions exceeds that of model parameters. The just-identified version of CBPS only uses the moment conditions in estimating the propensity score. In the context of subgroup analysis with the additional RK moment conditions for $M_{r, k}^{W}$, the number of moment conditions grows as the number of subgroups grows. Such an increase in complexity of the propensity score model can lead to both computational difficulty and variance inflation.

3.3 A stochastic search algorithm to estimate the SBPS

In SBPS, for each subgroup we choose between the overall sample fit and the subgroup sample fit. Specifically, for $r = 1, \dots, R$, let S_r = 1 if the propensity scores for units in subgroup r are estimated by fitting the model in equation (11) using the overall sample, and let S_r = 2 if these propensity scores are estimated by fitting the model in equation (12) using the subgroup sample. Let $S = (S_{1}, \dots, S_{R})$. The criterion F^M or F^W can be regarded as a function of S.

If R is small, we can exhaustively search through all 2^R possible combinations for S to find the optimal combination that minimizes F^M or F^W. If R is large, we propose to use the following stochastic search algorithm to find S that minimizes F^M or F^W.

1. Initialize $S_{min}$ to be a vector of all ones, which corresponds to the standard propensity score method (PS) that uses the overall sample fit to estimate propensity scores for all subgroups. Set $F_{min}^{M}$ or $F_{min}^{W}$ to be the corresponding value of F^M or F^W.

2. Repeat the following steps (a) to (d) until the number of repeats is no smaller than L₁, and the minimum value of the objective function encountered in the search, $F_{min}^{M}$ or $F_{min}^{W}$, does not change over L₂ repeats, where L₁ and L₂ are prespecified positive integers.

(a) Randomly permute $\{1, \dots, R\}$ to get a random ordering of the subgroups, $\{A_{1}, \dots, A_{R}\}$.

(b) For $r = 1, \dots, R$, randomly initialize $S_{A_{r}} = 1$ or $S_{A_{r}} = 2$.

(c) Repeat the following step to update S in the order $\{A_{1}, \dots, A_{R}\}$ until there is no change in the elements of S: for $r = 1, \dots, R$, update $S_{A_{r}}$ to the value that gives a smaller value of F^M or F^W while fixing the other elements in S, $\{S_{A_{r'}}, r' \neq r\}$.

(d) If the value of F^M or F^W is smaller than $F_{min}^{M}$ or $F_{min}^{W}$, then set $F_{min}^{M} = F^{M}$ or $F_{min}^{W} = F^{W}$, and set $S_{min} = S$.

When R is large, the results of the stochastic search algorithm depend on the random initial values and the random orders of updates for S. Although this algorithm is not guaranteed to find the globally optimal combination for S, each repeat ends at a locally optimal combination, and the combination returned has a value of F^M or F^W no larger than that of the standard propensity score method, because the search is initialized at the all-ones vector.
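The search described above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the criterion F^M or F^W is abstracted as a user-supplied callable `objective` that maps a list S of ones and twos to a number, and the names `stochastic_search`, `l1`, `l2` and `seed` are assumptions introduced for this example. In practice each call to `objective` would refit or look up the propensity scores implied by S and then evaluate the balance criterion.

```python
import random

def stochastic_search(objective, n_groups, l1=1000, l2=50, seed=0):
    """Random-restart coordinate search over S in {1, 2}^R.

    `objective` is a hypothetical callable standing in for F^M or F^W.
    Returns the best combination found and its objective value.
    """
    rng = random.Random(seed)
    s_min = [1] * n_groups            # all ones: the standard PS fit
    f_min = objective(s_min)
    repeats = 0
    stable = 0                        # repeats since f_min last improved
    while repeats < l1 or stable < l2:
        order = list(range(n_groups))
        rng.shuffle(order)            # random ordering A_1, ..., A_R
        s = [rng.choice((1, 2)) for _ in range(n_groups)]
        f = objective(s)
        changed = True
        while changed:                # sweep until no element changes
            changed = False
            for a in order:
                flipped = s.copy()
                flipped[a] = 3 - s[a]  # switch between 1 and 2
                f_new = objective(flipped)
                if f_new < f:
                    s, f = flipped, f_new
                    changed = True
        if f < f_min:
            s_min, f_min = s.copy(), f
            stable = 0
        else:
            stable += 1
        repeats += 1
    return s_min, f_min
```

Because every accepted switch strictly lowers the objective and there are finitely many combinations, the inner sweep terminates; caching objective values by combination would be a natural refinement when each evaluation is expensive.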

4 Simulations

This section presents some Monte Carlo simulations to examine the performance of SBPS compared to several other methods. Consider two scenarios: (a) the number of subgroups is R = 20, and the number of units in each subgroup is N_r = 100 ($r = 1, 2, \dots, R$); (b) R = 40 and N_r = 50 ($r = 1, 2, \dots, R$). Assume that there are four covariates: $X_{1} \sim N (μ_{r}, 1)$ if G = r, where $μ_{r} = 3 - 3 (r - 1) / (R - 1)$ (i.e. μ_r varies from 3 to 0 with the same decrement 3/(R – 1)); $X_{2} \sim Unif (0, 1)$; $X_{3} \sim N (0, 1)$; $X_{4} \sim Bernoulli (0.4)$. The treatment assignment is generated using the following logistic model

logit [e (X, G)] = \sum_{r = 1}^{R} δ_{r} 1 \{G = r\} + α_{1} X_{1} + α_{2} X_{2} + α_{3} X_{3} + α_{4} X_{4} + α_{5} X_{1}^{2} + α_{6} X_{1} X_{4}

(18)

where $α \equiv (α_{1}, \dots, α_{6}) = (- 1.5, - 0.5, 0.5, - 0.5, 0.5, 0.5)$ and the fixed effects are $δ_{r} = - 1 + 2 (r - 1) / (R - 1)$ (i.e. δ_r varies from –1 to 1 with the same increment 2/(R – 1)). The outcome Y is generated using the following linear model

Y = β_{0} + \sum_{r = 1}^{R} η_{r} [1 \{G = r\} Z] + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{3} + β_{4} X_{4} + β_{5} X_{1}^{2} + β_{6} X_{1} X_{4} + ɛ

(19)

where $(β_{0}, β_{1}, \dots, β_{6}) = (200, 20, 10, 10, 10, - 5, 10)$, the true subgroup ATTs are given by $τ_{r} = η_{r} = - 10 + 20 (r - 1) / (R - 1)$ (i.e. η_r varies from –10 to 10 with the same increment 20/(R – 1)), and the noise term is $ɛ \sim N (0, 1)$. We generate V = 1000 data sets.
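For readers who wish to reproduce the design, the following Python sketch draws one data set from the treatment model (18) and the outcome model (19). It is an illustrative sketch, not the authors' simulation code: the function name `simulate_dataset`, the dictionary record layout and the seeding are assumptions, and the third covariate is taken to be the standard normal one.

```python
import math
import random

def simulate_dataset(n_groups=20, n_per_group=100, seed=0):
    """Draw one simulated data set following models (18) and (19).

    Returns a list of records with subgroup label g, treatment z,
    covariates x, outcome y, true propensity score e and true ATT tau.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    alpha = (-1.5, -0.5, 0.5, -0.5, 0.5, 0.5)
    beta = (200, 20, 10, 10, 10, -5, 10)
    rows = []
    for r in range(1, n_groups + 1):
        frac = (r - 1) / (n_groups - 1)
        mu_r = 3 - 3 * frac          # mean of X1 in subgroup r
        delta_r = -1 + 2 * frac      # subgroup fixed effect
        eta_r = -10 + 20 * frac      # true subgroup ATT
        for _ in range(n_per_group):
            x1 = rng.gauss(mu_r, 1)
            x2 = rng.random()
            x3 = rng.gauss(0, 1)
            x4 = 1 if rng.random() < 0.4 else 0
            terms = (x1, x2, x3, x4, x1 ** 2, x1 * x4)
            logit = delta_r + sum(a * t for a, t in zip(alpha, terms))
            e = 1 / (1 + math.exp(-logit))
            z = 1 if rng.random() < e else 0
            y = (beta[0] + eta_r * z
                 + sum(b * t for b, t in zip(beta[1:], terms))
                 + rng.gauss(0, 1))
            rows.append({"g": r, "z": z, "x": (x1, x2, x3, x4),
                         "y": y, "e": e, "tau": eta_r})
    return rows
```

Calling `simulate_dataset(20, 100)` corresponds to scenario (a) and `simulate_dataset(40, 50)` to scenario (b); repeating the call with different seeds would give the replicated data sets.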

We compare the performance of the following methods.

PS: the standard PS method with model (11) fitted to the overall sample;

PS-X: the PS method with group-covariates interactions, given by model (12);

CBPS: the standard CBPS method with the moment conditions for $M_{k}^{W}$ and $M_{(r)}^{W}$ ;

CBPS-X: the CBPS method with the moment conditions for $M_{k}^{W}, M_{(r)}^{W}$ , and $M_{r, k}^{W}$ ;

GBM: the gradient boosted logistic regression method for propensity score estimation in McCaffrey et al.,¹⁸ with predictors being covariates X and group indicators $1 \{G = r\}$ $(r = 1, \dots, R)$;

C-F: the causal forest method in Wager and Athey,⁶ with predictors being covariates X and group indicators;

SBPS-M: the SBPS method that optimizes F^M;

SBPS-W: the SBPS method that optimizes F^W.

For CBPS, Imai and Ratkovic⁷ used the “continuous updating” GMM estimator of Hansen et al.,¹⁹ which has better finite-sample properties than the usual optimal GMM estimator. However, the “continuous updating” GMM method did not converge within one week in any of our simulations. Therefore, we use the two-step optimal GMM estimator, which is the default estimation method in the R package CBPS. We also find that the just-identified CBPS generally performs similarly to or worse than the corresponding over-identified CBPS, so we only show results for the over-identified CBPS. The causal forest method does not model propensity scores and is instead an outcome-model-based approach. It directly models the outcomes and tries to identify the subpopulations with significant heterogeneous treatment effects in a data-driven fashion.

For all methods, $X = (X_{1}, X_{2}, X_{3}, X_{4}, X_{1}^{2}, X_{1} X_{4})^{⊤}$ if the model is correctly specified, and $X = (X_{1}, X_{2}, X_{3}, X_{4})^{⊤}$ if the model is misspecified. For the PS, CBPS, PS-X, CBPS-X and GBM methods, the propensity score model is fitted to the overall sample to estimate ${\overset{\land}{e}}_{i}$ for all units. For the SBPS-M and SBPS-W methods, to estimate propensity scores ${\overset{\land}{e}}_{i}$ for units in subgroup r, the propensity score model in equation (11) is fitted to the overall sample if S_r = 1, and the propensity score model in equation (12) is fitted to the subgroup sample if S_r = 2. In the stochastic search algorithm, we set L₁ = 1000 and L₂ = 50. In our simulations, different values of L₁ do not make much difference, and larger values for L₂ also do not make much difference. Once ${\overset{\land}{e}}_{i}$ is obtained using a given method, propensity score matching or weighting can be used to estimate τ_r.

We first check the performance of the propensity score methods in balancing covariate distributions in both the subgroup samples and the overall sample. The absolute standardized difference (ASD) of X_k in subgroup r is

D_{r, k} = | \frac{\sum_{G_{i} = r, Z_{i} = 1} x_{ik} w_{i}}{\sum_{G_{i} = r, Z_{i} = 1} w_{i}} - \frac{\sum_{G_{i} = r, Z_{i} = 0} x_{ik} w_{i}}{\sum_{G_{i} = r, Z_{i} = 0} w_{i}} | / \sqrt{{\overset{\land}{σ}}_{r, k, t}^{2} / N_{r, t} + {\overset{\land}{σ}}_{r, k, c}^{2} / N_{r, c}}

(20)

Here w_i is the weight for unit i. In the original sample, w_i equals one for every unit. With estimated propensity scores ${\overset{\land}{e}}_{i}$, w_i = 1 for treated units and $w_{i} = \frac{{\overset{\land}{e}}_{i}}{1 - {\overset{\land}{e}}_{i}}$ for control units. The terms ${\overset{\land}{σ}}_{r, k, t}^{2}$ and ${\overset{\land}{σ}}_{r, k, c}^{2}$ are the unweighted sample variances of X_k for treated and control units in subgroup r. We use $D_{k}^{group} = \max_{1 \le r \le R} D_{r, k}$, the maximum value of $D_{r, k}$ across all subgroups r, to reflect the degree of balance in the subgroups. We also calculate the ASD of X_k in the overall sample, $D_{k}^{overall}$:

D_{k}^{overall} = | \frac{\sum_{Z_{i} = 1} x_{ik} w_{i}}{\sum_{Z_{i} = 1} w_{i}} - \frac{\sum_{Z_{i} = 0} x_{ik} w_{i}}{\sum_{Z_{i} = 0} w_{i}} | / \sqrt{{\overset{\land}{σ}}_{k, t}^{2} / N_{t} + {\overset{\land}{σ}}_{k, c}^{2} / N_{c}}

(21)

where ${\overset{\land}{σ}}_{k, t}^{2}$ and ${\overset{\land}{σ}}_{k, c}^{2}$ are the unweighted sample variances of X_k for treated and control units in the overall sample. A smaller value of $D_{k}^{group}$ or $D_{k}^{overall}$ indicates a better balance.
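
As an illustration only, the balance measures in equations (20) and (21) can be sketched in Python as follows. The function names `att_weights`, `asd` and `asd_group_max` are hypothetical, and the use of the n − 1 denominator (`ddof=1`) for the unweighted sample variances is an assumption rather than something stated in the text.

```python
import numpy as np

def att_weights(z, e_hat):
    """Weights described in the text: 1 for treated units, e/(1 - e) for controls."""
    z = np.asarray(z)
    e_hat = np.asarray(e_hat, dtype=float)
    return np.where(z == 1, 1.0, e_hat / (1.0 - e_hat))

def asd(x, z, w):
    """Absolute standardized difference of one covariate (equations (20)/(21))."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z)
    w = np.asarray(w, dtype=float)
    t, c = z == 1, z == 0
    # Difference of weighted means between treated and control units
    diff = np.average(x[t], weights=w[t]) - np.average(x[c], weights=w[c])
    # Denominator uses unweighted sample variances (ddof=1 is an assumption)
    se = np.sqrt(x[t].var(ddof=1) / t.sum() + x[c].var(ddof=1) / c.sum())
    return abs(diff) / se

def asd_group_max(x, z, w, g):
    """D_k^group: the maximum of D_{r,k} over the subgroups r."""
    x, z, w, g = (np.asarray(a) for a in (x, z, w, g))
    return max(asd(x[g == r], z[g == r], w[g == r]) for r in np.unique(g))
```

Calling `asd(x, z, np.ones(len(x)))` gives the value in the original sample, and passing `att_weights(z, e_hat)` gives the value in the propensity score weighted sample.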

For the simulations with R = 20 and N_r = 100 ( $r = 1, \dots, R$ ), Figure 1 presents the boxplots of the average values of $D_{k}^{group}$ across the 1000 simulated datasets, in the original sample and in the propensity score weighted samples with propensity scores estimated by different methods. Both SBPS-M and SBPS-W improve subgroup covariate balance over the standard PS/CBPS and GBM methods. Their performance is comparable to PS-X and slightly worse than CBPS-X. Figure 2 presents the boxplots of the average values of $D_{k}^{overall}$ across the 1000 simulated datasets. The SBPS methods achieve comparable overall covariate balance to CBPS, and are better than the other methods. We have similar findings in the simulations with R = 40 and N_r = 50 ( $r = 1, \dots, R$ ).

Figure 1.

Boxplots of $D_{k}^{group}$ in the simulations with R = 20 and N_r = 100 $(r = 1, \dots, R)$ .

Figure 2.

Boxplots of $D_{k}^{overall}$ in the simulations with R = 20 and N_r = 100 $(r = 1, \dots, R)$ .

We next examine the performance of different methods in estimating subgroup ATTs. For the v-th simulated dataset ( $v = 1, \dots, V$ , with V = 1000), let ${\overset{\land}{τ}}_{r, v}$ denote the value of ${\overset{\land}{τ}}_{r}$ ( ${\overset{\land}{τ}}_{r}^{M}$ or ${\overset{\land}{τ}}_{r}^{W}$ ). To evaluate the various methods, we consider three performance measures. The first measure is absolute bias

B_{r} = | \frac{1}{V} \sum_{v = 1}^{V} {\overset{\land}{τ}}_{r, v} - τ_{r} |

(22)

and the second measure is root mean squared error (RMSE)

E_{r} = \sqrt{\frac{1}{V} \sum_{v = 1}^{V} {({\overset{\land}{τ}}_{r, v} - τ_{r})}^{2}}

(23)

We also consider 95% confidence intervals of the form ${\overset{\land}{τ}}_{r, v} \pm 1.96 \times \overset{\land}{SE} ({\overset{\land}{τ}}_{r, v})$ , where $\overset{\land}{SE} ({\overset{\land}{τ}}_{r, v})$ is the standard error of ${\overset{\land}{τ}}_{r, v}$ . The third performance measure C_r is the proportion of confidence intervals that cover τ_r. We then average the performance measures across subgroups to get

\bar{B} = \frac{1}{R} \sum_{r = 1}^{R} B_{r}, \quad \bar{E} = \frac{1}{R} \sum_{r = 1}^{R} E_{r} \quad \text{and} \quad \bar{C} = \frac{1}{R} \sum_{r = 1}^{R} C_{r}

(24)
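
A minimal sketch of how these three measures and their averages in equation (24) could be computed is given below. The function name `performance` and the array layout (one row per simulated dataset, one column per subgroup) are assumptions made for illustration.

```python
import numpy as np

def performance(tau_hat, se_hat, tau_true):
    """Average absolute bias, RMSE and coverage across subgroups.

    tau_hat, se_hat: arrays of shape (V, R), one row per simulated dataset.
    tau_true: array of shape (R,) holding the true subgroup effects.
    """
    tau_hat = np.asarray(tau_hat, dtype=float)
    se_hat = np.asarray(se_hat, dtype=float)
    tau_true = np.asarray(tau_true, dtype=float)

    bias = np.abs(tau_hat.mean(axis=0) - tau_true)             # B_r, equation (22)
    rmse = np.sqrt(((tau_hat - tau_true) ** 2).mean(axis=0))   # E_r, equation (23)

    # 95% normal-approximation intervals and their coverage C_r
    lower = tau_hat - 1.96 * se_hat
    upper = tau_hat + 1.96 * se_hat
    cover = ((lower <= tau_true) & (tau_true <= upper)).mean(axis=0)

    # Averages over the R subgroups, equation (24)
    return bias.mean(), rmse.mean(), cover.mean()
```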

Tables 1 and 2 report the values of $\bar{B}, \bar{E}$ and $\bar{C}$ for different methods using the direct matching estimator ${\overset{\land}{τ}}^{M}$ and the weighting estimator ${\overset{\land}{τ}}^{W}$ , respectively. As an outcome-model-based approach, the C-F method does not use ${\overset{\land}{τ}}^{M}$ or ${\overset{\land}{τ}}^{W}$ , but we include its performance in both Tables 1 and 2 for easy comparison. In most of the cases, the weighting estimator has larger absolute bias and larger RMSE than the direct matching estimator. This can be attributed to the fact that in this simulation, the propensity score does not have common support in the treated and control groups, which is taken into account by the direct matching estimator but not by the weighting estimator. We therefore focus on the direct matching estimator.

Table 1.

Average performance measures of different methods using direct matching estimator ${\overset{\land}{τ}}^{M}$ to estimate subgroup ATTs in the simulations, with 20 subgroups and 100 units per subgroup or 40 subgroups and 50 units per subgroup.

			PS	PS-X	CBPS	CBPS-X	GBM	C-F^a	SBPS-M	SBPS-W
R = 20	Correct	Bias	0.17	0.15	0.28	0.23	1.93	0.46	0.17	0.48
N_r = 100		RMSE	7.20	5.16	7.26	5.99	7.40	4.23	4.62	6.29
		Coverage	0.86	0.96	0.86	0.96	0.84	0.83	0.97	0.91


	Misspecified	Bias	6.73	2.77	6.48	2.04	2.27	4.05	2.84	4.62
		RMSE	8.92	6.15	9.02	5.49	7.73	7.81	6.02	7.61
		Coverage	0.64	0.94	0.66	0.96	0.83	0.77	0.94	0.81
R = 40	Correct	Bias	0.24	0.31	0.25	0.18	0.50	1.81	0.15	0.43
N_r = 50		RMSE	8.50	9.65	8.39	12.52	11.59	2.85	7.68	9.63
		Coverage	0.89	0.85	0.84	0.73	0.74	0.69	0.94	0.90


	Misspecified	Bias	6.57	4.48	5.57	5.32	2.11	0.62	2.63	4.35
		RMSE	9.35	10.52	10.50	16.85	9.95	10.92	7.24	8.87
		Coverage	0.72	0.80	0.73	0.62	0.76	0.64	0.92	0.83

^a The C-F method does not use ${\overset{\land}{τ}}^{M}$ . Its performance is included here for easy comparison.

Table 2.

Average performance measures of different methods using weighting estimator ${\overset{\land}{τ}}^{W}$ to estimate subgroup ATTs in the simulations, with 20 subgroups and 100 units per subgroup or 40 subgroups and 50 units per subgroup.

			PS	PS-X	CBPS	CBPS-X	GBM	C-F^a	SBPS-M	SBPS-W
R = 20	Correct	Bias	1.33	2.02	1.88	2.53	2.53	0.46	1.81	2.38
N_r = 100		RMSE	9.40	6.52	8.39	6.09	8.37	4.23	7.25	7.73
		Coverage	0.87	0.96	0.88	0.97	0.82	0.83	0.94	0.92


	Misspecified	Bias	6.72	4.00	6.43	3.46	2.84	4.05	4.07	5.33
		RMSE	9.37	5.94	9.11	7.00	8.61	7.81	6.32	7.60
		Coverage	0.66	0.93	0.67	0.95	0.81	0.77	0.90	0.81
R = 40	Correct	Bias	1.64	2.58	2.28	2.80	2.72	1.81	2.27	2.39
N_r = 50		RMSE	11.69	10.93	10.67	9.42	10.07	2.85	8.85	11.03
		Coverage	0.86	0.85	0.87	0.77	0.83	0.69	0.94	0.88


	Misspecified	Bias	6.49	3.85	6.26	3.26	3.18	0.62	3.99	5.43
		RMSE	10.83	10.37	10.62	9.10	10.49	10.92	7.94	9.18
		Coverage	0.75	0.85	0.75	0.77	0.81	0.64	0.93	0.86

^a The C-F method does not use ${\overset{\land}{τ}}^{W}$ . Its performance is included here for easy comparison.

Compared to PS, CBPS and GBM, SBPS-M has smaller average RMSE and better coverage rate, regardless of whether the model is correctly specified or misspecified. Compared to C-F, when the model is correctly specified, SBPS-M has larger average RMSE, but has much better coverage; when the model is misspecified, SBPS-M has smaller average RMSE and much better coverage. As all models are likely to be misspecified to a certain degree in real applications, SBPS-M leads to more robust results. We next compare SBPS-M with PS-X and CBPS-X. When R = 20 and N_r = 100, in most cases SBPS-M has smaller RMSE, except that SBPS-M has slightly larger RMSE than CBPS-X when the model is misspecified. The coverage rates of SBPS-M and PS-X/CBPS-X are comparable. When R = 40 and N_r = 50, SBPS-M has smaller average RMSE and better coverage than PS-X/CBPS-X. SBPS-W does not perform as well as SBPS-M. We therefore recommend using SBPS-M combined with the direct matching estimator.

For SBPS-M, if the model is correctly specified, the average proportion of subgroups using subgroup sample fit (i.e. with S_r = 2) is 72.6% when R = 20 and N_r = 100, and 72.0% when R = 40 and N_r = 50; if the model is misspecified, the average proportion of subgroups using subgroup sample fit is 85.1% when R = 20 and N_r = 100, and 81.1% when R = 40 and N_r = 50. Hence in order to achieve better subgroup covariate balance, the subgroup sample fit instead of the overall sample fit is used for a fairly large proportion of subgroups.

We have also compared the performance in estimating the overall ATT using different methods (details not reported). We find that in general the SBPS method is comparable to the standard PS/CBPS, but better than PS-X/CBPS-X, GBM and C-F in estimating the overall ATT.

5 Application to the SHIW

5.1 Background

We apply the proposed method to the 1993–1995 Italy Survey of Household Income and Wealth (SHIW) data analyzed in Mercatanti and Li²⁰ to evaluate the effect of having debit cards on household spending. In this application, the treatment variable equals one if the household possesses one and only one debit card and zero if the household does not possess debit cards during the years 1993–1995, and the outcome is the average monthly household consumption on all consumer goods in 1995. The covariates, listed in Table 3, include the lagged outcome in 1993, background demographic and social variables referring either to the household or to the head of the household, and the number of banks and the yearly average interest rate in the province where the household lives.

Table 3.

Variables in the SHIW data.

Variable	Description
Y	Average monthly consumption on all consumer goods in 1995
	(in thousands of Italian liras)
Z	=1 if the household possesses one and only one debit card
	=0 if the household does not possess debit cards
G	Income group based on the overall household income (in thousands of Italian liras)
	=1 if ≤ 20,000
	=2 if 20,000–30,000
	=3 if 30,000–40,000
	=4 if 40,000–50,000
	=5 if 50,000–60,000
	=6 if > 60,000
X ₁	Average monthly consumption on all consumer goods in 1993
X ₂	The overall household wealth
X ₃	The Italian geographical macro-area where the household lives:
	north (baseline), center, south and islands.
X ₄	The number of inhabitants of the town where the household lives:
	< 20,000 (baseline), 20,000–40,000, 40,000–500,000, > 500,000
X ₅	The number of household members:
	1 (baseline), 2, 3, 4, > 4
X ₆	The number of earners in the household:
	1 (baseline), 2, 3, > 3
X ₇	Average age of the household:
	< 31 (baseline), 31–40, 41–50, 51–65, > 65
X ₈	Education of the head of the household:
	None (baseline), elementary school, middle school, high school, university
X ₉	Age of the head of the household:
	< 31 (baseline), 31–40, 41–50, 51–65, > 65
X ₁₀	Number of banks
X ₁₁	Average interest rate in the province where the household lives

Based on the SHIW data, Mercatanti and Li²⁰ found statistically significant positive average effects of possessing debit cards on household spending for the whole population. The findings can be explained by the mental accounting theory which describes how consumers and their households organize, evaluate, and record their economic activities.^21–23 Non-cash payment instruments, such as debit and credit cards, decouple purchases from payment and mix different purchases. Consumers have cognitive biases and hence may spend more when they use the non-cash methods.

Here we further investigate the subgroup effects of debit card possession on consumption for households with different income levels. Because the marginal propensity to consume,²⁴ or the additional amount of consumption induced by a unit of additional income, decreases with income, consumers' additional amount of consumption using non-cash methods may also decrease with income. We partition households in the SHIW data into six income groups based on the overall household income in units of thousands of Italian liras (with G correspondingly equal to 1 to 6): ≤20,000, 20,000–30,000, 30,000–40,000, 40,000–50,000, 50,000–60,000, and >60,000, and investigate the effects of debit card possession on the monthly household consumption for each income group. Table 4 presents the number of treated and control units in each income group.

Table 4.

Number of treated and control units in each income group in the SHIW data.

	Income Group
	≤20,000	20,000–30,000	30,000–40,000	40,000–50,000	50,000–60,000	>60,000
#Treated	17	47	41	36	26	49
#Control	126	189	168	119	77	105

5.2 SHIW-based simulations

We first conduct a simulation study based on the SHIW data to better understand the performance of the methods. We fix the covariates X and the group label G as in the real data. We generate the treatment assignment using the logistic model (11), where $α$ is set to the estimated coefficients of X from fitting (11) to the real data, and $(δ_{1}, \dots, δ_{6}) = (0.3, 0.4, 0.5, 0.6, 0.7, 0.8)$ . The outcome Y is generated using the following linear model

Y = β_{0} + \sum_{r = 1}^{R} η_{r} \mathbf{1} \{G = r\} Z + β^{⊤} X + ɛ

(25)

We conduct propensity score matching over the real data based on the estimated propensity score from fitting (11) to the real data, apply equation (25) to the matched sample, and set β₀ and $β$ to be the estimated intercept and coefficients of X . We then set $(η_{1}, \dots, η_{6}) = (500, 400, 300, 200, 100, 0)$ and generate $ɛ \sim N (0, 50^{2})$ .
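
The outcome-generating step in equation (25) can be sketched as follows. This is only an illustration: the function name `simulate_outcome` is hypothetical, the subgroup labels are assumed to be coded 1 to R, and the covariates and coefficients passed in are placeholders for the SHIW quantities described above.

```python
import numpy as np

def simulate_outcome(X, G, Z, beta0, beta, eta, sigma=50.0, rng=None):
    """Draw Y from the linear model in equation (25).

    X: (n, p) covariate matrix; G: subgroup labels in 1..R; Z: 0/1 treatment.
    eta: length-R vector of subgroup-specific treatment effects.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    G = np.asarray(G)
    Z = np.asarray(Z)
    eta = np.asarray(eta, dtype=float)
    # eta[G - 1] * Z equals sum_r eta_r * 1{G = r} * Z for labels 1..R
    effect = eta[G - 1] * Z
    eps = rng.normal(0.0, sigma, size=len(Z))  # error term with standard deviation sigma
    return beta0 + effect + X @ np.asarray(beta, dtype=float) + eps
```

With `eta = (500, 400, 300, 200, 100, 0)` and `sigma = 50`, this matches the setting described in the text, given values of `beta0` and `beta` estimated from the matched sample.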

We estimate the subgroup ATTs using the same methods as in Section 4, except that for the SBPS-M and SBPS-W methods exhaustive search is used to find the optimal combination for S (because the number of subgroups is small). When the model is correctly specified, we include all covariates (X₁ to X₁₁) in Table 3; when the model is misspecified, we include all covariates other than X₂ (the overall household wealth).
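
With six subgroups and two choices per subgroup there are only 2⁶ = 64 candidate combinations, so enumeration is cheap. A minimal sketch of such an exhaustive search is shown below; `objective` is a hypothetical callable standing in for the balance criterion (F^M or F^W), and it is assumed here that smaller values indicate better balance.

```python
from itertools import product

def exhaustive_search(objective, n_groups):
    """Enumerate all S in {1, 2}^n_groups and return the minimizer.

    S_r = 1 means the overall-sample fit is used for subgroup r,
    S_r = 2 means the subgroup-sample fit is used.
    """
    best_s, best_val = None, float("inf")
    for s in product((1, 2), repeat=n_groups):
        val = objective(s)  # assumed: an imbalance measure to be minimized
        if val < best_val:
            best_s, best_val = s, val
    return best_s, best_val
```

The number of combinations doubles with each additional subgroup, which is why a stochastic search is used instead when the number of subgroups is large.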

Table 5 reports the values of $\bar{B}, \bar{E}$ and $\bar{C}$ for various estimation methods. In this simulation, the propensity score has better overlap in the treated and control groups than in the simulation in Section 4, and the weighting estimator gives smaller average RMSE than the direct matching estimator. The two SBPS methods have similar performance, and their performance is comparable to the standard PS/CBPS. Compared to PS-X/CBPS-X, the SBPS methods have smaller average RMSE when the direct matching estimator is used. Compared to GBM or C-F, the SBPS methods have smaller average RMSE when the model is misspecified, and have higher coverage rate regardless of whether the model is correctly specified or misspecified.

Table 5.

Average performance measures of different methods in estimating subgroup ATTs in the SHIW-based simulations.

			PS	PS-X	CBPS	CBPS-X	GBM	C-F^a	SBPS-M	SBPS-W
${\overset{\land}{τ}}^{M}$	Correct	Bias	8.11	7.21	6.41	6.16	89.68	26.94	2.94	6.30
		RMSE	112.39	126.68	112.41	128.69	174.91	46.96	107.54	109.55
		Coverage	0.94	0.96	0.93	0.97	0.78	0.79	0.96	0.95

	Misspecified	Bias	47.21	30.96	45.26	39.46	130.82	53.82	46.10	43.13
		RMSE	126.68	139.19	124.57	135.67	191.29	157.26	122.94	123.61
		Coverage	0.91	0.91	0.91	0.95	0.75	0.77	0.93	0.92
${\overset{\land}{τ}}^{W}$	Correct	Bias	16.66	18.34	18.17	44.53	40.94	26.94	10.84	11.08
		RMSE	88.38	93.56	85.10	89.11	123.53	46.96	90.56	89.28
		Coverage	0.96	0.98	0.96	0.99	0.85	0.79	0.97	0.97

	Misspecified	Bias	46.78	29.93	44.42	41.22	44.51	53.82	39.12	47.03
		RMSE	104.16	103.35	100.02	109.06	130.49	157.26	104.84	105.25
		Coverage	0.95	0.98	0.95	0.99	0.67	0.77	0.96	0.96

^a The C-F method does not use ${\overset{\land}{τ}}^{M}$ or ${\overset{\land}{τ}}^{W}$ . Its performance is included in both parts for easy comparison.

We have also compared the performance in estimating the overall ATT using different methods (details not reported). Similar to the findings in Section 4, in general the SBPS method is comparable to the standard PS/CBPS, but better than PS-X/CBPS-X, GBM and C-F in estimating the overall ATT.

5.3 Real data analysis

For real data analysis, we excluded the GBM and C-F methods and applied the remaining six propensity score estimation methods, where for the SBPS-M and SBPS-W methods exhaustive search is used to find the optimal combination for S . Figures 3 and 4 present the boxplots of the balance measures $D_{k}^{group}$ and $D_{k}^{overall}$ , as defined in Section 4. In terms of subgroup covariate balance, the performance of SBPS-M is better than the standard PS/CBPS and SBPS-W, comparable to PS-X, and worse than CBPS-X. In terms of overall covariate balance, the performance of SBPS-M and SBPS-W is better than PS-X, comparable to CBPS-X, and worse than the standard PS/CBPS.

Figure 3.

Boxplots of $D_{k}^{group}$ in real data analysis.

Figure 4.

Boxplots of $D_{k}^{overall}$ in real data analysis.

Table 6 shows the estimated effects and the associated p-values. Here the p-value is calculated as $2 Φ (- | {\overset{\land}{τ}}_{r} / \overset{\land}{SE} ({\overset{\land}{τ}}_{r}) |)$ , where Φ is the distribution function of the standard normal distribution, and $\overset{\land}{SE} ({\overset{\land}{τ}}_{r})$ is the standard error of ${\overset{\land}{τ}}_{r}$ . We consider a significance level of 0.05. For the direct matching estimate, the PS-X, CBPS-X and SBPS-M methods found significant positive effects (681.25, 560.00 and 681.25) for the lowest income group, which is consistent with our conjecture that the effects of debit card possession on household spending decrease with income. For the weighting estimate, the PS-X and SBPS-M methods found significant positive effects (528.41 and 528.41) for the lowest income group, but the PS and SBPS-M methods also found a significant positive effect (367.34) for the highest income group, which seems inconsistent with our conjecture.
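
For reference, a two-sided p-value based on the standard normal distribution, 2{1 − Φ(|z|)} with z equal to the estimate divided by its standard error, can be computed with the Python standard library alone. The function name `two_sided_p` is hypothetical; the sketch relies on the identity 2{1 − Φ(z)} = erfc(z/√2) for z ≥ 0.

```python
import math

def two_sided_p(tau_hat, se_hat):
    """Two-sided normal p-value for the ratio tau_hat / se_hat."""
    z = abs(tau_hat / se_hat)
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)) for the standard normal
    return math.erfc(z / math.sqrt(2.0))
```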

Table 6.

Estimated treatment effects for each income group in the SHIW data using the different propensity score estimation methods, and the associated p-values and adjusted p-values.

		${\overset{\land}{τ}}^{M}$			${\overset{\land}{τ}}^{W}$
Method	Group	Effect	p-value	Adj. p	Effect	p-value	Adj. p
PS	≤20,000	273.53	0.17	0.47	323.78	0.06	0.17
	20,000–30,000	−105.74	0.34	0.50	−8.38	0.93	0.93
	30,000–40,000	−246.46	0.19	0.47	−133.32	0.23	0.45
	40,000–50,000	−71.43	0.70	0.70	−35.24	0.79	0.93
	50,000–60,000	134.62	0.48	0.58	133.86	0.41	0.61
	>60,000	311.22	0.23	0.47	367.34	0.04	0.17
PS-X	≤20,000	681.25	0.00	0.02	528.41	0.01	0.08
	20,000–30,000	−139.57	0.44	0.65	−29.98	0.79	0.79
	30,000–40,000	59.35	0.68	0.68	−49.09	0.67	0.79
	40,000–50,000	196.77	0.24	0.65	144.80	0.33	0.50
	50,000–60,000	125.00	0.54	0.65	264.29	0.15	0.30
	>60,000	173.96	0.46	0.65	283.66	0.14	0.30
CBPS	≤20,000	105.88	0.62	0.65	312.71	0.06	0.19
	20,000–30,000	−83.77	0.50	0.65	−14.33	0.88	0.88
	30,000–40,000	−113.41	0.42	0.65	−139.68	0.21	0.41
	40,000–50,000	148.57	0.32	0.65	−59.67	0.66	0.79
	50,000–60,000	90.00	0.65	0.65	124.31	0.44	0.67
	>60,000	355.21	0.23	0.65	346.50	0.06	0.19
CBPS-X	≤20,000	560.00	0.02	0.09	400.72	0.06	0.38
	20,000–30,000	50.67	0.67	0.67	−15.39	0.88	0.88
	30,000–40,000	−267.09	0.13	0.26	−63.80	0.57	0.69
	40,000–50,000	151.79	0.40	0.47	118.50	0.40	0.69
	50,000–60,000	186.00	0.38	0.47	130.31	0.48	0.69
	>60,000	406.67	0.11	0.26	209.88	0.27	0.69
SBPS-M	≤20,000	681.25	0.00	0.02	528.41	0.01	0.03
	20,000–30,000	−139.57	0.44	0.70	−29.98	0.79	0.79
	30,000–40,000	60.57	0.67	0.70	−49.09	0.67	0.79
	40,000–50,000	−71.43	0.70	0.70	−35.24	0.79	0.79
	50,000–60,000	134.62	0.48	0.70	133.86	0.41	0.79
	>60,000	311.22	0.23	0.70	367.34	0.04	0.13
SBPS-W	≤20,000	273.53	0.17	0.70	323.78	0.06	0.35
	20,000–30,000	−139.57	0.44	0.70	−29.98	0.79	0.79
	30,000–40,000	−246.46	0.19	0.57	−133.32	0.23	0.45
	40,000–50,000	−71.43	0.70	0.70	−35.24	0.79	0.79
	50,000–60,000	134.62	0.48	0.70	133.86	0.41	0.79
	>60,000	193.62	0.40	0.70	283.66	0.14	0.42

To account for the simultaneous estimation of effects for multiple subgroups, which is known as the multiple testing issue, we use the false discovery rate method of Benjamini and Hochberg²⁵ to calculate the adjusted p-values in Table 6. This method controls the false discovery rate, that is, the expected proportion of rejected null hypotheses that are incorrect rejections. It is more powerful than methods based on the family-wise error rate, such as the Bonferroni correction. Judging from the adjusted p-values, the PS-X method yields a significantly positive effect for the lowest income group when the direct matching estimate is used, and the SBPS-M method yields a significantly positive effect for the lowest income group regardless of whether the direct matching estimate or the weighting estimate is used. This is consistent with our previous conjecture.
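
The Benjamini–Hochberg adjustment itself is short enough to sketch directly. The function name `bh_adjust` is hypothetical, and the code below is the standard step-up adjustment rather than the authors' own implementation.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Applied to the rounded direct matching p-values of the PS-X method in Table 6, `bh_adjust([0.00, 0.44, 0.68, 0.24, 0.54, 0.46])` gives, after rounding to two decimals, 0.00, 0.65, 0.68, 0.65, 0.65 and 0.65. The last five agree with the adjusted p-values in the table; the first differs from the tabulated 0.02 only because the input p-value is itself rounded to 0.00.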

6 Discussion

Focusing on the case where the subgroups are defined by one categorical covariate, we have demonstrated the advantage of the SBPS methods, particularly the SBPS-M method combined with direct matching estimator, compared to several existing methods in estimating subgroup treatment effects. The proposed objective functions are designed to be natural aggregate measures of the overall and subgroup-specific covariate imbalance. The subgroup variable is selected a priori by the investigators. As advocated by Kent,^4,26 in comparative effectiveness research it is particularly important to stratify on the baseline risk score, also known as the prognostic score.²⁷ When such a risk score is available and validated from previous studies, e.g. the Framingham score in cardiovascular diseases, it is natural to perform subgroup analysis based on it.

The SBPS method can be extended to more general settings where the subgroups are defined by multiple covariates. Specifically, let G_h ( $h = 1, \dots, H$ ) denote the labels for covariates that define the subgroups, and suppose that G_h can take values among $1, \dots, R_{h}$ . Hence the total number of subgroups is $\prod_{h = 1}^{H} R_{h}$ . Let X now denote the covariates other than those defining the subgroups. The true propensity score is denoted by $e (X, G_{1}, \dots, G_{H})$ . Let $\{h_{1}, \dots, h_{j}\}$ denote any j-subset (i.e., a subset with j elements) of $\{1, \dots, H\}$ ( $j \in \{1, \dots, H\}$ ). It is easy to extend Proposition 1 to show that the true propensity score satisfies $X \perp Z \mid \{G_{h_{1}} = r_{h_{1}}, \dots, G_{h_{j}} = r_{h_{j}}, e (X, G_{1}, \dots, G_{H})\}$ , for any j-subset and any values $r_{h_{1}} \in \{1, \dots, R_{h_{1}}\}, \dots, r_{h_{j}} \in \{1, \dots, R_{h_{j}}\}$ ( $j \in \{1, \dots, H\}$ ). Hence besides balancing the distribution of X and $(G_{1}, \dots, G_{H})$ for the overall population, the true propensity score can also balance the distribution of X within each subgroup defined by any j-subset ( $j \in \{1, \dots, H\}$ ).

We now use an example with two subgrouping variables to illustrate the extension of the SBPS method. If the subgroups are defined by age and baseline risk, then the true propensity score can balance: (a) the distribution of X ; (b) the distribution of age; (c) the distribution of baseline risk; (d) the distribution of X within each age category; (e) the distribution of X within each category of baseline risk; (f) the distribution of X within each subgroup cross-classified by age and risk. For each (lowest-level) subgroup cross-classified by age and risk, the propensity scores for units in this subgroup can be estimated using the overall sample, the sample for the corresponding age level (which includes units with the same age category but different risk levels), the sample for the corresponding risk level (which includes units with the same risk level but different age), or the subgroup-specific sample. The combination of estimation samples for all subgroups can be chosen in order to optimize an objective function F^M or F^W that accounts for the set of covariate-balancing moment conditions for all of (a)–(f) mentioned above. The SBPS can again be estimated by a stochastic search algorithm.
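
To make the four candidate estimation samples concrete, the following sketch builds them as boolean masks for a given age-by-risk cell. The function name `candidate_samples` and the dictionary keys are hypothetical labels chosen for illustration.

```python
import numpy as np

def candidate_samples(age, risk, a, r):
    """Boolean masks of the four candidate estimation samples for cell (a, r)."""
    age = np.asarray(age)
    risk = np.asarray(risk)
    return {
        "overall": np.ones(len(age), dtype=bool),   # all units
        "age_level": age == a,                      # same age category, any risk level
        "risk_level": risk == r,                    # same risk level, any age category
        "cell": (age == a) & (risk == r),           # subgroup-specific sample
    }
```

Since each cell has four candidates, a design with, say, 3 age categories and 2 risk levels already has 4⁶ = 4096 possible combinations, which illustrates why a stochastic search becomes attractive as the number of cells grows.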

As the number of covariates increases, the above extension would quickly become intractable because the number of moment balancing conditions grows rapidly. A possible direction is to use instead the overlap weights,⁸ which weight each unit by its propensity of being assigned to the opposite group; that is, each treated unit is weighted by 1 − e and each control unit is weighted by e. Specifically, Li et al.⁸ showed that when the propensity scores are estimated by maximum likelihood under a logistic regression model, the overlap weights lead to exact balance in the means of any included covariate between treatment and control groups, and the exact balance property applies to any included derived covariate, such as higher-order terms and interactions. Therefore, if the postulated propensity score model includes any interaction term of a binary covariate, then the overlap weights lead to exact balance in the means in the subgroups defined by that binary covariate. Because any categorical variable is equivalent to a collection of binary dummy variables, we can leverage this exact balance property for subgroup analysis. For example, we can first use a high-dimensional variable selection method such as LASSO to select important interaction terms (which define subgroups) in a propensity score model, then estimate the propensity score using the selected main effects and interactions, and finally use the estimated propensity scores to obtain the overlap weights.
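
The exact balance property can be checked numerically. The sketch below uses simulated data with one continuous covariate and one binary subgroup indicator; the data-generating coefficients, the sample size and the Newton-Raphson fitting routine `fit_logistic_mle` are all assumptions made for illustration. An unpenalized maximum likelihood fit is used deliberately, since the property relies on the likelihood score equations holding exactly.

```python
import numpy as np

def fit_logistic_mle(X, z, n_iter=50, tol=1e-12):
    """Unpenalized logistic regression by Newton-Raphson (X includes an intercept)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        e = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (z - e)                           # score vector
        hess = X.T @ (X * (e * (1.0 - e))[:, None])    # Fisher information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def weighted_mean_diff(x, z, w):
    """Weighted treated-minus-control difference in means."""
    t, c = z == 1, z == 0
    return np.average(x[t], weights=w[t]) - np.average(x[c], weights=w[c])

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
g = rng.integers(0, 2, size=n)                      # hypothetical binary subgroup indicator
X = np.column_stack([np.ones(n), x1, g, x1 * g])    # main effects plus interaction
true_logit = 0.3 * x1 - 0.5 * g + 0.4 * x1 * g      # arbitrary illustrative coefficients
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

e_hat = 1.0 / (1.0 + np.exp(-X @ fit_logistic_mle(X, z)))
w = np.where(z == 1, 1.0 - e_hat, e_hat)            # overlap weights

overall_diff = weighted_mean_diff(x1, z, w)
sub_diff = weighted_mean_diff(x1[g == 1], z[g == 1], w[g == 1])
```

Both `overall_diff` and `sub_diff` should be zero up to numerical error, because the model includes the main effect of `g` and its interaction with `x1`.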

In the cases of multiple subgrouping variables, a natural extension is to incorporate the potential correlation between different variables by considering a Mahalanobis-type balance metric in the objective function. Specifically, one could replace the standard deviation in our objective function (equations (14) and (15)) with the inverse of the correlation matrix of all covariates excluding the subgrouping variables. However, in the design stage, there is no guarantee that such an objective function will outperform the one with marginal moment conditions in terms of treatment effect estimation. Comparison between different balance metrics would require outcome information, which is not available in the design stage. Nonetheless, performance of such alternative balance metrics merits further investigation in empirical studies.

Much progress has been made recently in leveraging machine learning models to estimate heterogeneous treatment effects (HTE). Popular examples include the BART approach,⁵ tree methods such as Causal Forest,⁶ and causal boosting,²⁸ which are designed to identify the subpopulations with significant HTEs in a data-driven fashion. A main distinction between these methods and the proposed SBPS method is the design- versus analysis-based approach to causal inference. For example, SBPS, as well as CBPS or any propensity-score-based method, is design-based, which avoids modeling the outcome and instead focuses on ensuring covariate balance. In contrast, most of the above machine learning methods directly model the outcomes, bypassing the propensity scores and thus the balance issue. When the outcome model is correctly specified, such outcome-model-based methods produce valid and the most efficient estimates. However, the outcome model is almost always misspecified to some degree, threatening the validity of the causal conclusions. Indeed, in our simulations, the results of the Causal Forest method are generally inferior to those of SBPS in estimating the subgroup ATTs. This, of course, cannot be taken as general evidence against outcome-model-based methods for HTE. Rather, it highlights that comparison between the two streams (design versus analysis) of methods is highly case-dependent.

Footnotes

Acknowledgements

The authors are grateful to Peter Austin, Laine Thomas and three anonymous referees for their helpful comments that have greatly improved the exposition of this paper, and to Andrea Mercatanti for providing the SHIW data. Part of this paper was written when Jing Dong was an exchange student at Duke University under the High-Level Graduate Student Scholarship of China Scholarship Council.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: Fan Li's research is partially funded by NSF-SES grant 1424688.

References

1. Rosenbaum PR and Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41–55.

2. D'Agostino RB. Tutorial in biostatistics: propensity score methods for bias reduction in the comparisons of a treatment to a non-randomized control. Stat Med 1998; 17: 2265–2281.

3. Austin PC. A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Stat Med 2008; 27: 2037–2049.

4. Kent DM, Rothwell PM, Ioannidis JP, et al. Assessing and reporting heterogeneity in treatment effects in clinical trials: a proposal. Trials 2010; 11: 85.

5. Hill JL. Bayesian nonparametric modeling for causal inference. J Computat Graph Stat 2011; 20: 217–240.

6. Wager S and Athey S. Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc 2018; 113: 1228–1242.

7. Imai K and Ratkovic M. Covariate balancing propensity score. J Royal Stat Soc: Ser B (Stat Methodol) 2014; 76: 243–263.

8. Li F, Lock Morgan K and Zaslavsky AM. Balancing covariates via propensity score weighting. J Am Stat Assoc 2018; 113: 390–400.

9. Li F, Thomas LE and Li F. Addressing extreme propensity scores via the overlap weights. Am J Epidemiol 2019; 188: 250–257.

10. Abadie A and Imbens GW. On the failure of the bootstrap for matching estimators. Econometrica 2008; 76: 1537–1557.

11. Abadie A and Imbens GW. Large sample properties of matching estimators for average treatment effects. Econometrica 2006; 74: 235–267.

12. Lumley T and Scott A. Fitting regression models to survey data. Stat Sci 2017; 32: 265–278.

13. Lumley T. Analysis of complex survey samples. J Stat Software 2004; 9: 1–19.

14. Ridgeway G, McCaffrey D, Morral A, et al. twang: Toolkit for weighting and analysis of nonequivalent groups, 2017, https://CRAN.R-project.org/package=twang. R package version 1.5.

15. Imbens GW and Rubin DB. Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge, UK: Cambridge University Press, 2015.

16. Rosenbaum PR and Rubin DB. The bias due to incomplete matching. Biometrics 1985; 41: 417–446.

17. Rosenbaum PR and Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat 1985; 39: 33–38.

18. McCaffrey DF, Ridgeway G and Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Meth 2004; 9: 403.

19. Hansen LP, Heaton J and Yaron A. Finite-sample properties of some alternative GMM estimators. J Business Economics Stat 1996; 14: 262–280.

20. Mercatanti A and Li F. Do debit cards increase household spending? Evidence from a semiparametric causal analysis of a survey. Ann Appl Stat 2014; 8: 2485–2508.

21. Thaler RH. Mental accounting and consumer choice. Market Sci 1985; 4: 199–214.

22. Thaler RH. Anomalies: saving, fungibility, and mental accounts. J Economic Perspect 1990; 4: 193–205.

23. Thaler RH. Mental accounting matters. J Behavior Decision Making 1999; 12: 183–206.

24. Keynes JM. General theory of employment, interest and money. India: Atlantic Publishers & Dist, 1936.

25. Benjamini Y and Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc Ser B (Methodological) 1995; 57: 289–300.

26. Kent DM and Hayward RA. Limitations of applying summary results of clinical trials to individual patients: the need for risk stratification. JAMA 2007; 298: 1209–1212.

27. Hansen BB. The prognostic analogue of the propensity score. Biometrika 2008; 95: 481–488.

28. Powers S, Qian J, Jung K, et al. Some methods for heterogeneous treatment effect estimation in high dimensions. Stat Med 2018; 37: 1767–1787.