We develop a novel approach for quantifying small effects in regression models. Our method is based on variation in the mean function, in contrast to methods that focus on regression coefficients. Our idea applies in diverse settings such as testing for a negligible trend and quantifying differences in regression functions across strata. Straightforward Bayesian methods are proposed for inference. Four examples are used to illustrate the ideas.
There is increasing interest in assessing whether differences between experimental conditions, even if present, are sufficiently small so as to be considered of no practical importance.1 These questions can often be formulated as tests for equivalence or tests for a negligible effect, a framework that provides a sensible and principled approach to operationalize a procedure for studies where the investigator may wish to verify the absence of a scientifically meaningful effect. Most assessments of negligible effects consider one- or two-sample models for binary, continuous or time-to-event outcomes.
A general method for quantifying negligible effects has not been developed for regression models but special cases have been considered. Dixon and Pechman2 test whether a slope is negligible in a log-linear model. Wellek1 and Cheng and Shao3 present F-tests for negligible differences in two means, one-way and two-way ANOVA. To motivate the need for a general methodology, we highlight Selvin’s4 regression analysis of data from the California Child Health and Development Study (CHDS) of women on the Kaiser Health plan who received prenatal care and later gave birth in Kaiser clinics. Selvin4 considers whether a child’s birth weight can be predicted by the gestation length along with four maternal characteristics (age; cigarettes smoked/day; height; pre-pregnancy weight) and four paternal characteristics (age; years of education; cigarettes smoked/day; height). Selvin4 contrasts the contributions of the maternal and paternal features and concludes that the maternal features explain much more variation in birth weights. The set of paternal predictors is statistically significant at the 0.10 level but has little effect on the predicted birth weight. As one might a priori conjecture that the paternal predictors have limited predictive ability after accounting for maternal features, it is reasonable to assess whether the effect of these characteristics is negligible.
Our approach is based on the recognition that a predictor has little utility when the conditional mean response does not vary much as a function of that predictor. This suggests that an effect measure can be defined in terms of variation in means. We quantify the effect as a measure of discrepancy in the mean function which is averaged over the predictor space, roughly defined to be the set from which the covariate values are selected. The averaging distribution and discrepancy are problem specific, but natural choices are illustrated. Standard R2 measures arise as special cases.
Several novel problems can be addressed by focussing on mean responses. For example, the investigator can assess whether an effect is negligible over a subset of the predictor space and can quantify effects in regression splines, which are non-linear models when the knot locations are unknown. Also, models with different functional forms can be fit to subgroups and compared, for example in an assessment of whether between-stratum differences in regression functions are negligible.
The paper is organized as follows. Section 2.1 motivates our approach. Sections 2.2 and 2.3 outline a Bayesian approach to inference, including tests to establish a negligible effect. Section 2.4 discusses the CHDS analysis. Section 2.5 provides general discussion on regression effects. Section 2.6 raises concerns on R2 as a summary for ANOVA models and suggests alternative effects based on constraints. Section 3 defines a general measure of regression effect that is suited for structured models such as splines and polynomials. Section 4 provides a framework for quantifying variation in structured regression models across strata. Section 5 gives concluding comments.
2 Quantifying small effects in regression models
2.1 General ideas
We define vectors to be column vectors and set for random vectors and . Sample means and covariance matrices are labelled and
Assume that the distribution of Y given satisfies the model for where and The predictor has no effect on the conditional mean of the responses when the regression function is constant on This suggests that a measure of the regression effect can be based on variation in the regression function, for example
where the variance is computed for a selected distribution on , treating as fixed. An alternative effect that has an interpretation as explained variance is . If is positive definite then if and only if .
The distribution on can be viewed multiple ways. One possibility is a sampling distribution for in a joint model for . Another view is that this distribution is solely an averaging distribution or device for quantifying the deviation of the regression function from a constant. In the first setting, is the minimum mean squared error predictor of Y. The accuracy of this predictor is measured by the correlation ratio, which is when and . Also, is the percentage of that is explained by . This interpretation is central to our calibration of even though we derived this summary differently.
Regardless of how the distribution of is viewed, inferences on and will be based on the likelihood from the regression model (i.e. conditional on observed covariates). This requires the distribution of to be completely specified and to be given. The empirical distribution for , which assigns probability to each , is a logical averaging distribution because this option only assumes that the model holds for . This choice is appealing when covariate levels are fixed by design and gives , which is the population R2 for the model.5 An alternative option with continuous predictors is to specify that a sampling or averaging density for has . This also gives so different perspectives on the role of lead to the same R2 effect. The regression effect is different from R2 when .
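As a minimal sketch of the idea above, the following computes the variance of a linear mean function under the empirical averaging distribution (probability 1/n on each observed covariate vector) together with the corresponding explained-variance ratio. The function name, argument names and simulated inputs are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def mean_function_effect(X, beta, sigma2):
    """Variance of the linear mean function x'beta under the empirical
    distribution of the rows of X, and the explained-variance ratio."""
    X = np.asarray(X, dtype=float)
    mu = X @ np.asarray(beta, dtype=float)   # mean response at each observed x_i
    var_mu = np.mean((mu - mu.mean()) ** 2)  # each x_i receives probability 1/n
    explained = var_mu / (var_mu + sigma2)   # share of total variance due to x
    return var_mu, explained

# Illustrative inputs only (simulated covariates, hypothetical parameters).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
var_mu, r2 = mean_function_effect(X, beta=[0.5, 0.0, -0.2], sigma2=1.0)
```

Because the intercept shifts every mean by the same amount, it does not enter the calculation; only the slope vector and the spread of the covariates matter.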
To extend our idea to a more general setting, suppose that the distribution of Y given satisfies the linear model
where and We note that has no effect on the conditional mean response when does not vary as a function of regardless of Thus, reasonable summaries of the effect of are given by
and where and expectations are relative to a specific distribution for treating as fixed. A variety of distributions might be sensible, but a simple data-based choice with continuous predictors is to assume has a normal distribution with mean and covariance matrix given by the corresponding moments of the observed s. For this option, does not depend on This gives and , the population partial R2.
An alternative derivation of follows from reparametrizing equation (2) as where for The s result from orthogonalizing relative to the other effects in equation (2). As a result, and are equivalent for predicting Y after accounting for Details are provided in Appendix 1. Treating the reparametrized model as a specification for the distribution of Y given the effect of is If we assume that the marginal distribution of is the empirical distribution of the s and that and are independent, which is reasonable given that and are uncorrelated, then , where is the covariance matrix of . Thus this simpler averaging distribution also gives
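The orthogonalization argument can be sketched numerically as follows, assuming a model with an intercept, a block of adjustment covariates (called `X1` here) and a block of covariates of interest (`X2`). The names and the ratio form of the partial effect are illustrative assumptions; the sketch residualizes `X2` on the intercept and `X1` and then measures the variation in the mean attributable to the residualized block.

```python
import numpy as np

def partial_effect(X1, X2, beta2, sigma2):
    """Variation in the mean due to X2 after orthogonalizing X2 against
    an intercept and X1, plus the implied partial explained variance."""
    n = len(X1)
    X1 = np.asarray(X1, dtype=float).reshape(n, -1)
    X2 = np.asarray(X2, dtype=float).reshape(n, -1)
    Z = np.column_stack([np.ones(n), X1])
    coef, *_ = np.linalg.lstsq(Z, X2, rcond=None)
    X2_resid = X2 - Z @ coef                  # part of X2 not explained by X1
    mu2 = X2_resid @ np.asarray(beta2, dtype=float)
    var_partial = np.mean(mu2 ** 2)           # residuals have mean zero
    return var_partial, var_partial / (var_partial + sigma2)

# Illustrative inputs only.
rng = np.random.default_rng(1)
X1 = rng.normal(size=(300, 2))
X2 = 0.5 * X1[:, :1] + rng.normal(size=(300, 1))
var_partial, partial_r2 = partial_effect(X1, X2, beta2=[0.3], sigma2=1.0)
```

If `X2` is an exact linear function of `X1`, the residualized block is zero and the partial effect vanishes, which matches the intuition that such a block adds nothing to prediction.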
To focus the discussion, Sections 2.2–2.5 consider as a summary, with an understanding that when is absent from the regression model. Subsequent sections will further examine the specification of the distribution of and suggest averaging distributions that produce alternatives to as regression effects.
2.2 Bayesian inference
Let be the parameter vector for equation (2) assuming the ei are independent and normally distributed. Bayesian inference for is straightforward using Markov chain Monte Carlo (MCMC) or related simulation methods to generate a sample from the posterior distribution of . To calibrate , note that is often viewed as a small effect.6 However, the issue is more nuanced because it may be sensible to set this threshold as a function of the reduced model As a slightly more liberal rule of thumb, we consider the effect of to be negligible if the posterior 50th and 95th percentiles of are at most 0.02 and 0.04, respectively, provided is not strongly associated with Y (i.e. posterior median for reduced model ).
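The calculation can be sketched with direct Monte Carlo sampling rather than MCMC, assuming the usual reference prior proportional to 1/sigma^2 for the normal linear model (under which the posterior is available in closed form). The function names, the partial explained-variance form of the effect and the default thresholds are written here as assumptions for illustration; this is not presented as the authors' implementation.

```python
import numpy as np

def posterior_partial_r2(y, X1, X2, n_draws=10_000, seed=0):
    """Posterior draws of the partial explained variance of X2 given X1,
    under the reference prior p(beta, sigma2) proportional to 1/sigma2."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X1 = np.asarray(X1, dtype=float).reshape(n, -1)
    X2 = np.asarray(X2, dtype=float).reshape(n, -1)
    Z = np.column_stack([np.ones(n), X1])
    X = np.column_stack([Z, X2])
    k = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sse = np.sum((y - X @ beta_hat) ** 2)
    # sigma2 | y is SSE divided by a chi-square(n - k) variate.
    sigma2 = sse / rng.chisquare(n - k, size=n_draws)
    # beta | sigma2, y is normal with mean beta_hat and covariance sigma2 * (X'X)^{-1}.
    L = np.linalg.cholesky(XtX_inv)
    noise = rng.standard_normal((n_draws, k)) @ L.T
    beta = beta_hat + np.sqrt(sigma2)[:, None] * noise
    # Orthogonalize X2 against the intercept and X1.
    X2r = X2 - Z @ np.linalg.lstsq(Z, X2, rcond=None)[0]
    mu2 = beta[:, Z.shape[1]:] @ X2r.T        # shape (n_draws, n)
    var_partial = np.mean(mu2 ** 2, axis=1)
    return var_partial / (var_partial + sigma2)

def is_negligible(draws, median_bound=0.02, upper_bound=0.04):
    """Rule of thumb from the text: posterior 50th and 95th percentiles
    at most 0.02 and 0.04, respectively."""
    q50, q95 = np.percentile(draws, [50, 95])
    return bool(q50 <= median_bound and q95 <= upper_bound)
```

A typical use would be `is_negligible(posterior_partial_r2(y, X1, X2))`, after separately checking the proviso in the text about the strength of the reduced model.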
2.3 A Bayes test to establish a negligible effect
Our primary focus is the estimation of regression effects. However, we will briefly examine hypothesis testing, as this is a main interest in the literature. A Bayes test to establish a negligible effect of contrasts and , where Δ is a user-defined and problem specific tolerance. For the standard 0-1 loss function, the H0 is rejected when , or equivalently, when the posterior median of Equivalent hypotheses are and where and . We often use and .
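Given posterior draws of the effect, the decision rule described above reduces to a one-line comparison. The sketch below assumes that the null hypothesis is "effect at least Δ" and the alternative is "effect below Δ"; the function name and the example draws are hypothetical.

```python
import numpy as np

def bayes_negligible_test(effect_draws, delta):
    """Posterior probability that the effect is below the tolerance delta,
    and the 0-1 loss decision (reject the non-negligible hypothesis when
    that probability exceeds one half)."""
    effect_draws = np.asarray(effect_draws, dtype=float)
    prob_negligible = float(np.mean(effect_draws < delta))
    return prob_negligible, prob_negligible > 0.5

# Hypothetical posterior draws, for illustration only.
prob, reject_h0 = bayes_negligible_test([0.01, 0.02, 0.03, 0.50], delta=0.04)
```

Rejecting when the posterior probability exceeds one half is the same as requiring the posterior median of the effect to fall below Δ, which is why the two statements in the text are equivalent.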
Wellek’s1 ANOVA F-tests to establish a negligible effect specify Ha as the negligible effect. This is consistent with a frequentist’s goal to establish that the effect, even if present, is small enough to be ignored. We follow this convention but stress that the Bayes test does not depend on which hypothesis is labelled To contrast a Bayes and frequentist approach in equation (2), note that the F-statistic for testing has a distribution,7 which reduces to a central F-distribution when Appendix 1 gives details. This result can be used to construct a size α test1 for leading to the same hypotheses considered by the Bayes test. In particular, is rejected in favor of when , where is the percentile of the reference non-central F-distribution. In essence, large F-values support a conclusion that a regression effect exists whereas small values support a conclusion that the effect is negligible. The two F-tests use the same statistic but with different reference distributions to address different questions. Both conclusions, or one or neither, are possible. This issue may also arise in Bayes tests.
2.4 CHDS analysis
Selvin4 considers the model to predict the birth weight (in pounds) of 680 live-born white male infants, where is the vector of paternal features (PA = age; PE = years of education; PS = cigarettes smoked/day; PH = height), and is the vector with gestation length (GL) and four maternal features (MA = age; MS = cigarettes smoked/day; MH = height; MW = pre-pregnancy weight). Table 1 gives posterior means and standard deviations (SDs) for based on a reference prior8 The predictors were standardized so that the regression coefficients are in one pound units. The estimates are based on 10,000 posterior samples and match a least squares (LS) analysis given the large sample size. The posterior mean and SD of σ are 0.95 and 0.03, respectively. The posteriors for several regression parameters (GL, MS, MH, MW and PH) are concentrated away from zero but only gestation length has much impact on mean birth weight.
Table 1. Posterior means and SDs for regression coefficients in CHDS analysis.

Effect      Mean      SD
Constant    7.517     0.036
GL          0.438     0.037
MA          −0.007    0.064
MS          −0.172    0.038
MH          0.099     0.044
MW          0.151     0.042
PA          0.000     0.064
PE          0.009     0.039
PS          0.026     0.039
PH          0.103     0.039
As mentioned in Section 1, one might hypothesize that the paternal predictors have limited information after accounting for GL and maternal features. In Selvin’s4 frequentist analysis, adding to the model increases the estimated R2 from 0.252 to 0.262. Although the four paternal characteristics are significant at the 10% level (p = 0.09), the tiny estimated partial suggests that these features have minimal impact on prediction. This is reinforced by the F-test to establish that the effect of is negligible, which has a p-value of 0.09 when
A Bayesian analysis leads to a similar conclusion. If one considers the question as an issue of model selection, then the difference between the Deviance Information Criterion8 (DIC) for the full model and the reduced model without the paternal features () does not suggest a clear preference. Least squares estimates were used to compute the effective number of parameters in DIC but an identical conclusion was obtained using other estimators. Although one might choose the simpler model based on parsimony, this decision is bolstered by noting that the posterior distribution of is concentrated near zero; see Figure 1. In particular, the posterior median of is 0.016 and , which suggest that the effect of paternal features on prediction is negligible using our benchmarks.
Figure 1. Posterior density of the effect of the paternal features for the CHDS analysis.
2.5 Discussion of regression effects
The specification of a regression effect should reflect the purpose of the summary, which can range from reporting an effect size to addressing explanatory power. The units of measurement also play a role when deciding whether the summary should be standardized. Wilkinson9 recommends regression coefficients as effect sizes when the units of measurement have practical interpretation, as in the CHDS analysis. Otherwise they suggest a standardized measure such as , which we consider given our focus on prediction.
The uncertainty of an estimated effect impacts whether an effect is negligible. Small non-null effects may be estimated well enough to support a conclusion that they are negligible but poorly estimated null effects may not. Furthermore, conflicting conclusions can be reached from different effect measures. The CHDS analysis highlights some of these points. In particular, there is a 0.99 posterior probability that the paternal height (PH) regression coefficient exceeds zero and DIC increases by 5.30 when PH is omitted from the model, both of which support a conclusion that PH improves the fit of the model. However, from adding PH last to the model, which suggests that PH has a negligible effect on prediction.
2.6 Alternatives to
It is important to emphasize that is a measure of predictive ability that is specific to the observed pairs and depends on the distribution of these pairs through These pairs may have been fixed by design, simply observed as in the CHDS analysis, or resulted from sampling. We follow standard practice and treat the s as fixed regardless of the process for choosing these pairs. Weisberg10 argues that is most meaningful when covariate values are randomly sampled, in part, because the value of can be dramatically decreased simply by restricting the range of the observed pairs.
There are two clear ways to alter the dependence of and on the distribution of the observed predictor values. One approach that is implicit in Section 2.1 and illustrated in Sections 3 and 4 is to include unobserved pairs in the definition of the effect. This requires an assumption that the model extends beyond the observed covariate pairs. A second approach that has broad application in ANOVA models is to modify the weight assigned to each unique pair. This can often be done by using the constraints implied by the reduced model with to define effects. For example, consider a one-way ANOVA model for observations in groups The F-test1 to establish negligible differences in means has , where and . For this model, and have the undesirable property of depending on group weights. Let and note that the reduced model of equal means satisfies the constraints for . A simple measure of a non-null effect that is independent of the group sample sizes is . The same issue arises in a test for negligible interaction in two-way ANOVA,3 with an analogous remedy.
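The dependence on group weights can be illustrated with a small sketch that contrasts a sample-size-weighted variance of group means with an equally weighted one, both standardized by the error variance. The exact standardization used in the paper is not reproduced here; the function name and the numbers are hypothetical.

```python
import numpy as np

def anova_effects(means, sizes, sigma2):
    """Standardized variation among group means, computed with
    sample-size weights and with equal weights."""
    means = np.asarray(means, dtype=float)
    w = np.asarray(sizes, dtype=float)
    w = w / w.sum()
    weighted_mean = np.sum(w * means)
    weighted = np.sum(w * (means - weighted_mean) ** 2) / sigma2
    unweighted = np.mean((means - means.mean()) ** 2) / sigma2
    return weighted, unweighted

# Same group means, two different allocations of observations to groups.
balanced = anova_effects([0.0, 0.0, 3.0], sizes=[10, 10, 10], sigma2=1.0)
unbalanced = anova_effects([0.0, 0.0, 3.0], sizes=[49, 49, 2], sigma2=1.0)
```

With the same three means, the weighted summary shrinks sharply when the discrepant group has few observations, whereas the equally weighted summary is unchanged.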
Bayesian inference on effect measures is straightforward, so deciding that an alternative to is more appropriate poses no real challenges. In contrast, linear model inferences based on F-statistics are restricted to as a summary measure. Bootstrap methods may provide a general frequentist approach to inference on regression effects, but this has not been extensively explored in the literature.
3 A general effect measure
3.1 Development
We defined linear model effects based on variation in the mean function. This idea applies to non-linear models and generalizes to other measures of dispersion. Bayesian inferences are straightforward. For example, suppose with , where depends on predictors and parameters . We define a general measure of regression effect over a set Ω as
where Here p is fixed, s > 0 is a standardization, is a distribution over Ω and is the average of the regression function over The summary is an Lp-distance measure, with larger values indicative of a larger effect. The notation does not reflect that is a function of parameters.
To relate equation (4) to earlier summaries, note when This reduces to equation (1) in the linear model and in a one-way ANOVA where and for group labels More generally, The summary is proportional to the mean absolute deviation in which is an alternative to as a measure of dispersion in the regression function.
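A discretized version of such a summary can be sketched as below, assuming the measure takes the form of a standardized Lp distance between the mean function and its average under the averaging distribution. That form, along with the function and argument names, is an assumption made for illustration.

```python
import numpy as np

def lp_effect(mu_values, weights=None, p=2, s=1.0):
    """Standardized L_p distance between mean-function values and their
    weighted average; `weights` is a discrete averaging distribution."""
    mu = np.asarray(mu_values, dtype=float)
    if weights is None:
        w = np.full(mu.size, 1.0 / mu.size)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    mu_bar = np.sum(w * mu)                   # average of the mean function
    dev = np.abs(mu - mu_bar)
    if np.isinf(p):
        return dev[w > 0].max() / s           # limiting case: largest deviation
    return np.sum(w * dev ** p) ** (1.0 / p) / s

mu = [0.0, 0.0, 3.0]                           # illustrative mean values
effects = {p: lp_effect(mu, p=p) for p in (1, 2, np.inf)}
```

With p = 2 and s = 1 the square of the result is the variance of the mean function under the averaging distribution, and with p = 1 it is the mean absolute deviation, matching the two special cases discussed in the text.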
To give an example where equation (4) is useful, suppose and where the elements of are a basis for functions that admit a finite series representation, for example a spline or polynomial. The covariate values depend on observed values for a scalar feature Z, for example age, and may depend on a parameter vector , for example the knots in a spline. The model is non-linear when is unknown, so linear model effects do not apply. In general, the effect of Z is negligible if the mean function is nearly constant over the predictor space. However, with a flexible spline or polynomial model it is also meaningful to assess whether the effect is negligible over subsets of the predictor space. For example, Figure 2 plots a quadratic spline with a knot at κ = 6 and the average of the spline over and . Here when and 0 otherwise. The spline varies much less over than , which we interpret as a weaker effect of z over than is observed overall.
Figure 2. Plot of spline function with mean of curve over two regions.
We can use to measure variation in over an interval Ω. Also, a continuous distribution for is often sensible for splines and polynomials because it is plausible to assume that the model holds over the entire predictor space. As a result, equation (4) need not depend exclusively on the observed covariate levels. One can also consider multiple For example, the local variation in can be described by setting for some c > 0 and then varying z across the predictor space.
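The following sketch illustrates interval-specific and local effects for a quadratic spline with a single knot at 6, using a truncated power basis and a uniform averaging distribution approximated on a fine grid. The coefficients are hypothetical (they are not those of the curve in Figure 2), and the helper names are assumptions.

```python
import numpy as np

def quadratic_spline(z, coef, knot):
    """Quadratic spline b0 + b1*z + b2*z^2 + b3*(z - knot)_+^2."""
    z = np.asarray(z, dtype=float)
    plus = np.clip(z - knot, 0.0, None)       # (z - knot)_+ : zero below the knot
    return coef[0] + coef[1] * z + coef[2] * z ** 2 + coef[3] * plus ** 2

def interval_effect(mu_fn, lo, hi, p=1, s=1.0, n_grid=2001):
    """L_p deviation of mu_fn from its average under a uniform
    distribution on [lo, hi], approximated on an equally spaced grid."""
    grid = np.linspace(lo, hi, n_grid)
    mu = mu_fn(grid)
    dev = np.abs(mu - mu.mean())
    return np.mean(dev ** p) ** (1.0 / p) / s

coef = (0.0, 0.6, -0.05, 0.045)                # hypothetical coefficients
knot = 6.0
f = lambda z: quadratic_spline(z, coef, knot)

overall = interval_effect(f, 0.0, 10.0)        # whole predictor space
upper = interval_effect(f, 6.0, 10.0)          # region above the knot

# Local variation: slide a window [z - c, z + c] across the predictor space.
c = 1.0
centers = np.arange(1.0, 10.0, 1.0)
local = [interval_effect(f, z - c, z + c) for z in centers]
```

For these coefficients the curve rises steeply below the knot and is nearly flat above it, so the effect over the upper region is much smaller than the overall effect, and the local summaries shrink as the window moves past the knot.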
3.2 Selecting a summary
The effect is intended to be flexible and easily interpreted, with problem-specific choices for s, p, and Standard options are or and p = 1, 2 or the limiting case . Natural choices for are the empirical distribution on a discrete uniform distribution on the unique s, a uniform distribution on the predictor space or an interval or the distribution of when are sampled.
The specifications for s and p should depend on the role of the summary and the units of measurement, as noted in Section 2.5. To contrast options for s, suppose A measure of predictive ability results when because is the proportion of explained by . Alternatively, is the squared coefficient of variation of , which is a descriptive summary of relative variation in the regression function. These two measures are unitless whereas has the same units as Y. This summary may be difficult to calibrate if the units are unnatural and ill considered if prediction is the interest. Another consideration is that is less robust to extreme deviations from for larger values of p, with providing an assessment of whether the variation in is uniformly small over Ω. We prefer p = 1 to p = 2 for a descriptive summary because is proportional to the mean absolute deviation in which we find easier to interpret than or .
Some guidance can be given for calibrating the posterior median of for p = 1, 2 and Given the connection with R2, we propose as a bound for . Appendix 2 suggests and with b = 0.50 and b = 1. The posterior median bound is liberal and follows from Jensen’s inequality. These bounds can be used in tests to establish a negligible effect.
3.3 Example: Peptide analysis
Figure 3 plots the logarithm of serum C-peptide in pmol/ml (Y) against age (z) in years for 43 participants in a study of insulin-dependent childhood diabetes.11 A least squares fit to a quadratic spline model with a knot at the average age (8.80 years) is superimposed on the plot. Here with . The curve mimics the observed trend reasonably well, and varies by two across ages but only by 0.10 for (approximately the maximum age).
Figure 3. Plot of insulin data.
As an illustration, suppose we are interested in the variation in over ages treating κ as fixed. We consider with uniform on [8,15] (i.e. ) and a reference prior . If we view a 5% relative variation about to be unimportant, then based on 20,000 posterior samples. Ignoring that this evaluation was suggested by the data, this analysis strongly suggests that the variation in the regression curve relative to the average level over is unimportant. If we consider instead then . This second analysis gives some support, but does not strongly suggest, that the predictive ability of the spline is negligible over Not surprisingly, the choice of summary has implications on the conclusion.
4 Group comparisons in structured models
4.1 Motivating examples
We consider two analyses that highlight extensions of equation (4) to regression models with predictors and a factor. As background for our first analysis, Burt12 discusses a study of 27 identical twins raised apart, one by foster parents and one by the natural parents. An interest is whether the relationship between their IQ scores varies by the social stratum of the natural parents (low, middle, high). To answer this question, Weisberg10 considers the model where Ykj is the IQ score for the twin raised by a foster parent for pair j in stratum k and zkj is the IQ score for the twin raised by the natural parents. A plot of the data shows strong linear relationships with minor differences across strata. Weisberg10 concludes based upon an F-test that the lines do not vary significantly across strata, which might suggest that the environmental effect on intelligence is unimportant. An alternative analysis is to evaluate whether differences in the regression functions are negligible. Although this may be assessed using , Section 4.3 will consider an extension of equation (4) that eliminates dependence of the effect on the design matrix.
As another example, Figure 4 plots a time trend of the logarithm of the incidence rates (per 100,000) for melanoma of the skin for non-Hispanic white males and females. The rates are from SEER (Surveillance, Epidemiology and End Results) 13 study areas (San Francisco, Connecticut, Detroit, Hawaii, Iowa, New Mexico, Seattle, Utah, Alaska native registry and rural Georgia) for years 1992–2011, age adjusted to the 2000 US standard population. The standard errors of the estimates are about 0.02.
Figure 4. Non-Hispanic white incidence rates for melanoma of the skin (SEER 13).
A standard modeling approach for rates is to fit splines to the sex-specific log-incidence rates.13 We consider a linear spline for males and a quadratic spline for females, each with one knot. We assume that the knot locations are unknown and have independent discrete uniform distributions from years 1995–2008. Gibbs sampling was used to simulate the posterior distribution of where based on large sample normal approximations to the distribution of given κi and the data, assuming independence between the male and female models.
Figure 4 plots the posterior mean of the regression functions based on 5000 samples after a 40,000 sample burn-in. The estimated trends closely follow the data. The temporal increase in rates evident in Figure 4 mirrors worldwide trends, which led to the coining of the phrase “the melanoma epidemic.” However, the sex-specific age-adjusted annual U.S. death rates for melanoma have remained relatively constant from 1990 to 2006.14 Recent research15 suggests that the inconsistency between the trends in incidence and mortality rates is caused to a large degree by the diagnosis of an increased number of early-stage melanomas, possibly due to more intense screening. Interestingly, the between-sex difference in log-rates has stayed relatively constant, ranging from 0.37 to 0.44 over the 20 year period. We quantify this interaction in Section 4.4, taking into account that the models are different and nonlinear.
4.2 Statistical approach
We consider measures of variation and interaction among regression functions when data are available from multiple groups or strata. We assume for and where and Here , where and is a stratum-specific covariate that is a function of z and for The dimension of may vary by strata. This allows different models by strata, for example, different order polynomials, or splines. The parameter may be absent or fixed, which is the case in standard polynomials and splines where the knot locations are assumed known.
A natural pointwise measure of differences among the mean functions is the discrepancy where or and . The discrepancy is zero at if and only if the regression functions are equal . Note that is the squared coefficient of variation in the means at z. An overall measure of differences among the regression functions, is obtained by averaging with respect to a distribution over as in equation (4).
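One way to sketch such a between-strata summary is to evaluate every stratum's mean function on a common grid, compute the pointwise discrepancy from the across-strata mean, and average over the grid. The precise form of the discrepancy, the root applied after averaging, the function name and the three regression lines below are all assumptions for illustration.

```python
import numpy as np

def between_strata_effect(mu, weights=None, p=1, sigma=None):
    """Average discrepancy among stratum mean functions.

    mu: array of shape (K strata, m grid points).
    sigma: if given, standardize by sigma; otherwise standardize by the
    across-strata mean at each grid point (a relative measure)."""
    mu = np.asarray(mu, dtype=float)
    m = mu.shape[1]
    if weights is None:
        w = np.full(m, 1.0 / m)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    mu_bar = mu.mean(axis=0)                   # across-strata mean at each z
    scale = np.abs(mu_bar) if sigma is None else sigma
    d = np.mean(np.abs(mu - mu_bar) ** p, axis=0) / scale ** p
    return np.sum(w * d) ** (1.0 / p)

# Hypothetical intercepts and slopes for three strata.
z = np.linspace(70.0, 130.0, 61)
lines = [(0.0, 1.00), (2.0, 0.98), (-1.0, 1.01)]
mu = np.array([a + b * z for a, b in lines])
relative_effect = between_strata_effect(mu, p=1)
```

The summary is zero exactly when all strata share the same mean function on the grid, and with p = 2 and the relative standardization the pointwise term is the squared coefficient of variation of the means.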
A characterization of no interaction is that is independent of for all (i, j) or equivalently that does not vary across strata, where . An aggregate measure of interaction, is obtained by averaging as in equation (4), where s = 1 or σ and . The difference between and is that centers to have mean 0 over
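The centering step can be sketched as follows: each stratum's deviation from the across-strata mean is first centered to have average zero over the region, so that parallel mean functions yield no interaction. As before, the exact form of the aggregate, the function name and the example curves are assumptions for illustration.

```python
import numpy as np

def interaction_effect(mu, weights=None, p=1, s=1.0):
    """Aggregate interaction among stratum mean functions on a grid.

    mu: array of shape (K strata, m grid points)."""
    mu = np.asarray(mu, dtype=float)
    m = mu.shape[1]
    if weights is None:
        w = np.full(m, 1.0 / m)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    dev = mu - mu.mean(axis=0)                 # stratum deviation at each z
    centered = dev - (dev @ w)[:, None]        # remove each stratum's average offset
    d = np.mean(np.abs(centered) ** p, axis=0)
    return np.sum(w * d) ** (1.0 / p) / s

# Two hypothetical log-rate curves over a 20-year period.
t = np.linspace(0.0, 19.0, 20)
curve_a = 3.0 + 0.03 * t
parallel = np.array([curve_a, curve_a - 0.40])
diverging = np.array([curve_a, curve_a - 0.40 - 0.01 * t])
no_interaction = interaction_effect(parallel)
some_interaction = interaction_effect(diverging)
```

A constant gap between the curves, however large, contributes nothing to the interaction summary; only changes in the gap across the region do.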
4.3 Example: IQ scores
We assessed the relative differences among regression functions using with We considered and a reference prior. The second distribution approximates the IQ distribution in the general population. The posterior 50th and 95th percentiles of based on 20,000 samples are 0.04 and 0.07, respectively, for each of the two distributions Thus, the mean absolute deviation relative to the mean of the regression functions is typically about 0.04. Although a subject matter expert is needed to assess whether this small difference is important, we note that the simpler model of equal regression lines is preferred to the general model based on DIC, as , with slight differences in DICs depending on the method used to estimate the effective number of parameters. An interesting historical detail is that Burt’s research on the heritability of intelligence has been extensively scrutinized, with accusations of falsification of research data.16
4.4 Example: Incidence rates for melanoma
We summarized the interaction using the unstandardized (i.e. s = 1) and The posterior 50th and 95th percentiles of are 0.01 and 0.02, respectively. The 95th percentile is modest relative to the typical yearly difference (0.40) in the observed log-rates between sexes and suggests that differences between the male and female log-rates have not varied much over the 20 year period.
5 Concluding remarks
We presented a novel approach for quantifying regression effects which is based on variation in the mean function. Standard R2 measures arise as special cases but our general approach can also be used to define effects that are less dependent on the configuration of predictor values. Similar extensions apply to linear models with general covariance structures and to mixed models.
We believe that our measures help with the interpretation of small effects and can provide guidance in model simplification. For example, in the CHDS analysis, the negligible effect of paternal covariates on birth weight might suggest a simpler model that excludes these characteristics, a decision that is supported by the Deviance Information Criterion (DIC). We recognize that there is arbitrariness in our general method, as an analyst must specify and However, this could be viewed as a strength rather than a limitation as the question of interest can inform the specification. A sensitivity analysis is recommended when choices are not clear.
We showed that our ideas extend to non-linear models, but caution is needed. For example, suppose for strata A means-based effect measure may be reasonable to address whether the regression functions vary negligibly across strata. However, if μk is the inverse-link function for a generalized linear model, then an analysis of interaction may be most meaningful on the link-transformed scale, as was considered in the melanoma analysis. The primary consideration is that the effect should be defined on the most meaningful scale for inference.
We used improper priors in our analyses. With proper priors, it is easy to tune the prior so that the prior probability of a non-negligible effect is specified. For example, in the CHDS analysis if is independent of then and 0.08 when and 1, respectively. Not surprisingly, increases as the prior becomes more diffuse.
Acknowledgements
The authors would like to thank the Editor and two reviewers for their extensive comments on the manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
1. Wellek S. Testing statistical hypotheses of equivalence and noninferiority, 2nd ed. Boca Raton, FL: Chapman and Hall/CRC, 2010.
2. Dixon PM, Pechman JHK. A statistical test to show negligible trend. Ecology 2005; 86: 1751–1756.
3. Cheng B, Shao J. Exact tests for negligible interaction in two-way analysis of variance/covariance. Stat Sin 2007; 17: 1441–1455.
4. Selvin S. Practical biostatistical methods, New York: Duxbury Press, 1995.
5. Gatsonis C, Sampson AR. Multiple correlation: exact power and sample size calculations. Psychol Bull 1989; 106: 516–524.
6. Cohen J. A power primer. Psychol Bull 1992; 112: 155–159.
7. Christensen R. Plane answers to complex questions: the theory of linear models, 3rd ed. New York: Springer, 2002.
8. Christensen R, Johnson W, Branscum A, et al. Bayesian ideas and data analysis: an introduction for scientists and statisticians, New York: CRC Press, 2011.
9. Wilkinson L; The Task Force on Statistical Inference. Statistical methods in psychology journals: guidelines and explanations. Am Psychol 1999; 54: 594–604.
10. Weisberg S. Applied linear regression, 2nd ed. New York: John Wiley, 1985.
11. Sockett EB, Daneman D, Clarson C, et al. Factors affecting and patterns of residual insulin secretion during the first year of type I (insulin dependent) diabetes mellitus in children. Diabetes 1987; 30: 453–459.
12. Burt C. The genetic determination of differences in intelligence: a study of monozygotic twins reared together and apart. Br J Psychol 1966; 57: 137–153.
13. Kim HJ, Fay M, Feuer EJ, et al. Permutation tests for regression with application to cancer rates. Stat Med 2000; 19: 335–351.
14. Horner MJ, et al. SEER cancer statistics review, 1975–2006. Bethesda, MD: National Cancer Institute (based on November 2008 SEER data submission, posted to the SEER website, 2009). Table I-21, US Prevalence Counts, Invasive Cancers Only, 1 January 2006, Using Different Tumor Inclusion Criteria, http://seer.cancer.gov/csr/1975_2006/results_single/sect_01_table.21_2pgs.pdf (accessed 15 September 2015).
15. Frangos JE, Duncan LM, Piris A, et al. Increased diagnosis of thin superficial spreading melanomas: a 20-year study. J Am Acad Dermatol 2012; 67: 387–394.
16. Mackintosh NJ. Cyril Burt: fraud or framed?, Oxford: Oxford University Press, 1995.
17. Johnson NL, Kotz S. Continuous univariate distributions, I, New York: John Wiley, 1970.