Abstract
Background:
Randomized experiments are often considered the strongest designs to study the impact of educational interventions. Perhaps the most prevalent class of designs used in large-scale education experiments is the cluster randomized design in which entire schools are assigned to treatments. In cluster randomized trials that assign schools to treatments within a set of school districts, the statistical power of the test for treatment effects depends on the within-district school-level intraclass correlation (ICC). Hedges and Hedberg (2014) recently computed within-district ICC values in 11 states using three-level models (students in schools in districts) that pooled results across all the districts within each state. Although values from these analyses are useful when working with a representative sample of districts, they may be misleading for other samples of districts because the magnitude of ICCs appears to be related to district size. To plan studies with small or nonrepresentative samples of districts, better information is needed about the relation of within-district school-level ICCs to district size.
Objective:
Our objective is to explore the relation between district size and within-district ICCs to provide reference values for math and reading achievement for Grades 3–8 by district size, poverty level, and urbanicity level. These values are not derived from pooling across all districts within a state as in previous work but are based on the direct calculation of within-district school-level ICCs for each school district.
Research Design:
We use mixed models to estimate over 7,000 district-specific ICCs for math and reading achievement in 11 states and for Grades 3–8. We then perform a random effects meta-analysis on the estimated within-district ICCs. Our analysis is performed by grade and subject for different strata designated by district size (number of schools), urbanicity, and poverty rates.
Introduction
Randomized experiments are often used to evaluate the impact of educational interventions, products, or services. Over the past decade, the number of experiments in education funded by Federal sources has increased considerably (Spybrook & Raudenbush, 2009). Yet, the number of experimental studies reported in the literature has not kept pace (Spybrook, Puente, & Lininger, 2011), possibly due to null findings from studies that are underpowered. The most common experimental designs used in education are cluster randomized trials (CRTs) that assign whole schools to treatments and where the schools are nested within districts.
CRTs are commonly used to evaluate interventions in other fields such as health (Hayes, Moulton, & Press, 2009) and have been embraced by the education community. The primary reasons for their adoption are the containment of “spillover” or “contamination” effects (the mixing of treatment and control conditions in a common place) and the efficiencies in delivering place-based services (Bloom, 2005). Blocking (controlling for the effects of schools with similar characteristics) is another common practice in such experiments, with school districts serving as a naturally occurring characteristic on which to block schools. Cluster randomized designs incorporating blocks such as districts are sometimes called multisite CRTs (MSCRTs).
In multilevel designs, the precision of estimates of treatment effects, the statistical power to detect effects, and the minimum effect size that is detectable with a given level of certainty (the minimum detectable effect size) all depend (in part) on the variance decomposition between and within schools (Bloom, 2005; Bloom, Bos, & Lee, 1999; Hedges & Rhoads, 2011; Raudenbush, 1997). In two-level designs assigning schools to treatments, this variance decomposition is typically summarized by the school-level intraclass correlation (ICC) coefficient (ρ), which is the proportion of the total variance that occurs between schools. Therefore, planning the sample sizes for a CRT or an MSCRT requires knowledge of the likely value of the school-level ICC coefficient.
The purpose of this article is to explore the distribution of within-district ICCs and to provide guidance on ICC values for mathematics and reading achievement across districts of varying size, urbanicity, and levels of poverty. These values will be especially useful to evaluators employing CRTs where schools (clusters) are assigned to treatment condition but are located within a set of districts (blocked sites). The values presented in this article are unique in that they are not based on three-level (students in schools in districts) mixed models that include entire state data systems (as in Hedges & Hedberg, 2014) but are instead based on school-level ICCs estimated from individual districts. We then summarize these district-specific ICCs with a random effects meta-analysis by grade and district subgroups.
CRTs
The values provided in this article have specific use for CRTs or MSCRTs where schools are the level of randomization but are blocked by district fixed effects.
Blocked CRT designs in education are usually three-level designs because they involve three-stage sampling where districts (sites) are selected first, then schools (which are statistical clusters), and finally individuals within schools. However, when the number of districts is small, they may be considered to have fixed effects since modeling with so few districts would not produce reliable variance components (and thus district effects may be modeled as a set of dummy variables so that the model reduces to two levels of random effects). Thus, the model for this design is a two-level model predicting the outcome for the ith student in school j in district k, such as (Spybrook & Raudenbush, 2009):
where
Given this model, the statistical power of the test for the treatment effect depends on the sample sizes and two other parameters, that is, the within-district ICC and the effect size. The within-district school-level ICC is defined as follows:
The effect size (based on the total variation), δ, is defined as:
The three components to the sample size are as follows: the number of districts (sites) selected (K), the number of schools (clusters) per district (J), and the number of individuals within each school (n), which we will assume here are equal in each cluster for simplicity of exposition.
One method to produce a test statistic for testing the null hypothesis of no treatment effect employs the F sampling distribution with 1 and K(J − 2) degrees of freedom. Under the alternative hypothesis, the test statistic has the noncentral F distribution and has a noncentrality parameter as follows:
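In one common notation, the model, the within-district ICC, the effect size, and the noncentrality parameter for this design can be sketched as below. The symbols $\beta_{0k}$, $\beta_1$, $T_{jk}$, $\eta_{jk}$, $\sigma^2_S$, and $\sigma^2_W$ are assumptions chosen for illustration rather than the article's own notation, and the expression for $\lambda$ assumes that half of the J schools in each district are assigned to each condition. With these assumptions, the form of $\lambda$ shown here reproduces the power of 0.65 quoted for the worked example in the text.

```latex
\begin{align*}
Y_{ijk} &= \beta_{0k} + \beta_1 T_{jk} + \eta_{jk} + \epsilon_{ijk},
  \qquad \eta_{jk} \sim N(0, \sigma^2_S), \quad \epsilon_{ijk} \sim N(0, \sigma^2_W), \\
\rho &= \frac{\sigma^2_S}{\sigma^2_S + \sigma^2_W}, \\
\delta &= \frac{\beta_1}{\sqrt{\sigma^2_S + \sigma^2_W}}, \\
\lambda &= \frac{K J n \, \delta^2}{4\left[1 + (n - 1)\rho\right]}.
\end{align*}
```

Here $\beta_{0k}$ is a fixed intercept for district k, $T_{jk}$ is the treatment indicator for school j in district k, and $\sigma^2_S$ and $\sigma^2_W$ are the between-school (within-district) and within-school variances.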
The power of the design is the upper tail (survivor function) of the noncentral F distribution with this noncentrality parameter and 1 and K(J − 2) degrees of freedom, evaluated at the critical value of the corresponding central F distribution. For example, the power for a design with effect size 0.2, n = 20, J = 10, K = 12, and an ICC of .17 is 0.65.
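As a rough check on the worked example, the sketch below computes the power of the F test with 1 and K(J − 2) degrees of freedom. It rests on several assumptions that are not quoted from the article: a two-sided test at α = .05, balanced assignment of schools within each district, and a noncentrality parameter of the form λ = KJnδ²/(4[1 + (n − 1)ρ]). The function names are hypothetical, and the noncentral F distribution is evaluated by simple numerical integration so that only the Python standard library is needed.

```python
import math


def _phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ncf1_cdf(f, dfd, nc, steps=2000):
    """P(F <= f) for a noncentral F with 1 and dfd degrees of freedom.

    Uses F = (Z + sqrt(nc))**2 / (V / dfd), with Z standard normal and
    V chi-square(dfd), and integrates over V with the midpoint rule.
    """
    upper = dfd + 12.0 * math.sqrt(2.0 * dfd) + 40.0  # covers the chi-square mass
    h = upper / steps
    log_norm = -(dfd / 2.0) * math.log(2.0) - math.lgamma(dfd / 2.0)
    root_nc = math.sqrt(nc)
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * h
        dens = math.exp((dfd / 2.0 - 1.0) * math.log(v) - v / 2.0 + log_norm)
        a = math.sqrt(f * v / dfd)
        total += (_phi(a - root_nc) - _phi(-a - root_nc)) * dens * h
    return total


def crt_power(delta, rho, n, J, K, alpha=0.05):
    """Power of the treatment-effect test in a blocked CRT (assumed formula)."""
    dfd = K * (J - 2)
    nc = K * J * n * delta ** 2 / (4.0 * (1.0 + (n - 1) * rho))
    # Find the central F critical value by bisection on the central CDF.
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if ncf1_cdf(mid, dfd, 0.0) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    f_crit = (lo + hi) / 2.0
    return 1.0 - ncf1_cdf(f_crit, dfd, nc)


if __name__ == "__main__":
    # Design from the text: delta = 0.2, n = 20, J = 10, K = 12, ICC = .17
    print(round(crt_power(0.2, 0.17, 20, 10, 12), 2))
```

Changing `rho` in the call shows how sensitive the power of a fixed design is to the assumed within-district ICC.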
Finally, it turns out that many different combinations of K, J, and n may give identical (or nearly identical) statistical power. The so-called optimal design or optimal allocation methods (which maximize precision or statistical power for a given cost function) are often used to assist in planning cluster randomized designs (see, e.g., Raudenbush, 1997). Optimal allocation depends on cost data and is also a function of the school-level ICC.
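To illustrate how optimal allocation depends on the ICC, the sketch below uses the familiar two-level result that, under a linear cost function, the variance-minimizing number of students per school is the square root of (cost per school / cost per student) × (1 − ρ)/ρ. The cost figures and function names are hypothetical, and the sketch ignores district-level costs.

```python
import math


def optimal_students_per_school(cost_per_school, cost_per_student, rho):
    """Students per school that minimize the variance of the treatment
    effect for a fixed budget, under a linear cost function."""
    return math.sqrt((cost_per_school / cost_per_student) * (1.0 - rho) / rho)


def affordable_schools(budget, cost_per_school, cost_per_student, n):
    """Total number of schools the budget supports with n students each."""
    return budget / (cost_per_school + n * cost_per_student)


if __name__ == "__main__":
    # Hypothetical costs: one school costs as much as 16 students.
    n_opt = optimal_students_per_school(1600.0, 100.0, 0.20)
    print(n_opt, affordable_schools(120000.0, 1600.0, 100.0, n_opt))
```

A smaller ICC pushes the optimum toward more students in fewer schools, which is one reason an inaccurate ICC can lead to a poorly allocated design.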
In summary, information about within-district school-level ICCs is crucial in planning experiments that use cluster randomized designs conducted either within a single district or using districts as blocks. ICCs are vital to both estimating the statistical power for a given design and optimally allocating resources to schools and students. This study adds to the empirical data about such values.
Previous Studies of Design Parameters
Several authors have assembled empirical evidence about ICCs to aid researchers in planning cluster randomized designs. For example, Bloom, Richburg-Hayes, and Black (2007) reported ICCs at several grade levels from five large urban school districts in the Eastern United States that had participated in evaluation studies. Bloom et al. (2008) extended the work of Bloom et al. (2007) to provide school-level parameters that extend beyond test scores and include other academic-related outcomes in the same five school districts, also providing ICCs for classrooms within schools. Brandon, Harrison, and Lawton (2013), in work that provides SAS code for estimating ICCs, also provide upper bound values for Hawaii, a state that is a single school district. Finally, Schochet (2008) provides values for ICCs based on large evaluation studies, but few of these are within-district values.
It is important to note that the variances of ICC estimates are inversely proportional to the number of schools used, and therefore ICC estimates from individual randomized trials (even relatively large ones) are subject to rather large sampling uncertainties (large standard errors). The same is true of ICC estimates from all but the largest school districts. Thus, the unrepresentative nature of the samples and the large sampling uncertainties of the estimates given in the studies cited earlier make them suboptimal as reference values for planning CRTs.
To provide ICC estimates from larger and more representative samples, Hedges and Hedberg (2007a) used a set of surveys with large (hundreds to thousands of schools) national probability samples to estimate school-level ICC values for reading and mathematics achievement from kindergarten through Grade 12. ICCs for rural areas were published in Hedges and Hedberg (2007b). Hedges and Hedberg (2011) also provide ICC estimates by grade, region, and certain school characteristics (such as socioeconomic status, achievement level, and urbanicity) via the so-called Online Variance Almanac (https://arc.uchicago.edu/reese/variance-almanac-academic-achievement). The ICC estimates are nationally representative and have acceptably small standard errors. However, the sampling designs of the surveys used did not permit the estimation of between-school district variation. Consequently between-district variation is pooled into between-school variation in the ICC estimates that were computed, which means that the ICCs computed are overestimates of the school-level ICCs (based on three-level models) that are relevant for planning CRTs that use one or a few districts.
To obtain better estimates of within-district school-level ICCs, Hedges and Hedberg (2014) expanded their national database of ICCs by providing values for reading and mathematics achievement based on the analyses of State Longitudinal Data Systems (SLDS) in 11 states (Arkansas, Arizona, Colorado, Florida, Louisiana, Kansas, Kentucky, Massachusetts, North Carolina, West Virginia, and Wisconsin; see http://www.ipr.northwestern.edu/research-areas/designparameters/stateva.html). For evaluations across schools where the investigative team is not concerned with school district effects, they provide school-level ICCs based on two-level models that pool district-level variation into school-level variation. They also provide estimates of the ICC system from three-level models (i.e., an ICC for district-level effects and another ICC for school-level effects). Westine, Spybrook, and Taylor (2014) provide similar values based on SLDS systems for science outcomes.
Why Additional ICC Estimates Are Necessary
The school-level ICC values derived from the statewide three-level models are useful for planning designs that employ a representative sample of districts from a state. However, the research reported in this article demonstrates that within-district school-level ICCs are not constant throughout states but depend on characteristics of districts, particularly on the number of schools in the district (district size). Therefore, pooled state within-district ICCs may be an average of dissimilar values that underestimates the ICCs in large districts and overestimates the ICCs in small districts. Moreover, because the pooled state average within-district ICCs give more weight to large districts (because they contribute more information), the pooled state average ICC estimates are particularly poor estimates of the ICCs in smaller school districts. Thus, the estimates of Hedges and Hedberg (2014) based on SLDSs may not be ideal for planning a CRT using a small number of districts, particularly if the districts in the CRT sample are not representative of the state (e.g., if they are smaller districts).
A review of recently published randomized control trials (RCTs) suggests that the typical RCT uses a small number of districts, usually just one or two. All of the reviewed studies that randomized at the student level used three or fewer districts, and half of the studies that randomized at the class or school level also used three or fewer districts. Overall, 66% of all studies reviewed used three or fewer districts. This is consistent with the idea that most researchers use local education agencies near their institutions to recruit participants and have insufficient resources to manage more than a handful of districts. These results are based on a review of over 20 articles and reports published over the last 3 years, primarily from the American Educational Research Journal, Educational Evaluation and Policy Analysis, Journal of Research on Educational Effectiveness, and The Journal of Experimental Education, which are primary outlets for education experiments (see Agodini, Harris, Thomas, Murphy, & Gallagher, 2010; Bottge, Grant, Stephens, & Rueda, 2010; Bradshaw, Mitchell, & Leaf, 2010; Calderón, Slavin, & Sánchez, 2011; Fantuzzo, Gadsden, & McDermott, 2011; Fulmer & Frijters, 2011; Gersten, Dimino, Jayanthi, Kim, & Santoro, 2010; Goodson et al., 2011; Hamre et al., 2012; Isenberg et al., 2009; Kim, Capotosto, Hartry, & Fitzgerald, 2011; Lane et al., 2011; Laura, McMeeking, Orsi, & Cobb, 2012; Marley, Levin, & Glenberg, 2010; Marley, Szabo, Levin, & Glenberg, 2011; McQuillin, Smith, & Strait, 2011; Olson et al., 2012; Phelan, Choi, Vendlinski, Baker, & Herman, 2011; Reis, McCoach, Little, Muller, & Kaniskan, 2011; Rose, Woolley, Orthner, Akos, & Jones-Sanpei, 2012; Sarama, Clements, Wolfe, & Spitler, 2012; Slavin, Cheung, Holmes, Madden, & Chamberlain, 2012; Springer et al., 2012; VanDerHeyden, McLaughlin, Algina, & Snyder, 2012; Vaughn, Klingner, et al., 2011; Vaughn, Wexler, et al., 2011; Wirkala & Kuhn, 2011; Wolf et al., 2010).
Analysis Plan
The purpose of our analysis is to estimate typical within-district ICCs by subject, grade, district size, urbanicity, and poverty status. Our analysis follows three steps. First, specific to subject and grade, we estimate district-specific school-level ICCs using 11 state data systems: Arkansas, Arizona, Colorado, Florida, Kansas, Kentucky, Louisiana, Massachusetts, North Carolina, West Virginia, and Wisconsin. All data were from the 2009–2010 school year, with the exception of Florida, which supplied data from the 2006–2007 school year, Louisiana, which supplied data from the 2012–2013 school year, and West Virginia, which supplied data from the 2011–2012 school year. Since all states test in Grades 3–8, we focused our analysis only on these grades.
Whether a district was included in the analysis was evaluated separately for each grade. Eligible districts were those that had test scores in at least two schools that served a particular grade (since ICCs are undefined in a district with a single school) and had a harmonic mean of at least two student scores per school. We use the harmonic mean because it is less sensitive to unusually large schools. We used a threshold of two students because the variance of the ICC, given subsequently in Equation 8, increases steeply for harmonic means of fewer than two students (regardless of the value of the ICC).
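The eligibility rule can be sketched as follows, assuming that each district is represented by a list of per-school counts of tested students in one grade; the function name and the example counts are hypothetical.

```python
from statistics import harmonic_mean


def district_is_eligible(students_per_school, min_schools=2, min_harmonic_mean=2.0):
    """Apply the two inclusion criteria for one district and one grade.

    students_per_school: counts of student scores, one entry per school.
    """
    counts = [c for c in students_per_school if c > 0]  # schools with scores
    if len(counts) < min_schools:
        return False  # the ICC is undefined with a single school
    return harmonic_mean(counts) >= min_harmonic_mean


if __name__ == "__main__":
    print(district_is_eligible([30, 25, 40]))  # several ordinary schools
    print(district_is_eligible([50]))          # only one school
    print(district_is_eligible([1, 1, 100]))   # harmonic mean is about 1.49
```

The last case shows why the harmonic mean is the stricter criterion: the arithmetic mean of 1, 1, and 100 is 34, but two of the three schools contribute almost no within-school information.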
Each state employed a different achievement test, namely the Augmented Benchmark Examination (Arkansas), Arizona’s Instrument to Measure Standards, Colorado’s Student Assessment Program, the Florida Comprehensive Assessment Test, the Kansas Assessment Program, the Commonwealth Accountability Testing System (Kentucky), Louisiana’s Integrated Educational Assessment Program, Massachusetts Comprehensive Assessment System, the North Carolina End of Grade Tests, West Virginia’s WESTEST, and the Wisconsin Knowledge and Concepts Examination.
Second, we compiled our district-specific ICCs into a database and assigned subgroup identifiers. Employing the Common Core of Data (Keaton, 2012), we estimated the 10th, 25th, and 50th percentiles of the size of districts that serve students in Grades 3, 4, 5, 6, 7, and 8. Our percentile analysis used the student count to weight the school records. Thus, we found the district size percentiles from the student point of view. For example, a 10th percentile of 3 schools means that 10% of students are served by districts with 3 or fewer schools. Weighting the districts by students served by grade, we found for Grades 3 and 4 that the 10th percentile of district size was 3 schools, the 25th percentile was 5 schools, and the 50th percentile was 10 schools. In Grades 5 and 6, the 10th, 25th, and 50th percentiles were 2, 5, and 11 schools, respectively. In Grade 7, the 10th, 25th, and 50th percentiles were 2, 3, and 6 schools, respectively. Finally, in Grade 8, the 10th, 25th, and 50th percentiles were 2, 3, and 7 schools, respectively. The sample of district-specific ICCs was then divided into four groups for each grade, using the 10th, 25th, and 50th percentiles of district size as cut points. These size groupings are labeled “very small,” “small,” “medium,” and “large” districts. We include these district sizes (numbers of schools) in the results tables for clarity.
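The grade-specific grouping can be sketched as a simple lookup using the cut points reported above. One assumption is labeled in the code: a district whose number of schools equals a cut point is placed in the lower group, which is consistent with the table notes describing large districts as those above the 50th percentile. The names `SIZE_CUTS` and `size_category` are hypothetical.

```python
# Grade -> (10th, 25th, 50th) student-weighted percentiles of district size,
# in numbers of schools, as reported in the text.
SIZE_CUTS = {
    3: (3, 5, 10),
    4: (3, 5, 10),
    5: (2, 5, 11),
    6: (2, 5, 11),
    7: (2, 3, 6),
    8: (2, 3, 7),
}


def size_category(grade, n_schools):
    """Return the district size group for a grade.

    Assumption: boundaries are inclusive on the lower group, so a district
    exactly at a cut point belongs to the smaller category.
    """
    p10, p25, p50 = SIZE_CUTS[grade]
    if n_schools <= p10:
        return "very small"
    if n_schools <= p25:
        return "small"
    if n_schools <= p50:
        return "medium"
    return "large"


if __name__ == "__main__":
    for grade, schools in [(3, 3), (3, 4), (7, 6), (8, 8)]:
        print(grade, schools, size_category(grade, schools))
```

Because the cut points differ by grade, the same district can fall into different size groups for different grades.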
Finally, we summarize the district-specific ICCs using a random effects meta-analytic approach (Borenstein, Hedges, Higgins, & Rothstein, 2011; Hedges & Vevea, 1998) as detailed subsequently. We do this by grade and subject and also for district size, poverty status, and urbanicity groups. Poverty is defined as a two-group variable indicating that the district has either (a) fewer than 50% of its students eligible for free or reduced-price lunch or (b) 50% or more of its students eligible for free or reduced-price lunch. Urbanicity is also a two-group variable indicating that the district is either (a) primarily not in an urban area or (b) primarily in an urban area. Urban areas are defined by Common Core of Data standards. An urban area meets one of the following criteria: it is a territory inside an urbanized area and inside a principal city with a population of 250,000 or more, a territory inside an urbanized area and inside a principal city with a population of less than 250,000 and greater than or equal to 100,000, or a territory inside an urbanized area and inside a principal city with a population of less than 100,000.
We provide ICCs by district size categories because there is a relationship between the log number of schools and the value of the ICC (presented in the results section). In addition to district size, many studies are focused on impoverished populations and/or urban populations, which may have different ICC values. To that end, we also provide results by district size for districts with more and fewer than 50% of students eligible for free or reduced-price lunch and for districts that are or are not located in urban areas. For example, researchers conducting evaluation studies in large urban school districts will find Tables 6 and 7 most useful.
Statistical Methodology
Estimating District-Specific ICCs
The district-specific ICCs were estimated by selecting each eligible district in each state, selecting all students within a specific grade, and setting the outcome to either the reading or the math score. Once selected, we estimated an unconditional two-level mixed model using restricted maximum likelihood,
$$y_{ij} = \mu + \eta_j + \epsilon_{ij},$$
where $y_{ij}$ is the score of the ith student in school j, μ is the average of the school averages, $\eta_j$ is the school random effect, and $\epsilon_{ij}$ is the student-level residual.
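For intuition about what a district-specific ICC measures, the sketch below computes one from raw scores. It deliberately swaps in a simpler technique: the one-way ANOVA (method of moments) estimator for unbalanced data, not the restricted maximum likelihood mixed model used in the article, so its values will generally differ somewhat from REML estimates. The function name and the toy scores are hypothetical.

```python
def anova_icc(scores_by_school):
    """One-way ANOVA estimate of the school-level ICC within one district.

    scores_by_school: list of lists, one list of student scores per school.
    Negative variance-component estimates are truncated at zero.
    """
    m = len(scores_by_school)                      # number of schools
    sizes = [len(s) for s in scores_by_school]
    total_n = sum(sizes)
    grand_mean = sum(sum(s) for s in scores_by_school) / total_n
    means = [sum(s) / len(s) for s in scores_by_school]

    ss_between = sum(n * (mj - grand_mean) ** 2 for n, mj in zip(sizes, means))
    ss_within = sum(
        (y - mj) ** 2 for s, mj in zip(scores_by_school, means) for y in s
    )
    ms_between = ss_between / (m - 1)
    ms_within = ss_within / (total_n - m)

    # Effective school size for unbalanced data.
    n0 = (total_n - sum(n * n for n in sizes) / total_n) / (m - 1)
    var_between = max(0.0, (ms_between - ms_within) / n0)
    return var_between / (var_between + ms_within)


if __name__ == "__main__":
    # Two tiny hypothetical schools with clearly different averages.
    print(anova_icc([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
```

When school averages differ by no more than chance would predict, the between-school component is truncated to zero and the estimated ICC is 0.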
Random Effects Meta-Analysis of District-Specific ICCs
Our analysis produced several thousand district-specific ICCs, many of which are estimated from small districts where concerns about privacy are relevant. Moreover, there is considerable variation in estimates from similar districts, undoubtedly due to random sampling error. Therefore, rather than providing tables with several hundreds of estimates, we summarize our results by presenting average ICCs derived from a random effects meta-analysis (Hedges & Vevea, 1998). Below, we provide a brief overview of this procedure in the context of our study.
The goal of a meta-analysis is to summarize the results of a series of estimations in order to provide guidance on the expected “effect.” In our case, the effect is the ICC, and we wish to estimate the population’s typical ICC, based on a given set of k estimates, for use in planning CRTs. If we assumed that the true ICC was the same in all districts (in other words treating the districts as fixed effects), we would conceptualize any estimate, $Y_i$, as the sum of the true effect, θ, and the sampling error, $\epsilon_i$,
$$Y_i = \theta + \epsilon_i.$$
Of course, we don’t know the true effect, only the estimates and the sampling variation associated with them. We can obtain an estimate of the true effect by using the inverse variance of each estimate as a weighting variable. For our ICCs, the estimated variance for the ith ICC is (Fisher, 1925):
$$v_i = \frac{2(1 - Y_i)^2\,[1 + (n_i - 1)Y_i]^2}{n_i (n_i - 1)(m_i - 1)}, \tag{8}$$
where $n_i$ is the harmonic mean number of students per school in the district and $m_i$ is the number of schools in that district. The weight for each ICC is then simply,
$$w_i = \frac{1}{v_i},$$
and the estimate of the true ICC would be defined as,
$$\hat{\theta} = \frac{\sum_{i=1}^{k} w_i Y_i}{\sum_{i=1}^{k} w_i}.$$
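The fixed effects calculation can be sketched as below. The variance function assumes the usual large-sample form of Fisher's variance for an ICC estimated from m schools of (harmonic mean) size n, namely 2(1 − ρ)²[1 + (n − 1)ρ]² / [n(n − 1)(m − 1)]; the function names and the example inputs are hypothetical.

```python
import math


def icc_variance(rho, n, m):
    """Large-sample variance of an ICC estimate (assumed Fisher form).

    rho: estimated ICC, n: harmonic mean students per school,
    m: number of schools in the district.
    """
    return 2.0 * (1.0 - rho) ** 2 * (1.0 + (n - 1.0) * rho) ** 2 / (
        n * (n - 1.0) * (m - 1.0)
    )


def fixed_effect_mean(estimates, variances):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, estimates)) / total
    return mean, math.sqrt(1.0 / total)


if __name__ == "__main__":
    # Three hypothetical districts: (ICC, students per school, schools).
    districts = [(0.05, 40.0, 4), (0.10, 35.0, 12), (0.15, 50.0, 30)]
    iccs = [d[0] for d in districts]
    variances = [icc_variance(*d) for d in districts]
    print(fixed_effect_mean(iccs, variances))
```

Note that the variance shrinks as the number of schools grows but blows up as n approaches 1, which is the reason for excluding districts with very small harmonic mean school sizes.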
However, a weakness of this approach is that we cannot assume the same ICC for all districts, even in a subgroup, for two reasons. First, we are making a generalization beyond the observed results. This introduces a random effect beyond the sampling error that must be addressed. A second, more nuanced set of problems with the fixed effects approach is that each state employs a different standardized test (at least in our data), each state organizes districts in a slightly different way, and the way districts organize their students is not universal. Thus, ICCs are derived from slightly different processes across our observed districts. As a result, we must employ a random effects approach to the meta-analysis.
In a random effects meta-analysis, we conceptualize the estimate, $Y_i$, as the sum of the average of the true effects, $\mu_\theta$, the district’s deviation from the average of the true effects, $\zeta_i$, and the sampling error, $\epsilon_i$,
$$Y_i = \mu_\theta + \zeta_i + \epsilon_i.$$
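We must therefore account for both the sampling error variance and the variance of the true district values around the average true effect. This is accomplished by estimating the between-district variance of the ICCs, τ². This quantity can be estimated with a method of moments estimator, given in Hedges and Vevea (1998) as:
$$\hat{\tau}^2 = \frac{Q - (k - 1)}{c}, \tag{12}$$
where
$$Q = \sum_{i=1}^{k} w_i (Y_i - \hat{\theta})^2, \qquad c = \sum_{i=1}^{k} w_i - \frac{\sum_{i=1}^{k} w_i^2}{\sum_{i=1}^{k} w_i},$$
$w_i$ is the inverse of the estimated sampling variance of the ith ICC, and $\hat{\theta}$ is the mean of the estimates weighted by $w_i$.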
Also note that this estimation makes no assumptions about the underlying distribution of the effects (i.e., ICCs). Therefore, it is still an unbiased estimate of the variation in ICCs. However, we do not recommend the use of this variance component to compute a range of plausible ICC values (e.g.,
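To test the null hypothesis that τ = 0, we use the fact that Q, given in Equation 12, has a χ² distribution with k − 1 degrees of freedom when τ = 0. With the estimate of τ², we calculate the random effects weight for each ICC:
$$w_i^{*} = \frac{1}{v_i + \hat{\tau}^2},$$
where $v_i$ is the estimated sampling variance of the ith ICC. The summary reported in our results is then the weighted mean of the observed ICCs:
$$\hat{\mu}_\theta = \frac{\sum_{i=1}^{k} w_i^{*} Y_i}{\sum_{i=1}^{k} w_i^{*}},$$
and its standard error (the square root of the inverse of the sum of the weights),
$$SE(\hat{\mu}_\theta) = \sqrt{\frac{1}{\sum_{i=1}^{k} w_i^{*}}}.$$
Note that the method of moments estimate of τ² can be negative, in which case we truncate it to 0.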
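The random effects steps can be sketched as follows, using the method of moments estimator of the between-district variance described in Hedges and Vevea (1998) with truncation at zero. The function name, the returned tuple layout, and the example inputs are hypothetical.

```python
import math


def random_effects_summary(estimates, variances):
    """Random effects weighted mean of ICC estimates.

    Returns (mean, standard error, tau squared, Q).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]               # fixed effects weights
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Homogeneity statistic and method of moments variance component.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)             # truncate negative values

    # Random effects weights add tau squared to each sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    mean = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum_w_star
    return mean, math.sqrt(1.0 / sum_w_star), tau2, q


if __name__ == "__main__":
    # Hypothetical ICC estimates with equal sampling variances.
    print(random_effects_summary([0.05, 0.10, 0.30], [0.001, 0.001, 0.001]))
```

When the estimates are no more variable than their sampling variances imply, the variance component is truncated to zero and the result coincides with the fixed effects weighted mean.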
Database of District-Specific Estimates
This section describes the database of district-specific estimates that we compiled. In a small number of cases, the ICC estimates were quite large and could inflate the estimate of τ². Therefore, to keep outliers from having a disproportionate influence on our estimate of τ², we removed the top 1% of estimates, that is, the 71 estimates above the 99th percentile (.557). Although this did not have a measurable impact on the mean estimate, it substantially decreased the estimate of τ². The result was a set of 3,555 ICCs for mathematics achievement and 3,557 ICCs for reading achievement. Table 1 presents the number of eligible districts by state, grade, and subject, along with the number of students used in the estimates.
Table 1. Number of Estimated ICCs and Sample Sizes.
Note. ICC = intraclass correlation. Numbers of students are given in parentheses.
The results presented in this article are based on over 3.1 million students. Of the ICCs computed, 16% are from urban areas and about 58% are from high-poverty areas. About 57% of the nonurban areas and 62% of the urban areas are high poverty. Figure 1 presents the number of ICCs estimated by district size, urbanicity, and poverty. The modal ICC is estimated from a nonurban, high-poverty, medium-sized district, followed by a similar small district. The next most common ICC is estimated from a nonurban, low-poverty, small district. Larger districts were more prevalent in urban areas as would be expected.

Figure 1. Number of estimates by grade-specific district size, urbanicity, and poverty.
The mean ICC estimated from all districts, grades, and subjects was .094, with a standard deviation of .092. The distribution is highly skewed, with a median of .056. The estimated ICCs for math had a mean of .104 and standard deviation of .104. The 10th, 25th, 50th, 75th, and 90th percentiles for math were .001, .027, .076, .149, and .246, respectively. The estimated ICCs for reading were generally lower, with a mean of .084 and standard deviation of .092. The 10th, 25th, 50th, 75th, and 90th percentiles for reading were <.001, .018, .056, .118, and .200, respectively.
Figure 2 presents box plots of the estimated district-specific ICCs by subject, grade, and district size. Each box plot presents a highly skewed distribution. In all districts, math ICCs tend to have a larger median than the reading ICCs, and the medians generally rise with grade level. The variance also increases with grade levels. Examining the box plots for the very small districts, those below the 10th percentile, we see the reverse pattern: ICCs and the variance decrease with grade. The small and medium school districts do not display a consistent pattern with grades, except that the eighth-grade variance is larger. Finally, large school districts are more reflective of the overall pattern.

Figure 2. Box plot of district-specific intraclass correlations (ICCs) by grade, grade-specific district size, and subject.
Finally, as support for presenting meta-analyses by district size, we estimated unweighted correlations between the district-specific ICCs and the log of district size (number of schools) for each subject and grade. The correlation coefficients ranged from .52 to .75, with a median of .70, which supports the claim that ICCs are related to district size.
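The correlation described here can be sketched as an ordinary (unweighted) Pearson correlation between the ICC estimates and the natural log of the number of schools. The function names and the small data set are hypothetical and are not drawn from the article's database.

```python
import math


def pearson_r(x, y):
    """Unweighted Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_log_size_correlation(n_schools, iccs):
    """Correlate district-specific ICCs with log district size."""
    return pearson_r([math.log(s) for s in n_schools], iccs)


if __name__ == "__main__":
    # Hypothetical districts for one grade and subject.
    sizes = [2, 3, 5, 8, 12, 20, 45]
    iccs = [0.01, 0.00, 0.04, 0.06, 0.05, 0.11, 0.16]
    print(round(icc_log_size_correlation(sizes, iccs), 2))
```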
Results of Meta-Analyses
Tables 2–11 present the estimated mean ICCs and τ for math and reading achievement by grade and district size. In these tables, we also present the empirical 25th, 50th (the median), and 75th percentiles to give a sense of the observed distribution and variance. Each table is organized in a series of horizontal panels, each for a district size category, with rows for each grade. The number of districts used for the analysis is denoted as k. Tables 2 and 3 present results for all districts. These results are useful for research designs that sample districts with a variety of characteristics and are not limited to only impoverished or rural areas.
Results of Random Effects Meta-Analysis of Within-District ICCs for Mathematics Achievement by District Size and Grade.
Note. ICC = intraclass correlation. a τ estimated as 0. Very small districts defined as the 10th percentile of size weighted by students served by grade. Small districts defined as the 25th percentile of size weighted by students served by grade. Medium districts defined as the 50th percentile of size weighted by students served by grade. Large districts defined as the >50th percentile of size weighted by students served by grade.
*p(τ = 0) < .05, standard errors are given in parentheses.
Results of Random Effects Meta-Analysis of Within-District ICCs for Reading Achievement by District Size and Grade.
Note. ICC = intraclass correlation. aτ estimated as 0. Very small districts defined as the 10th percentile of size weighted by students served by grade. Small districts defined as the 25th percentile of size weighted by students served by grade. Medium districts defined as the 50th percentile of size weighted by students served by grade. Large districts defined as the >50th percentile of size weighted by students served by grade.
*p(τ = 0) < .05, standard errors are given in parentheses.
Results of Random Effects Meta-Analysis of Nonurban Within-District ICCs for Mathematics Achievement by District Size and Grade.
Note. ICC = intraclass correlation. aτ estimated as 0. Very small districts defined as the 10th percentile of size weighted by students served by grade. Small districts defined as the 25th percentile of size weighted by students served by grade. Medium districts defined as the 50th percentile of size weighted by students served by grade. Large districts defined as the >50th percentile of size weighted by students served by grade.
*p(τ = 0) < .05, standard errors are given in parentheses.
Results of Random Effects Meta-Analysis of Nonurban Within-District ICCs for Reading Achievement by District Size and Grade.
Note. ICC = intraclass correlation. aτ estimated as 0. Very small districts defined as the 10th percentile of size weighted by students served by grade. Small districts defined as the 25th percentile of size weighted by students served by grade. Medium districts defined as the 50th percentile of size weighted by students served by grade. Large districts defined as the >50th percentile of size weighted by students served by grade.
*p(τ = 0) < .05, standard errors are given in parentheses.
Results of Random Effects Meta-Analysis of Urban Within-District ICCs for Mathematics Achievement by District Size and Grade.
Note. ICC = intraclass correlation. aτ estimated as 0. Very small districts defined as the 10th percentile of size weighted by students served by grade. Small districts defined as the 25th percentile of size weighted by students served by grade. Medium districts defined as the 50th percentile of size weighted by students served by grade. Large districts defined as the >50th percentile of size weighted by students served by grade.
*p(τ = 0) < .05, standard errors are given in parentheses.
Results of Random Effects Meta-Analysis of Urban Within-District ICCs for Reading Achievement by District Size and Grade.
Note. ICC = intraclass correlation. aτ estimated as 0. Very small districts defined as the 10th percentile of size weighted by students served by grade. Small districts defined as the 25th percentile of size weighted by students served by grade. Medium districts defined as the 50th percentile of size weighted by students served by grade. Large districts defined as the >50th percentile of size weighted by students served by grade.
*p(τ = 0) < .05, standard errors are given in parentheses.
Results of Random Effects Meta-Analysis of Low-Poverty Within-District ICCs for Mathematics Achievement by District Size and Grade.
Results of Random Effects Meta-Analysis of Low-Poverty Within-District ICCs for Reading Achievement by District Size and Grade.
Results of Random Effects Meta-Analysis of High-Poverty Within-District ICCs for Mathematics Achievement by District Size and Grade.
Results of Random Effects Meta-Analysis of High-Poverty Within-District ICCs for Reading Achievement by District Size and Grade.
However, other designs may be more specific. To serve those researchers, Tables 4 and 5 present results for nonurban districts. Tables 6 and 7 present results for urban districts. Tables 8 and 9 present results for low-poverty districts (fewer than 50% of students eligible for free or reduced-price lunch). Finally, Tables 10 and 11 present results for high-poverty districts (at least 50% of students eligible for free or reduced-price lunch). In this section, we present some patterns (or lack of patterns) of interest. Each pattern noted here has exceptions, but together they provide some basic insight into the distribution of ICCs.
Results for all Districts and Comparison With Statewide Estimates
As is typical in other studies, we generally see that ICCs for mathematics are higher than those for reading. This pattern is relatively stable, but there are exceptions: in medium-sized districts, the reading ICCs are larger in Grades 4 and 6, and in small districts they are larger in Grade 8. For mathematics achievement, the average within-district ICC derived from individual three-level models (students nested within schools nested within districts) from each state is about .11 for Grades 3, 4, and 5. The meta-analysis yields smaller ICCs for these grades: .07, .06, and .08, respectively. Our seventh-grade estimate is also smaller (.10 versus .13), whereas our eighth-grade estimate is slightly larger than that from the three-level models (.17 versus .16).
We also observe smaller estimated mean ICCs in our analysis for reading achievement in all grades, compared to the results from analyses that pool all data across districts within each state. In Grades 3 and 4, the average results from the state data that pool the information across districts are much larger than the estimated average of the district-specific estimates from the meta-analysis, .10 versus .05 and .10 versus .06, respectively. In Grade 7, the result from our meta-analysis is also smaller than the average from the pooled state analyses, .07 versus .10.
To place the results presented in this section in the context of estimates that pool all data across districts, we offer the following guidance. The results in this study are appropriate for planning targeted samples that do not represent an entire state. Conversely, the results from data that pool information across all districts are meant to inform designs that draw a sample from all districts.
Results by District Size
In most grades, the ICCs are larger in large districts. For example, in Table 2, the Grade 3 math ICCs for very small, small, medium, and large districts are .009, .012, .084, and .118, respectively. This general pattern holds for all grades in math except Grade 7, where the small districts have a larger ICC than the medium districts. Another notable feature is that the pattern is not exceptionally linear, with the larger districts having much larger ICCs than the smaller districts.
Also of note is that the largest districts show similar ICCs for earlier grades compared with the ICCs from three-level models. For example, the mean math ICC from the meta-analysis for Grade 3 in the largest districts is .118, which is similar to the average from the three-level model analyses (.112). This supports the hypothesis that the three-level models are unduly influenced by larger districts.
Results by Grade
Previous investigations found that the ICCs generally increase with grade level (Hedges & Hedberg, 2007a, 2007b, 2014). While we again find this is true in broad strokes, closer examination reveals a more complicated picture. Across all districts, we observe a pattern for math in which mean ICCs generally increase with grade, with Grades 3, 4, 5, 6, 7, and 8 having mean ICCs of .072, .063, .080, .084, .097, and .169, respectively. This pattern is not evident in reading, with the ICCs appearing to “bounce around” for Grades 3–7. These patterns hold in the smaller districts as well, although there is a linear increase in the ICCs by grade in the largest districts.
Results by Urbanicity
Tables 4 and 5 present ICCs for mathematics and reading achievement for nonurban districts, while Tables 6 and 7 present math and reading ICCs for urban districts. Some combinations of district size and urbanicity were not represented in our data and thus meta-analysis was not possible. For districts of any size, we generally find ICCs in urban areas are larger for the lower grades than those in nonurban areas. In the higher grades, the nonurban areas tend to have larger ICCs, especially in eighth grade.
Results by Poverty
Tables 8 and 9 present ICCs for mathematics and reading achievement for districts with less than a 50% rate of free or reduced-price lunch eligible students, while Tables 10 and 11 present math and reading ICCs for districts with at least half of students eligible for free or reduced-price lunch. For districts of any size, we generally find that ICCs in high-poverty districts are larger than those in low-poverty districts. The exception to this pattern is seventh-grade reading, where the high-poverty ICC is slightly lower than the low-poverty ICC.
In the smaller districts, we generally see that the high-poverty ICCs are higher than the low-poverty ICCs in the earlier grades (i.e., Grades 3–5). In Grades 6–8, however, it is the low-poverty ICCs that are smaller in the smaller districts. In the medium-sized districts, the reading ICCs are lower in most grades for low-poverty compared to high-poverty districts, whereas the math ICCs do not seem to follow a pattern. Finally, in the largest districts, the math ICCs are higher in the high-poverty districts for the lower grades, and the reading ICCs are generally higher in the high-poverty districts in most grades except eighth.
Variation in Estimated ICCs
In Tables 2–11, we also report the variation in ICCs for grade and district size combinations as the standard deviation τ.
Discussion
In this study, we provide expected ICCs by grade and subject in a variety of contexts that will also be of interest to evaluation researchers, namely district size, urbanicity, and poverty status. Although we generally found expected patterns, the smaller districts presented a picture that was less consistent. Perhaps the sampling variability associated with smaller districts creates difficulty in uncovering patterns of results or perhaps such patterns do not exist.
To our knowledge, this is the first investigation of the distribution of within-district ICCs. One of the more important findings of this study is simply that these ICCs tend to be quite small for earlier grades in the smaller districts. This is particularly important for planning interventions in these settings because pretests on academic achievement that might be used as covariates to improve statistical power are less frequently available in administrative data on younger children. Given the small ICC estimates, the practice of spending resources on pretests may not be necessary.
Finally, it is worth noting that the more heterogeneous districts, in terms of expected ICCs, are the largest districts. This is not surprising, given the diversity found in large urban areas. It does, however, highlight the need to utilize local data when available. Although previous publications have provided such data from the handful of large urban districts that have been studied in previous evaluations, our results provide some guidance in other situations where little data are available.
Limitations
This study has two limitations. The first is the amount of data available for producing ICC estimates. To be sure, we employ a relatively large amount of data compared with most education studies, but for this article the unit of analysis is the district. With 11 states, we have only between 400 and 780 districts per grade to analyze. Thus, more detailed breakdowns by urbanicity and poverty produce too few estimates in certain cells to yield reliable means. Until more data are available, the conservative approach would be to employ the larger of the available ICCs for more targeted samples (e.g., impoverished urban areas). Second, we would have liked a more detailed urbanicity breakdown, but again we are restricted by our sample size.
Conclusion
We have presented empirical evidence about design parameters useful in planning CRT experiments that use academic achievement as an outcome and that employ districts of a particular size or type. Our estimates are means derived from random effects meta-analyses and are presented along with standard errors that provide some sense of the sampling error inherent in these estimates. We now turn to the question of how best to use these design parameters in planning CRTs and what some limitations might be.
The ICC values reported in this tabulation differ from the national values reported by Hedges and Hedberg (2007a) and their more recent work (2014). Although the evidence reported in this article is based on near-census data from several state longitudinal data systems, the data come from only 11 states and only Grades 3–8. Although our estimates should be more relevant for some studies than national estimates like those of Hedges and Hedberg (2007a), there are significant heterogeneities across large districts. However, for certain applications, the results in this article may prove more useful than estimates derived from models that pool results across states. This study has also revealed that the distribution of within-district school-level ICCs is highly skewed, with numerous very small ICCs and a small number of large values. However, meta-analyses reveal that small districts have ICCs of relatively uniform size, with measures of variation seldom statistically differing from zero. A final caution is that the estimates reported here are based on state assessments and thus would be less relevant to studies using achievement tests that are not aligned with instruction.
Example Power Analyses for CRTs
Putting such limitations aside, these values can be used with several pieces of software designed for multilevel power computations, including Optimal Design for Windows (Spybrook, Raudenbush, Liu, Congdon, & Martínez, 2006), RDPOWER for Stata (Hedberg, 2012), and commercial software such as CRT-Power (Borenstein, Hedges, & Rothstein, 2012). Here, we provide an example with immediate commands in Stata.
Suppose we were to perform an experiment that impacts mathematics achievement of third graders in eight large school districts, with each district having 12 schools. We plan to collect data on 30 students in each of these 96 schools. Power to detect an effect size of 0.2 is computed using the noncentrality parameter (Equation 4) by entering the following into Stata:
. scalar K = 12
. scalar J = 8
. scalar n = 30
. scalar es = .2
. scalar rho = .118
. display nFtail(1, K*(J-2), (K*J*es^2)/(4*(rho+((1-rho)/n))), invFtail(1, K*(J-2), .05))
The result gives the statistical power of a two-tailed test at the .05 level of significance as .71.
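For readers without access to Stata, the same calculation can be sketched in Python using only the standard library. This is an illustrative translation, not the authors' code: the function names (`norm_cdf`, `two_sided_tail`, `critical_t`, `crt_power`) are hypothetical, and the degrees of freedom and noncentrality parameter are copied from the Stata expression above rather than from Equation 4 itself. The sketch relies on the fact that a noncentral F variable with 1 numerator degree of freedom is the square of a noncentral t variable, and it evaluates the tail probability by simple numerical integration, so the result is approximate.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def two_sided_tail(c, df, delta=0.0, steps=4000, s_max=4.0):
    """Approximate P(|T| > c) for T ~ noncentral t(df, delta).

    Writes T = (Z + delta) / S with S = sqrt(chi2_df / df) and integrates
    over the density of S with a midpoint rule (adequate for moderate df).
    """
    h = s_max / steps
    # log of the constant in the density of S: 2*df / (2^(df/2) * Gamma(df/2))
    log_norm = math.log(2.0 * df) - (df / 2.0) * math.log(2.0) - math.lgamma(df / 2.0)
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        v = df * s * s
        log_dens = log_norm + math.log(s) + (df / 2.0 - 1.0) * math.log(v) - v / 2.0
        # upper tail P(T > c | S = s) plus lower tail P(T < -c | S = s)
        tail = norm_cdf(delta - c * s) + norm_cdf(-delta - c * s)
        total += math.exp(log_dens) * tail * h
    return total

def critical_t(df, alpha=0.05):
    """Two-sided central t critical value, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if two_sided_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def crt_power(K, J, n, es, rho, alpha=0.05):
    """Power of the two-tailed test, mirroring the Stata expression in the text."""
    df = K * (J - 2)  # denominator df, exactly as written in the Stata command
    lam = (K * J * es ** 2) / (4.0 * (rho + (1.0 - rho) / n))  # noncentrality
    c = critical_t(df, alpha)  # square root of invFtail(1, df, alpha)
    return two_sided_tail(c, df, math.sqrt(lam))  # counterpart of nFtail(1, df, lam, c^2)

power = crt_power(K=12, J=8, n=30, es=0.2, rho=0.118)
print(round(power, 2))
```

Under these assumptions the printed value should land close to the .71 reported in the text. Changing `rho`, `n`, or the numbers of schools and districts shows how sensitive the design is to the ICC values tabulated here.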
Summary
The main finding of this study is that district size matters. In some cases, employing smaller districts (and thus fewer schools) may yield better power because the ICCs are substantially smaller. Although investigators rarely control which districts participate in a study, suppose we are designing a study for third-grade math and the choice is between using 4 medium districts with 10 schools each (ICC = .031) and 4 large districts with 20 schools each (ICC = .118). Holding other factors constant, and assuming a fixed effects design, the smaller districts yield slightly more power for each effect size. Of course, employing a single large district, even with a larger ICC, may have the practical benefit of only having to recruit a single education agency.
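To make the comparison concrete, the noncentrality parameters of the two designs can be worked out by hand. The arithmetic below is a sketch under stated assumptions: it uses the noncentrality expression from the Stata command in the power example, writes M for the total number of schools (our label, not the article's notation), and borrows n = 30 students per school and an effect size of δ = 0.2 from that example, since the Summary does not state these values.

```latex
\lambda = \frac{M\,\delta^{2}}{4\left[\rho + (1-\rho)/n\right]}

\lambda_{\text{medium}} = \frac{40 \times 0.2^{2}}{4\left[0.031 + 0.969/30\right]}
                        = \frac{1.6}{0.2532} \approx 6.32

\lambda_{\text{large}} = \frac{80 \times 0.2^{2}}{4\left[0.118 + 0.882/30\right]}
                       = \frac{3.2}{0.5896} \approx 5.43
```

Under these assumptions the medium-district design has the larger noncentrality parameter despite using half as many schools. Its test has fewer denominator degrees of freedom, which offsets part of that gain and is consistent with the power advantage being slight.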
Footnotes
Authors’ Note
The opinions expressed are those of the authors and do not necessarily represent the views of the Institute or the U.S. Department of Education.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D110032, NORC at The University of Chicago.
