Lorenz Interpolation: A Method for Estimating Income Inequality from Grouped Income Data

Abstract

To understand how income inequality affects individuals and communities, researchers must have accurate measures of income inequality at lower geographic levels, such as counties, school districts, and census tracts. Studies of income inequality, however, are constrained by the tabular format in which censuses publish income data. In this article, the author proposes a new method, Lorenz interpolation, for estimating income inequality from binned income data. Using public microsample data from the American Community Survey (ACS), the author shows that Lorenz interpolation produces more accurate and reliable income inequality estimates than do alternative estimation methods. Then, using restricted ACS income data obtained through a Federal Statistical Research Data Center, the author evaluates the accuracy of Lorenz interpolation at the census tract and school district levels. Lorenz interpolation produces reliable school district–level estimates, but the method produces less reliable estimates for some income inequality measures at the tract level. These findings indicate that researchers should refrain from estimating tract-level income inequality measures from tabular data. They also show that aggregating tract income distributions to higher geographic levels can produce valid estimates of income inequality.

Keywords

income inequality, grouped data, estimation

Income inequality has increased substantially in many countries around the world. In the United States, income inequality is higher today than it has been at any point since the beginning of the Great Depression (Galster and Sharkey 2017). Rising income inequality has been implicated in a range of negative public health consequences (Pickett et al. 2005; Pickett and Wilkinson 2007), and research shows a connection between income inequality and economic segregation (Reardon and Bischoff 2011), which has implications for educational stratification (Owens 2016). Growing income inequality has also driven increasing inequities in the intergenerational transmission of wealth (Chetty et al. 2017), which has led to decreased levels of social mobility (Grusky and MacLean 2016).

Amid growing concern around these and other possible consequences of rising income inequality, researchers have sought ways to measure the distribution of income more precisely to trace how and where it is changing over time. They are constrained in their efforts, however, by the format in which most countries release their income data. To protect respondent confidentiality, censuses generally publish income data in tabular format, as counts in income brackets ($0–$9,999, $10,000–$14,999, . . ., ≥$200,000). Researchers attempting to use these data to estimate income inequality must make assumptions about how incomes are distributed within these brackets. Moreover, the highest income bracket in these data is always unbounded at the top, which makes estimating the upper tail of the income distribution especially difficult. For some income inequality measures, even small errors in estimation of this upper tail can lead to large errors in estimation of income inequality.

A number of studies have proposed methods to estimate income statistics using grouped income data (Gastwirth and Glauberman 1976; Hajargasht et al. 2012; Jargowsky and Wheeler 2018; Kakwani 1976; Minoiu and Reddy 2012; Quandt 1966; Tillé and Langel 2012; von Hippel, Hunter, and Drown 2017; von Hippel, Scarpino, and Holas 2016). Two recent studies show that income distribution means, which are provided in the public portions of many national censuses, can be used to produce more accurate estimates of income inequality (Jargowsky and Wheeler 2018; von Hippel et al. 2017). Still, two problems remain. First, even these methods have difficulty with inequality measures that rely heavily on the upper tail of the income distribution. This is unfortunate because some of these measures have desirable properties. For instance, the Theil coefficient can be decomposed into groups to determine how different kinds of households or people contribute to income inequality. This makes Theil coefficients useful to researchers interested in understanding how changing levels of income inequality between geographic subregions or racial groups have contributed to overall changes in income inequality.

Second, these methods have been evaluated only using income data for large geographic regions, such as metropolitan areas and counties. To understand the social consequences of income inequality, it often makes more sense to focus on a lower geographic level. For instance, someone interested in the relationship between income inequality and educational inequality may want to work at the level of the school district, which determines the portion of a school’s budget that comes from property taxes, or the school attendance zone, which delineates which schools a child is eligible to attend. Alternatively, if a researcher wants to understand how income inequality is experienced in one’s community, the relevant geography might be the neighborhood, which is often operationalized by census tract. Given that geographies such as these cover areas with smaller populations, existing methods for estimating income inequality may be insufficiently reliable when deployed at lower levels of analysis. Some researchers have admonished against using these methods to estimate inequality for smaller geographic regions (von Hippel et al. 2016), but how these methods fare at producing income inequality estimates for these regions has yet to be empirically tested.

In this article, I outline a new method for estimating income inequality from grouped income data. The method, which I call Lorenz interpolation, consists of using income quantiles derived from these data to estimate the Lorenz curve of the underlying income distribution.¹ This Lorenz curve is used to create a weighted sample of exact incomes, from which income statistics can be derived. A key advantage of Lorenz interpolation over other methods comes from the accuracy with which the method estimates the upper tail of the income distribution.²

Using public microsample data from the 2011 to 2015 American Community Survey (ACS) and working at the public-use microdata area (PUMA) level, I show that Lorenz interpolation outperforms mean-constrained integration over brackets (MCIB) and cumulative density function (CDF) interpolation at estimating income inequality, especially inequality measures that are sensitive to the upper tail of the income distribution. Then, I evaluate the performance of Lorenz interpolation at two lower geographic levels, the census tract level and the school district level, using restricted census data to which I have been granted access through a Federal Statistical Research Data Center (FSRDC).³ Results indicate that Lorenz interpolation produces more accurate estimates of income inequality at all three geographic levels. However, tract-level estimates are insufficiently reliable for most analyses. Lorenz interpolation yields more accurate estimates of the Gini, Theil, and Atkinson coefficients at the school district level. Geographies such as these that consist of small groups of census tracts are sufficiently large for the accurate estimation of income inequality. I conclude with a discussion of the use cases of Lorenz interpolation and some unresolved issues surrounding the measurement of income inequality.

Background

A common method for deriving income inequality statistics from grouped income data is to set the incomes in the closed income bins to their midpoints and estimate a Pareto distribution for the open bin at the top of the income distribution (Henson 1967; Jargowsky 1996).⁴ Fitting a Pareto distribution to the top bin requires that one estimate a Pareto distribution shape parameter, $α$ , which can be computed using counts of households in the top two bins of the grouped income data (Jargowsky and Wheeler 2018:346). This technique, known as the Pareto-linear procedure, is based on Pareto’s observation that the relationship between populations and incomes of the income bins in the upper tail of the income distribution tends to be linear (Nielsen and Alderson 1997). Once $α$ is estimated, the mean of the Pareto distribution for the top bin can be computed with the following formula:

μ_{top} = β \frac{α}{α - 1},

(1)

where $β$ is the lower bound of the top income bin. Incomes in this bracket are set to the estimate of $μ_{top}$ .

One problem with this method is that mean estimates of the top bin derived from the Pareto-linear procedure are not robust to even small errors in estimation of $α$ (von Hippel et al. 2016). This is particularly true when this procedure underestimates $α$ , as the mean of the Pareto distribution is infinite when $α$ is less than or equal to 1 and grows without bound as $α$ approaches 1 from above. Furthermore, by imputing midpoints for the closed bins and a Pareto distribution mean for the open bin, this technique ignores variation within the income brackets. Depending on the shape of the underlying income distribution, this omission can produce negatively or positively biased income inequality estimates (Heitjan 1989).
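The sensitivity of equation (1) to $α$ can be illustrated with a minimal Python sketch. The estimator for $α$ below is an assumption on my part: it is the ratio that follows from the Pareto survival function applied to the top two bins, and the source cited above should be consulted for the exact procedure. The function names and household counts are hypothetical.

```python
import math

def pareto_alpha(n_second, n_top, l_second, l_top):
    """Shape parameter from household counts in the top two bins.

    Assumed estimator: under a Pareto tail,
    (n_second + n_top) / n_top = (l_top / l_second) ** alpha.
    """
    return math.log((n_second + n_top) / n_top) / math.log(l_top / l_second)

def pareto_top_mean(beta, alpha):
    """Equation (1): mean of a Pareto distribution with lower bound beta."""
    if alpha <= 1:
        raise ValueError("the Pareto mean is infinite when alpha <= 1")
    return beta * alpha / (alpha - 1)

# Hypothetical counts: 300 households in the second-highest bin
# (lower bound $150,000) and 200 in the open top bin (lower bound $200,000).
alpha = pareto_alpha(300, 200, 150_000, 200_000)
top_mean = pareto_top_mean(200_000, alpha)

# The implied top-bin mean for several values of alpha: small changes
# near 1 move the mean by millions of dollars.
sensitivity = {a: pareto_top_mean(200_000, a) for a in (1.05, 1.1, 1.5, 2.0, 3.0)}
```

With a lower bound of $200,000, moving $α$ from 2.0 to 1.1 raises the implied top-bin mean from $400,000 to roughly $2.2 million, which is why small errors in $α$ matter so much.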

Recently, researchers have gotten around limitations of the midpoint-Pareto imputation method by using the income distribution mean to estimate probability density functions and CDFs from grouped income data.⁵ Proposing a method called CDF interpolation, von Hippel et al. (2017) used grouped income data to plot a set of points along the CDF of the income distribution. After interpolating the CDF, the authors drew on the income distribution mean to apply an upper bound to the CDF.⁶ Income statistics are derived from the CDF through numerical integration. When tested on county-level income distributions, CDF interpolation estimated Gini coefficients within 1 percent to 2 percent of their true values (von Hippel et al. 2017:651).⁷

Jargowsky and Wheeler (2018) also proposed a method that takes advantage of the income distribution mean. Rather than fitting a function to points along the CDF, the authors estimated a probability density function (PDF) directly from the grouped income data. Their technique, which they called mean-constrained integration over brackets, consists of estimating a piecewise linear function for the closed income bins, fitting a Pareto distribution to the top bin, and integrating over the resulting distribution.⁸ MCIB uses the mean of the income distribution to estimate the mean of the top income bin.⁹ The authors showed their method outperforms many estimation techniques that do not incorporate the income distribution mean.

Methods that use the income distribution mean produce more accurate estimates of some inequality measures, but these techniques are less successful at estimating parameters that are sensitive to the upper tail of the income distribution. This is because the top bin in grouped income data usually does not have an upper bound. The method I put forth here provides more accurate estimates of inequality measures that rely on the upper tail of the income distribution. This method, which I call Lorenz interpolation, uses the income quantiles contained in grouped income data to estimate a Lorenz curve.

Lorenz curves visually represent the level of inequality in a distribution. To create a Lorenz curve from a set of incomes, one sorts these incomes in ascending order and computes a running sum. This is divided by the total income to produce a cumulative income share, which is plotted against the cumulative population share. The cumulative population share is plotted on the x axis, and the cumulative income share is plotted on the y axis.
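The construction just described can be sketched in a few lines of Python. This is an illustration using a hypothetical function name and made-up incomes, not code from the study.

```python
def lorenz_points(incomes):
    """Return (cumulative population share, cumulative income share) pairs."""
    ordered = sorted(incomes)          # ascending order
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]              # every Lorenz curve starts at the origin
    running = 0.0
    for i, x in enumerate(ordered, start=1):
        running += x                   # running sum of incomes
        points.append((i / n, running / total))
    return points

# Four hypothetical households.
points = lorenz_points([40_000, 10_000, 30_000, 20_000])
```

In this example the poorest quarter of households holds 10 percent of total income and the poorest half holds 30 percent, so the curve bows below the 45-degree line of perfect equality.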

Much of the information about the shape of the Lorenz curve is provided by the income quantiles in grouped income data. The connection between income quantiles and the Lorenz curve is reflected by the following equation¹⁰:

L^{'} (F (x)) = \frac{x}{μ},

(2)

where $L^{'}$ is the derivative of the Lorenz function taken with respect to the cumulative population share $F (x)$ (the proportion of households that have an income less than or equal to $x$ ), $μ$ is the income distribution mean, and $x$ is an income quantile. Equation (2) says the slope of the Lorenz curve can be calculated at any point by dividing the income quantile at that point by the income distribution mean. Given that the income data provided in the census include both the income distribution mean and several income quantiles, these data can be used to determine the slope of the Lorenz curve at several points along the x axis. For example, if 10 percent of the population belongs to the bottom income bin, and this bin has an upper bound of $10,000, then the Lorenz curve has a slope of $\frac{10000}{μ}$ at $F (x)$ = 0.1, where $F (x)$ is plotted on the x axis. In addition to using each bin’s income quantiles to compute slopes along the Lorenz curve, one can use the quantiles of neighboring bins to estimate the upward trajectory of the Lorenz curve. This trajectory determines the bin mean estimates.
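As a rough illustration of equation (2), the following Python sketch computes the Lorenz curve slope at each closed-bin boundary from grouped data. The function name, bin bounds, counts, and mean are hypothetical.

```python
def lorenz_slopes(upper_bounds, counts, mean):
    """Slope of the Lorenz curve, x / mean, at each closed-bin upper bound.

    upper_bounds: upper income bounds of the closed bins, ascending.
    counts: household counts for every bin, including the open top bin.
    Returns a list of (cumulative population share, slope) pairs.
    """
    total = sum(counts)
    pairs = []
    cumulative = 0
    # zip stops at the last closed bin; the open top bin has no upper bound.
    for bound, count in zip(upper_bounds, counts):
        cumulative += count
        pairs.append((cumulative / total, bound / mean))
    return pairs

# Hypothetical region: three closed bins plus an open top bin, mean $50,000.
pairs = lorenz_slopes([10_000, 25_000, 75_000], [100, 300, 500, 100], 50_000)
```

The first pair reproduces the example in the text: 10 percent of households fall below $10,000, so the slope at a cumulative share of 0.1 is 10,000 divided by the mean.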

Although Lorenz curves are rarely found in sociological research, economists have developed techniques to estimate inequality statistics by interpolating the Lorenz curve (Cowell and Mehta 1982; Gastwirth 1971; Gastwirth and Glauberman 1976; Gastwirth, Nayak, and Krieger 1986; Tillé and Langel 2012). Many of these techniques were developed for data that contain more information about the income distribution than do the data in many national censuses (e.g., Cowell and Mehta 1982). In contrast, this study imposes the same limitations on the data as those found in publicly available income data.

Methods

Lorenz interpolation produces income statistics by building a spline function that approximates the Lorenz curve of the underlying income distribution. Each segment of this spline is a cubic function estimated using points on the Lorenz curve and a set of constraints that determine slopes for the Lorenz curve at each of the income boundaries. The estimated Lorenz curve is used to produce a weighted sample of exact incomes, from which inequality statistics are derived.

Building a Lorenz Curve from Grouped Income Data

Figure 1 illustrates the steps through which Lorenz interpolation estimates a Lorenz curve. The vertical dashed lines represent cumulative population shares that can be computed from the grouped income data. The income boundaries associated with these lines are used to compute slopes along the Lorenz curve, which are represented by the gray line segments.

Figure 1.

Lorenz interpolation in three steps.

Lorenz interpolation approximates a Lorenz curve by estimating several cubic functions in a sequence. The left plot of Figure 1 shows the first cubic defined by Lorenz interpolation. This function is constrained to pass through (0, 0), the point at which the Lorenz curve begins. The slope of the function is constrained to equal $\frac{l_{1}}{μ}$ at $x = F (0)$ , $\frac{u_{1}}{μ}$ at $x = F (u_{1})$ , and $\frac{u_{2}}{μ}$ at $x = F (u_{2})$ , where $l_{1}$ is the lower bound of the first bin, $u_{1}$ and $u_{2}$ are the upper bounds of the first and second bins, and $F$ is the cumulative population share plotted on the x axis. After the first cubic function, ${\hat{L}}_{1}$ , is estimated, the y-coordinate of the point on the Lorenz curve associated with the lower bound of the second bin is estimated as ${\hat{L}}_{1} (F (l_{2}))$ . Then, a second function is defined that is constrained to go through ( $F (l_{2})$ , ${\hat{L}}_{1} (F (l_{2}))$ ) with slopes of $\frac{l_{2}}{μ}$ at $x = F (l_{2})$ , $\frac{u_{2}}{μ}$ at $x = F (u_{2})$ , and $\frac{u_{3}}{μ}$ at $x = F (u_{3})$ (see the center plot in Figure 1).¹¹ These steps are repeated for all remaining bins except the top bin.¹²
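One way to see how one point constraint and three slope constraints pin down a cubic is the following Python sketch. It is a reading of the description above rather than the study's own implementation, and it omits any refinements given in the notes. The function names, population shares, and mean are hypothetical.

```python
def fit_cubic_segment(point, slope_constraints):
    """Fit L(F) = a*F**3 + b*F**2 + c*F + d.

    point: (F0, L0), a point the cubic must pass through.
    slope_constraints: three (F, slope) pairs, where slope = income / mean.
    """
    (f0, s0), (f1, s1), (f2, s2) = slope_constraints
    # The derivative 3a*F**2 + 2b*F + c is the quadratic through the three
    # (F, slope) pairs; Newton divided differences give its coefficients.
    d1 = (s1 - s0) / (f1 - f0)
    d12 = (s2 - s1) / (f2 - f1)
    d2 = (d12 - d1) / (f2 - f0)
    a = d2 / 3
    b = (d1 - d2 * (f0 + f1)) / 2
    c = s0 - d1 * f0 + d2 * f0 * f1
    # The point constraint fixes the constant term.
    x0, y0 = point
    d = y0 - (a * x0**3 + b * x0**2 + c * x0)
    return a, b, c, d

def cubic_value(coeffs, f):
    a, b, c, d = coeffs
    return ((a * f + b) * f + c) * f + d

def cubic_slope(coeffs, f):
    a, b, c, _ = coeffs
    return (3 * a * f + 2 * b) * f + c

# Hypothetical region: mean $50,000; 10, 16, and 24 percent of households
# fall below $10,000, $15,000, and $20,000, respectively.
mean = 50_000
first = fit_cubic_segment(
    (0.0, 0.0),
    [(0.0, 0 / mean), (0.10, 10_000 / mean), (0.16, 15_000 / mean)],
)
# The first cubic supplies the y-coordinate that anchors the second cubic.
anchor_y = cubic_value(first, 0.10)
second = fit_cubic_segment(
    (0.10, anchor_y),
    [(0.10, 10_000 / mean), (0.16, 15_000 / mean), (0.24, 20_000 / mean)],
)
```

Because the derivative of a cubic is a quadratic, the three slopes determine the first three coefficients and the point determines the fourth, so each segment has exactly one solution as long as the three population shares are distinct.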

The absence of an upper bound makes defining a cubic function for the top bin particularly challenging. Cubic functions fit to points along the Lorenz curve potentially underestimate the variance of incomes at the top of the income distribution (Kakwani 1976:489). To remedy this issue, Lorenz interpolation uses slope constraints to control the rate at which the cubic function applied to this bin curves upward. First, a quadratic function is defined that passes through two points, ( $F (l_{16})$ , ${\hat{L}}_{15} (F (l_{16}))$ ) and (1, 1). This function is constrained to have a slope of $\frac{l_{top}}{μ}$ at the bin lower bound. The slope of this function is then reduced at the bin midpoint. This results in a gradually increasing cubic that captures the variance at the upper tail of the income distribution.¹³
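The first step of the top-bin procedure, the quadratic through two points with a fixed starting slope, can be sketched in Python as follows. The subsequent slope adjustment at the bin midpoint is not reproduced here because its details are not given in the text. The function name and all input values are hypothetical.

```python
def top_bin_quadratic(f_low, l_low, slope_low):
    """Quadratic through (f_low, l_low) and (1, 1) with a given slope at f_low.

    f_low: cumulative population share at the top bin's lower bound.
    l_low: estimated Lorenz value at that point.
    slope_low: lower bound of the top bin divided by the mean.
    Returns two functions: the quadratic and its slope.
    """
    width = 1.0 - f_low
    # Curvature chosen so that the quadratic reaches 1 at F = 1.
    k = (1.0 - l_low - slope_low * width) / width**2

    def value(f):
        return l_low + slope_low * (f - f_low) + k * (f - f_low) ** 2

    def slope(f):
        return slope_low + 2 * k * (f - f_low)

    return value, slope

# Hypothetical region: the top bin starts at $200,000, holds 5 percent of
# households and 20 percent of income, and the overall mean is $80,000.
value, slope = top_bin_quadratic(0.95, 0.80, 200_000 / 80_000)
implied_top_mean = 80_000 * (1.0 - 0.80) / (1.0 - 0.95)
```

The mean of the top bin is fixed by the two endpoints alone (here $320,000), so the shape of the curve between them affects only how incomes are spread within the bin.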

Having estimated a Lorenz curve, the final step is to create a sample of exact incomes based on this curve. A computationally efficient way to do this is to plot equidistant points along the Lorenz curve and multiply the slopes of the line segments connecting these points by the income distribution mean. This generates samples from the underlying income distribution, which can be weighted using the frequencies in the grouped income data. This weighted sample is used to produce various income statistics.
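The sampling step can be sketched in Python as follows. The sketch assumes a single Lorenz function defined on the whole unit interval and gives every sampled income equal weight, which is a simplification of the weighting by bin frequencies described above. The function name and the stand-in Lorenz curve are hypothetical.

```python
def sample_incomes(lorenz, mean, n_points=1000):
    """Approximate incomes as the mean times the slope of each chord.

    lorenz: function mapping a cumulative population share to a
            cumulative income share.
    """
    step = 1.0 / n_points
    incomes = []
    for k in range(n_points):
        rise = lorenz((k + 1) * step) - lorenz(k * step)
        incomes.append(mean * rise / step)   # chord slope times the mean
    return incomes

# Stand-in Lorenz curve L(F) = F**2, which corresponds to incomes
# distributed uniformly between 0 and twice the mean.
incomes = sample_incomes(lambda f: f * f, mean=50_000, n_points=1000)
sample_mean = sum(incomes) / len(incomes)
```

Because the chord rises sum to the total rise of the curve, the mean of the sampled incomes reproduces the income distribution mean regardless of how many points are used.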

The PDF Implied by Lorenz Interpolation

The PDF implied by Lorenz interpolation is a piecewise function consisting of several reciprocal square root functions. This can be recognized by deriving the PDF from the Lorenz curve. Equation (3) shows a segment of the estimated Lorenz curve, ${\hat{L}}_{j}$ , as a function of the cumulative population share $F$ :

{\hat{L}}_{j} (F (x_{j})) = a_{j} F (x_{j})^{3} + b_{j} F (x_{j})^{2} + c_{j} F (x_{j}) + d_{j} for F (l_{j}) \leq F (x_{j}) < F (u_{j}),

(3)

where $x_{j}$ is the set of incomes in bin $j$ and $a_{j}$ , $b_{j}$ , $c_{j}$ , and $d_{j}$ are the coefficients of the cubic function defined for the portion of the Lorenz curve associated with bin $j$ . The domain of this function is bounded by $F (l_{j})$ and $F (u_{j})$ , the cumulative population shares at the lower and upper bounds of bin $j$ .

To compute the PDF from the Lorenz function, one must first derive the CDF from this function. One way to do this is to use the following rule noted by Gastwirth (1971):

L (F (x)) = μ^{- 1} \int_{0}^{F (x)} F^{- 1} (t) dt for 0 \leq F (x) \leq 1 .

(4)

This is the definition of the Lorenz curve as a function of the cumulative population share $F .$ Taking the derivative of both sides of this equation with respect to $F (x)$ and multiplying by $μ$ yields the following:

μ L^{'} (F (x)) = F^{- 1} (F (x)) = x for 0 \leq F (x) \leq 1 .

(5)

It follows that taking the derivative with respect to $F (x)$ of equation (3) and multiplying by the distribution mean $μ$ will produce an equation that gives income, $x_{j}$ , as a function of $F (x_{j})$ :

μ {\hat{L}}_{j}^{'} (F (x_{j})) = x_{j} = μ (3 a_{j} F {(x_{j})}^{2} + 2 b_{j} F (x_{j}) + c_{j}) for F (l_{j}) \leq F (x_{j}) < F (u_{j}) .

(6)

Everything can be moved to the right side of the equation by dividing both sides by $μ$ and subtracting $\frac{x_{j}}{μ}$ :

0 = 3 a_{j} F (x_{j})^{2} + 2 b_{j} F (x_{j}) + c_{j} - \frac{x_{j}}{μ} for F (l_{j}) \leq F (x_{j}) < F (u_{j}) .

(7)

The function for the CDF, $F (x_{j})$ , can now be computed using the quadratic formula:

F (x_{j}) = \frac{- 2 b_{j} + \sqrt{4 {b_{j}}^{2} - 12 a_{j} (c_{j} - \frac{x_{j}}{μ})}}{6 a_{j}} for F (l_{j}) \leq F (x_{j}) < F (u_{j}) .

(8)

Taking the derivative of both sides with respect to $x_{j}$ yields the PDF of the income distribution:

f (x_{j}) = \frac{1}{μ \sqrt{4 {b_{j}}^{2} - 12 a_{j} (c_{j} - \frac{x_{j}}{μ})}} for F (l_{j}) \leq F (x_{j}) < F (u_{j}) .

(9)

Finally, the domain is rewritten in terms of $x_{j}$ :

f (x_{j}) = \frac{1}{μ \sqrt{4 {b_{j}}^{2} - 12 a_{j} (c_{j} - \frac{x_{j}}{μ})}} for l_{j} \leq x_{j} < u_{j} .

(10)

Note that the slope constraints used to compute $a_{j}$ , $b_{j}$ , $c_{j}$ , and $d_{j}$ ensure the domain of this function is bounded at the income boundaries of bin $j$ . One can compute these boundaries from the approximated Lorenz curve by plugging $F (l_{j})$ and $F (u_{j})$ into $μ {\hat{L}}_{j}^{'} (\cdot)$ (equation 2), which is also the expression used here to derive the CDF from the Lorenz function.
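The derivation in equations (6) through (10) can be checked numerically. The Python sketch below uses hypothetical coefficients for a single segment and assumes $a_{j}$ is nonzero; it confirms that equation (8) inverts equation (6) and that equation (10) matches a finite-difference derivative of equation (8).

```python
import math

def cdf_from_cubic(a, b, c, mean, x):
    """Equation (8): cumulative population share at income x within bin j."""
    disc = 4 * b**2 - 12 * a * (c - x / mean)
    return (-2 * b + math.sqrt(disc)) / (6 * a)

def pdf_from_cubic(a, b, c, mean, x):
    """Equation (10): density at income x within bin j."""
    disc = 4 * b**2 - 12 * a * (c - x / mean)
    return 1.0 / (mean * math.sqrt(disc))

# Hypothetical coefficients for one segment of the Lorenz curve.
a, b, c, mean = 0.5, 0.4, 0.1, 50_000
x = 30_000

share = cdf_from_cubic(a, b, c, mean, x)
# Equation (6) should recover the income from the population share.
x_recovered = mean * (3 * a * share**2 + 2 * b * share + c)

# Central difference of the CDF, to compare with the analytic density.
h = 1.0
numeric_pdf = (cdf_from_cubic(a, b, c, mean, x + h)
               - cdf_from_cubic(a, b, c, mean, x - h)) / (2 * h)
analytic_pdf = pdf_from_cubic(a, b, c, mean, x)
```

The positive root in equation (8) is the branch on which income rises with the cumulative population share, which is why the density in equation (10) is positive wherever the discriminant is positive.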

Figure 2 shows an example of a PDF based on Lorenz interpolation. The spikes along the upper tail of the income distribution are a result of the square root function Lorenz interpolation uses to approximate each bin of the income distribution. These spikes indicate some clustering of incomes at the bin lower bounds along the upper tail of the income distribution. Despite the presence of these spikes, the results show that Lorenz interpolation yields accurate estimates of local income statistics such as bin means, income quantiles, and income shares.

Figure 2.

Example of a probability density function based on Lorenz interpolation.

Data and Measures

Data for this study come from the 2011 to 2015 five-year pooled ACS. The ACS contains social and economic data for the U.S. population and is administered on a rolling basis by the U.S. Census Bureau. To provide more reliable estimates, the Census Bureau publishes ACS data in five-year groupings, which cover roughly 5 percent of the U.S. population (Ruggles et al. 2021). I used household income data from the public-use microdata sample (PUMS) component of the ACS, which contains households’ exact incomes.¹⁴ I calculated estimates of income inequality for 1,185 PUMAs, which are the smallest geographies for which PUMS data are available.

To calculate income inequality for tracts, I used restricted census data to which I was granted access through a FSRDC. These data contain exact incomes with geographic information down to the block level. To produce income inequality estimates for school districts, I used a crosswalk between census tracts and school districts from the National Center for Education Statistics (Geverdt 2019). Tract-level incomes were assigned to school districts on the basis of the distribution of each tract’s land area across school districts. For example, if 60 percent of a tract’s area is in district A and 40 percent is in district B, 60 percent of its incomes were assigned to district A and the remaining 40 percent were assigned to district B. School-district data are based on boundaries from 2013, which falls in the middle of the study period (2011–2015). These analyses were based on 69,675 tracts and 13,360 school districts.
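The areal allocation can be illustrated with a short Python sketch. It attaches the tract's area share to each income as a weight instead of assigning whole households to districts, which is an assumption made to keep the example deterministic. The function name, identifiers, and data layout are hypothetical and do not reflect the structure of the crosswalk file.

```python
from collections import defaultdict

def allocate_to_districts(tract_incomes, area_shares):
    """Assign each tract's incomes to districts, weighted by area share.

    tract_incomes: {tract_id: [income, ...]}
    area_shares: {tract_id: {district_id: share of the tract's land area}}
    Returns {district_id: [(income, weight), ...]}.
    """
    districts = defaultdict(list)
    for tract, incomes in tract_incomes.items():
        for district, share in area_shares[tract].items():
            for income in incomes:
                districts[district].append((income, share))
    return dict(districts)

# One hypothetical tract split 60/40 between districts A and B.
tract_incomes = {"tract_1": [30_000, 50_000, 90_000]}
area_shares = {"tract_1": {"A": 0.6, "B": 0.4}}
allocated = allocate_to_districts(tract_incomes, area_shares)
weight_in_a = sum(w for _, w in allocated["A"])
```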

Evaluating Lorenz Interpolation

To assess the performance of Lorenz interpolation, I created two data sets, an exact income data set and a grouped income data set. The latter was produced by converting each region’s exact income data into counts in income brackets, which were determined by the income bounds the Census Bureau has used since 2000. I calculated each region’s “true” level of income inequality by plugging the exact income data into an income inequality formula.¹⁵ I applied Lorenz interpolation to the grouped income data to generate an income inequality estimate. Finally, I calculated the error metrics by comparing “true” and estimated income inequality.
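Converting exact incomes into bracket counts can be sketched in Python as follows. The bracket bounds are taken from the bin labels reported in Table 2; the function name and the handling of negative incomes are assumptions.

```python
from bisect import bisect_right

# Lower bounds of the 16 income brackets, from the bin labels in Table 2.
LOWER_BOUNDS = [0, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000, 40_000,
                45_000, 50_000, 60_000, 75_000, 100_000, 125_000, 150_000,
                200_000]

def group_incomes(incomes):
    """Count incomes in each bracket; the last bracket is open at the top."""
    counts = [0] * len(LOWER_BOUNDS)
    for x in incomes:
        # Assumption: incomes below zero are placed in the first bracket.
        index = max(bisect_right(LOWER_BOUNDS, x) - 1, 0)
        counts[index] += 1
    return counts

# Five hypothetical households.
counts = group_incomes([5_000, 10_000, 12_000, 199_999, 250_000])
```

An income exactly at a boundary, such as $10,000, falls in the bracket that starts at that boundary, which matches brackets labeled $0 to $9,999, $10,000 to $14,999, and so on.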

I evaluated Lorenz interpolation with four measures of income inequality. First, I looked at Gini coefficients, which can be computed using the following formula:

G = \frac{\sum_{i = 1}^{n} \sum_{j = 1}^{n} | x_{i} - x_{j} |}{2 n^{2} \bar{x}} .

(11)

The Gini coefficient is the mean absolute difference among incomes divided by twice the mean income. Next, I computed the Theil coefficient. Based on information theory, Theil coefficients account for the level of entropy, or unpredictability, in a data set. Income distributions with low entropy have higher Theil coefficients, which indicate more inequality. Theil coefficients are calculated using the following formula:

T = \frac{1}{n} \sum_{i = 1}^{n} \frac{x_{i}}{\bar{x}} \log \frac{x_{i}}{\bar{x}} .

(12)

One nice feature of Theil coefficients is that they can be decomposed to determine contributions from different groups. This means that Theil contributions can be estimated for each bin of an income distribution. Performing such a decomposition shows that most of the Theil coefficient can often be attributed to the top bracket of grouped income data. I also computed the Atkinson index (Atkinson 1970). This measure includes a parameter that determines the relative influence of the lower and upper tails of the income distribution. It is defined by the following equation:

A = \begin{cases} 1 - \frac{1}{μ} {(\frac{1}{n} \sum_{i = 1}^{n} x_{i}^{1 - ε})}^{1 / (1 - ε)} & \text{for } 0 \leq ε \neq 1 \\ 1 - \frac{1}{μ} {(\prod_{i = 1}^{n} x_{i})}^{1 / n} & \text{for } ε = 1 \end{cases},

(13)

where $ε$ , the inequality aversion parameter, determines the relative influence of the income distribution tails. I chose to set $ε$ to 0.2, which increases the influence of the upper tail on income inequality. Finally, I computed the standard deviation of the income distribution.
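The three inequality measures can be written directly from their definitions. The Python sketch below is an unweighted illustration with hypothetical function names; it is not the code used to produce the results. The Gini function divides the sum of absolute differences by $2 n^{2}$ times the mean, which is the standard normalization, and the Theil and Atkinson functions assume strictly positive incomes.

```python
import math

def gini(incomes):
    """Mean absolute difference divided by twice the mean."""
    n = len(incomes)
    mean = sum(incomes) / n
    total_diff = sum(abs(xi - xj) for xi in incomes for xj in incomes)
    return total_diff / (2 * n * n * mean)

def theil(incomes):
    """Theil coefficient; requires strictly positive incomes."""
    n = len(incomes)
    mean = sum(incomes) / n
    return sum((x / mean) * math.log(x / mean) for x in incomes) / n

def atkinson(incomes, epsilon=0.2):
    """Atkinson index; requires strictly positive incomes."""
    n = len(incomes)
    mean = sum(incomes) / n
    if epsilon == 1:
        # Geometric mean, computed through logs for numerical stability.
        equivalent = math.exp(sum(math.log(x) for x in incomes) / n)
    else:
        power_mean = sum(x ** (1 - epsilon) for x in incomes) / n
        equivalent = power_mean ** (1 / (1 - epsilon))
    return 1 - equivalent / mean

# Four hypothetical households.
sample = [10_000, 20_000, 30_000, 140_000]
results = {
    "gini": gini(sample),
    "theil": theil(sample),
    "atkinson": atkinson(sample, epsilon=0.2),
}
```

The double loop in the Gini function takes time proportional to the square of the number of incomes, so a sorted-rank formula would be preferable for large samples.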

I compared the numbers generated from Lorenz interpolation to estimates from two other methods: von Hippel et al.’s (2017) CDF interpolation method and Jargowsky and Wheeler’s (2018) MCIB method. To implement CDF interpolation, I used the binsmooth package in R (Hunter and Drown 2016). To implement MCIB, I used Jargowsky’s (2019) MCIB module, which is available in Stata.

Results

Table 1 compares error terms from Gini, Theil, Atkinson, and standard deviation estimates based on MCIB, CDF interpolation, and Lorenz interpolation. The top three rows show the errors produced by running these methods without the income distribution means, and the bottom three rows display errors based on estimates that incorporate the income distribution means.¹⁶ Following von Hippel and colleagues (2017), I calculated the percentage relative bias and percentage relative root mean squared error (RMSE) of the estimates. These measures are based on the percentage estimation error ( $100 * \frac{\hat{θ} - θ}{θ}$ ). The percentage relative bias is the mean of the percentage estimation errors, and the percentage RMSE is the RMSE of the percentage estimation errors. Unlike the absolute bias and RMSE, these metrics are invariant to the different scales of the inequality measures, making accuracy comparisons between the measures possible. All estimates are rounded to the nearest hundredth.
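The two error metrics can be computed as in the following Python sketch, which uses hypothetical function names and made-up values in place of the study's estimates.

```python
import math

def percent_errors(estimates, truths):
    """Percentage estimation error for each region."""
    return [100 * (est - true) / true for est, true in zip(estimates, truths)]

def relative_bias(estimates, truths):
    """Mean of the percentage estimation errors."""
    errors = percent_errors(estimates, truths)
    return sum(errors) / len(errors)

def relative_rmse(estimates, truths):
    """Root mean squared percentage estimation error."""
    errors = percent_errors(estimates, truths)
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Two hypothetical regions with a true Gini coefficient of 0.40 each.
estimates = [0.41, 0.39]
truths = [0.40, 0.40]
bias = relative_bias(estimates, truths)
rmse = relative_rmse(estimates, truths)
```

In this example the errors of +2.5 percent and -2.5 percent cancel, giving a relative bias of zero but a relative RMSE of 2.5 percent, which shows why the two metrics are reported together.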

Table 1.

Gini, Theil, Atkinson, and Standard Deviation Error Terms for Public-Use Microdata Areas

	Gini Coefficient		Theil Coefficient		Atkinson Index		S.D.
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB (no $μ$ )	2.10	4.04	.83	6.98	−1.84	7.92	11.70	19.10
CDF (no $μ$ )	−1.48	3.27	−4.71	8.76	−4.97	8.61	−4.93	11.40
Lorenz (no $μ$ )	−.70	3.01	−1.25	8.22	−1.74	7.44	.00	11.40
MCIB ( $μ$ )	−.76	.97	−5.48	6.36	−1.81	4.81	−1.04	3.94
CDF ( $μ$ )	−1.25	1.38	−4.79	5.11	−4.72	5.86	−5.00	6.46
Lorenz ( $μ$ )	−.49	.75	−1.54	2.49	−1.82	3.04	−.89	2.84
N = 1,185

Source: American Community Survey, 2011 to 2015.

Note: CDF = cumulative density function; MCIB = mean-constrained integration over brackets; RMSE = root mean squared error.

Looking at the first three rows of the table, which compare inequality estimates without the income distribution mean, CDF interpolation outperformed MCIB at estimating the Gini coefficient and the standard deviation, and MCIB outperformed CDF interpolation at estimating the Theil and the Atkinson coefficients. Lorenz interpolation outperformed both methods at estimating the Gini coefficient, Atkinson coefficient, and standard deviation, and it performed worse than MCIB at estimating the Theil coefficient. Turning to estimates that incorporate the income distribution mean, Lorenz interpolation estimates had lower relative RMSEs for all four inequality measures and lower relative bias for all measures except the Atkinson coefficient. For Theil coefficients, Atkinson measures, and standard deviations, Lorenz interpolation produced considerably more accurate estimates as measured by the relative RMSE. Whereas Theil coefficient estimates based on MCIB had a relative RMSE of 6.36 percent, Theil coefficients from Lorenz interpolation had a relative RMSE of 2.49 percent. Furthermore, standard deviation estimates based on CDF interpolation had a relative RMSE of 6.46 percent, whereas standard deviation estimates from Lorenz interpolation had a relative RMSE of only 2.84 percent. Although Lorenz interpolation also outperformed the other methods at estimating the Gini coefficient, all three methods produced highly accurate Gini coefficient estimates, with relative RMSEs ranging from 0.75 percent to 1.38 percent.

Table 2 compares the relative bias and RMSE terms of bin mean estimates based on MCIB, CDF interpolation, and Lorenz interpolation. This table also includes the error terms for each method’s estimate of the total closed bin income, which determines the top bin mean estimates.

Table 2.

Bin Mean Error Terms for Public-Use Microdata Areas

	$0–$10,000		$10,000–$15,000		$15,000–$20,000		$20,000–$25,000
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB	13.53	32.72	1.34	2.93	.91	1.85	.87	1.50
CDF	30.56	45.10	1.28	3.02	.97	2.00	.83	1.58
Lorenz	18.77	33.39	.90	2.55	.89	1.69	.62	1.37
N = 1,185
	$25,000–$30,000		$30,000–$35,000		$35,000–$40,000		$40,000–$45,000
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB	.88	1.31	.90	1.26	.64	.96	.74	1.00
CDF	.85	1.43	.89	1.31	.62	.99	.74	1.05
Lorenz	.86	1.28	.73	1.16	.64	.94	.59	.90
N = 1,185
	$45,000–$50,000		$50,000–$60,000		$60,000–$75,000		$75,000–$100,000
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB	.34	.72	.86	1.14	.44	.94	.37	1.01
CDF	.31	.75	.88	1.23	.38	1.01	.33	1.15
Lorenz	.32	.67	.74	1.03	.04	.87	−.11	1.09
N = 1,185
	$100,000–$125,000		$125,000–$150,000		$150,000–$200,000		≥$200,000		Total Closed Bin Income
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB	.54	1.04	.16	.99	−.40	1.96	−3.23	7.26	.47	.62
CDF	.58	1.12	.11	1.07	−.11	1.92	−2.64	5.48	.54	.68
Lorenz	−.32	1.12	−.60	1.23	−2.01	2.50	1.51	5.08	−.25	.45
N = 1,185

Note: CDF = cumulative density function; MCIB = mean-constrained integration over brackets; RMSE = root mean squared error.

Among the 16 bin mean estimates shown in Table 2, MCIB had the lowest relative bias for 1 bin and the lowest relative RMSE for 4 bins, CDF interpolation had the lowest relative bias for 5 bins and the lowest RMSE for 1 bin, and Lorenz interpolation had the lowest relative bias for 10 bins and the lowest relative RMSE for 11 bins. Lorenz interpolation also had the lowest relative bias and RMSE for the total closed bin income. Both MCIB and CDF interpolation produced positively biased estimates of the total closed bin income. As a result, these methods produced negatively biased estimates of the top bin mean.

The greater accuracy of Lorenz interpolation can be attributed in part to how this method estimates the top bin mean, but this does not tell the whole story. Although Atkinson index estimates based on Lorenz interpolation had slightly higher relative bias than MCIB Atkinson index estimates, the relative RMSE of MCIB Atkinson index estimates was substantially higher. This reflects the latter method’s use of the Pareto distribution to approximate the upper tail of the income distribution. Figure 3 shows scatterplots for the residuals of Atkinson index estimates from all three estimation methods. The solid lines denote the bias associated with each estimation method, and the dashed lines represent a 1 standard deviation distance from the solid lines. Comparing the space between the dashed lines in each plot, Lorenz interpolation had more reliable estimates than those produced using the other methods. The difference in reliability of Lorenz interpolation estimates is even larger at the school district and census tract levels.

Figure 3.

Residuals from Atkinson estimates.
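
The quantities plotted in Figure 3 can be illustrated with a short sketch. It assumes that the relative error of an estimate is 100 × (estimate − true value) / true value, that relative bias is the mean of these errors, and that relative RMSE is the root of their mean square; these definitions are stated here as assumptions, and the input values are hypothetical rather than taken from the analysis.

```python
import math

def relative_errors(estimates, truths):
    """Return (relative bias, relative RMSE, SD of relative errors), all in percent.

    Assumed definitions: relative error = 100 * (estimate - truth) / truth.
    """
    errs = [100.0 * (e - t) / t for e, t in zip(estimates, truths)]
    n = len(errs)
    bias = sum(errs) / n
    rmse = math.sqrt(sum(x * x for x in errs) / n)
    sd = math.sqrt(sum((x - bias) ** 2 for x in errs) / n)  # population SD
    return bias, rmse, sd

# Hypothetical Atkinson index estimates and microdata-based values for four areas.
est = [0.21, 0.19, 0.26, 0.30]
true = [0.20, 0.20, 0.25, 0.32]

bias, rmse, sd = relative_errors(est, true)
# Solid line in a residual plot: the bias. Dashed lines: one SD on either side.
lower_band, upper_band = bias - sd, bias + sd
```

Under these definitions the squared relative RMSE equals the squared bias plus the variance of the relative errors, which is why a method with low bias can still have a high relative RMSE when its estimates are widely scattered.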

Estimating Tract-Level and School District–Level Inequality Measures

Estimates of income inequality based on grouped data are less accurate for small regions (von Hippel et al. 2016). This is due chiefly to sampling variation, but it may also reflect the heterogeneity of income distributions associated with smaller regions. The income distributions of neighborhoods or municipalities vary more than those of larger regions such as metropolitan areas, which encompass entire regional economies and may resemble each other more. This raises the question of whether estimation techniques such as Lorenz interpolation can be used to produce valid income inequality estimates for small areas.

Table 3 shows the error terms for tract-level and school district–level estimates. Note the size of the relative RMSEs: these errors are larger for tracts than for school districts, and they are larger for school districts than for PUMAs. Comparing MCIB and CDF interpolation, the latter method produced Theil coefficient and Atkinson index estimates with lower relative RMSEs at the tract and school district levels. CDF interpolation also outperformed MCIB at estimating the standard deviation at the tract level, but the two methods performed comparably at the school district level. MCIB estimates tended to have lower relative bias than CDF estimates, but the relative RMSEs indicate this difference in bias is outweighed by the variance of the MCIB estimates.

Table 3.

Tract and School-District Inequality Error Terms

	Gini Coefficient				Theil Coefficient
	Tract		School District		Tract		School District
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB	−1.00	2.33	−1.01	1.41	−2.32	23.60	−7.18	14.70
CDF	−1.18	2.32	−1.44	1.76	−4.65	7.82	−6.67	7.80
Lorenz	−.08	1.93	−.41	1.04	−.55	6.59	−2.15	4.86
	Atkinson Index				S.D.
	Tract		School District		Tract		School District
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB	−1.20	15.60	−3.19	13.47	−.31	18.33	−5.97	12.64
CDF	−4.62	9.01	−5.99	7.96	−5.02	10.50	−9.36	12.24
Lorenz	−.82	6.13	−1.99	4.33	.15	8.54	−3.44	8.43
N	69,675		13,360		69,675		13,360

Source: American Community Survey, 2011 to 2015.

Note: CDF = cumulative density function; MCIB = mean-constrained integration over brackets; RMSE = root mean squared error.

Compared with MCIB and CDF interpolation, Lorenz interpolation produced slightly more accurate estimates of the Gini coefficient and significantly more accurate estimates of the Theil coefficient, Atkinson index, and standard deviation. At the school district level, Theil coefficient, Atkinson index, and standard deviation estimates based on Lorenz interpolation had 31 percent to 46 percent lower relative RMSEs than those based on the next best method, CDF interpolation. The gap between MCIB and Lorenz interpolation was even larger: the relative RMSEs from Theil coefficient, Atkinson index, and standard deviation estimates based on Lorenz interpolation were about 33 percent to 67 percent lower than those based on MCIB. Tract-level error metrics from Lorenz interpolation were also lower than those based on the other methods. However, the tract-level relative RMSEs for the Theil coefficient, Atkinson index, and standard deviation were large (roughly 6 percent to 8.5 percent). Furthermore, these numbers do not account for sampling variation. For income data based on the five-year pooled ACS, the influence of sampling variation on tract-level estimates is substantial.
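
The percentage reductions quoted above follow directly from the school district columns of Table 3. The sketch below recomputes them; the dictionary layout and the function name are illustrative choices, and only the relative RMSE values come from the table.

```python
# Relative RMSEs (percent) at the school district level, copied from Table 3.
rmse = {
    "Theil": {"MCIB": 14.70, "CDF": 7.80, "Lorenz": 4.86},
    "Atkinson": {"MCIB": 13.47, "CDF": 7.96, "Lorenz": 4.33},
    "S.D.": {"MCIB": 12.64, "CDF": 12.24, "Lorenz": 8.43},
}

def pct_reduction(baseline, new):
    """Percentage by which `new` falls below `baseline`."""
    return 100.0 * (baseline - new) / baseline

# Reduction in relative RMSE achieved by Lorenz interpolation over each alternative.
reductions = {
    measure: {
        alt: round(pct_reduction(values[alt], values["Lorenz"]), 1)
        for alt in ("CDF", "MCIB")
    }
    for measure, values in rmse.items()
}

for measure, row in reductions.items():
    print(measure, row)
```

The reductions relative to CDF interpolation fall between about 31 and 46 percent, and those relative to MCIB between about 33 and 68 percent, in line with the ranges reported in the text.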

In summary, Lorenz interpolation, CDF interpolation, and MCIB produced accurate Gini coefficient estimates at the tract and school district levels. These methods seem to be up to the task of estimating income inequality measures that depend less on the upper tail of the income distribution. CDF produced more accurate estimates of the Theil coefficient, Atkinson index, and standard deviation than did MCIB, but Lorenz interpolation outperformed both methods at estimating these measures.

Quantile and Income Share Estimates from MCIB, CDF Interpolation, and Lorenz Interpolation

Table 4 shows relative bias and relative RMSE terms for MCIB, CDF interpolation, and Lorenz interpolation estimates of income quantiles and income shares. Income quantile estimates were produced for the 20th, 40th, 60th, 80th, and 95th percentiles of the income distribution at the PUMA level. Lorenz interpolation estimates had the lowest relative bias and RMSE for all percentiles except at the 80th percentile, where MCIB produced the most accurate estimates. Moving to income shares, I estimated statistics for each quintile of the income distribution. Lorenz interpolation produced the most accurate estimates of all quintiles except the bottom quintile, for which MCIB produced the most accurate estimates.

Table 4.

Quantile and Income Share Error Terms for PUMAs

	Q20		Q40		Q60		Q80
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB	.88	1.85	.65	1.20	.54	1.10	.27	1.57
CDF	.84	1.81	.61	1.21	.49	1.13	.49	1.73
Lorenz	.69	1.76	.51	1.18	.22	1.06	−.66	1.80
N = 1,185
	Q95		Income Share 20		Income Share 40		Income Share 60
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB	−2.87	6.58	2.88	5.36	1.16	1.73	.86	1.53
CDF	2.48	5.51	4.91	9.69	1.30	5.84	.91	5.67
Lorenz	.54	4.97	3.10	5.41	.89	1.55	.47	1.35
N = 1,185
	Income Share 80		Income Share 100
	Relative Bias	Relative RMSE	Relative Bias	Relative RMSE
MCIB	.54	1.26	−.90	1.06
CDF	.76	5.21	−1.21	3.44
Lorenz	−.21	1.25	−.38	.72
N = 1,185

Note: CDF = cumulative density function; MCIB = mean-constrained integration over brackets; RMSE = root mean squared error.

The greater accuracy of the Lorenz interpolation quantile estimates compared with CDF interpolation is particularly surprising given that the latter method derives income statistics from the CDF, which is the inverse of the percentile function. Note, however, that Lorenz interpolation estimates of the income quantiles and income shares were only slightly more accurate than CDF interpolation or MCIB estimates (a few hundredths of a percentage point in many cases). Although Lorenz interpolation can be used to estimate income quantiles, income shares, and other local statistics, its main utility is for estimating income inequality.
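
To make the relationship between the CDF and the percentile function concrete, the sketch below reads a quantile off a piecewise-linear CDF built from closed bins. It is the simplest linear version of the idea, not the CDF interpolation implementation evaluated in this article; the function name, bin boundaries, and counts are hypothetical, and the open-ended top bin found in census tables is left out.

```python
from bisect import bisect_left

def quantile_from_bins(edges, counts, p):
    """Read the p-th quantile off a piecewise-linear CDF built from closed bins.

    edges  -- bin boundaries in ascending order, len(counts) + 1 values
    counts -- number of households in each bin
    p      -- cumulative share in [0, 1]
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    cum = [0]
    for c in counts:
        cum.append(cum[-1] + c)
    target = p * cum[-1]
    # Index of the first bin whose cumulative count reaches the target.
    k = min(bisect_left(cum, target, 1), len(counts))
    lo_count, hi_count = cum[k - 1], cum[k]
    if hi_count == lo_count:  # empty bin: return its lower boundary
        return float(edges[k - 1])
    frac = (target - lo_count) / (hi_count - lo_count)
    return edges[k - 1] + frac * (edges[k] - edges[k - 1])

# Hypothetical grouped data: four closed bins with 100 households in total.
edges = [0, 25_000, 50_000, 100_000, 200_000]
counts = [20, 30, 40, 10]

median = quantile_from_bins(edges, counts, 0.5)
q80 = quantile_from_bins(edges, counts, 0.8)
```

Linear interpolation assumes incomes are spread uniformly within each bin, which is one reason quantile estimates from different methods tend to differ only slightly when bins are narrow.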

Discussion

I conclude with a discussion of the use cases of Lorenz interpolation, some contexts in which the method should not be applied, and some ways the method could be extended. Starting with use cases, Lorenz interpolation is a viable method for estimating income inequality at geographic levels that are not accounted for in census microdata but for which census summary table data are available. For instance, PUMS data only cover a subset of U.S. counties.¹⁷ Researchers interested in examining income inequality for all U.S. counties must rely on grouped income data provided in census summary tables. For geographies such as these, including counties and census-designated places, using Lorenz interpolation on summary census data is a valid way to estimate income inequality.

Lorenz interpolation can also be used to estimate income distributions for geographies that are not provided in census data but can be approximated by aggregating incomes from a geographic level that is provided in the census. For example, Owens’s (2016) recent work on U.S. economic segregation, which is organized around school district boundaries, uses the robust Pareto-midpoint estimator (von Hippel et al. 2016), an improved version of the technique of imputing bracket midpoints for incomes in closed brackets and assigning a Pareto distribution mean to incomes in the top bracket. For a study such as this one, Lorenz interpolation would be a preferable method for estimating income inequality. Alternatively, researchers interested in the implications of income inequality for disparities in the quality of local public services may wish to approximate municipal income distributions by aggregating tract-level data to the municipal level. For such an analysis, Lorenz interpolation would yield more accurate estimates of municipal income inequality and should be used in lieu of Pareto-midpoint estimators and the other methods discussed here.
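
The aggregation step described above amounts to summing bin counts across the tracts assigned to each larger unit before any inequality measure is estimated. The sketch below assumes, for simplicity, that every tract falls wholly within one district; tracts that straddle boundaries would need allocation weights, which are not modeled here. All identifiers and counts are hypothetical.

```python
def aggregate_bins(tract_counts, tract_to_district):
    """Sum binned household counts over the tracts assigned to each district.

    tract_counts      -- {tract_id: [count in bin 0, count in bin 1, ...]}
    tract_to_district -- {tract_id: district_id}; one district per tract (assumption)
    """
    totals = {}
    for tract, counts in tract_counts.items():
        district = tract_to_district[tract]
        if district not in totals:
            totals[district] = [0] * len(counts)
        totals[district] = [a + b for a, b in zip(totals[district], counts)]
    return totals

# Hypothetical tracts with three income bins each.
tract_counts = {
    "T1": [40, 35, 25],
    "T2": [10, 50, 40],
    "T3": [60, 30, 10],
}
tract_to_district = {"T1": "D1", "T2": "D1", "T3": "D2"}

district_counts = aggregate_bins(tract_counts, tract_to_district)
```

The aggregated counts can then be passed to whichever estimator is applied to grouped data, in the same way as counts published directly for a geography.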

There are many use cases for Lorenz interpolation, but there are also situations where this method should not be used, either because the PUMS data are sufficient or because grouped data are inadequate. For example, researchers who require income statistics at the metropolitan statistical area or state level should simply use PUMS data, which include geographic information for large regions such as metropolitan statistical areas and states. Conversely, although grouped income data are the only publicly available resource for studying incomes at lower geographic levels such as census tracts and blocks, researchers should be wary of estimating certain inequality measures from these data. The errors associated with tract-level Theil coefficient, standard deviation, and Atkinson index estimates produced in this analysis are too large for some analyses.

The large residuals of these estimates are even more concerning when one considers that census income data are sample data. Until 2000, these data were collected in the long-form portion of the decennial census, which is based on a sample covering approximately 18 percent of the U.S. population (Logan et al. 2018). Since then, income data have been collected in the ACS, which in its five-year form is based on a sample of about 5 percent of the population. Although the ACS provides error margins that can be used to construct 90 percent confidence intervals around the frequency estimates of each bin from the grouped income data, researchers have yet to develop methods for estimating income inequality that make use of these margins. The general approach has been to ignore them and work at a high enough geographic level that the error caused by sampling variation is negligible.¹⁸
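
As a small illustration of how the published margins relate to bin frequencies, the sketch below converts an ACS margin of error into a 90 percent confidence interval and an approximate standard error. It relies on the Census Bureau convention that ACS margins of error correspond to a 90 percent confidence level (critical value 1.645); the function name and the example counts are hypothetical.

```python
Z90 = 1.645  # critical value behind the ACS 90 percent margins of error

def bin_interval(estimate, moe):
    """Return (lower, upper, standard error) for one bin's household count."""
    lower = max(0.0, estimate - moe)  # a household count cannot be negative
    upper = estimate + moe
    se = moe / Z90
    return lower, upper, se

# Hypothetical tract-level bins: (estimated households, published margin of error).
bins = [(120, 45), (20, 30)]
intervals = [bin_interval(est, moe) for est, moe in bins]
```

In the second bin the margin exceeds the estimate, so the interval is truncated at zero; cases like this show how little information a single tract-level bin may carry.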

To improve on the method put forth in this article, researchers may want to consider using Bayesian methods to produce more reliable income inequality estimates for small areas. Empirical Bayes might be a viable method to supplement tract-level income data with information from neighboring regions. Note, however, that income data from the Census Bureau are already reweighted to incorporate demographic and other information from neighboring areas. Shrinking estimates from sparsely populated tracts toward the inequality levels of surrounding areas may therefore do little to improve the reliability of estimates based on these data. Nonetheless, studies have successfully used Bayesian methods to improve small area estimates using data from other countries’ national censuses (Assunção et al. 2005; Schmertmann and Gonzaga 2018). These methods may yield better estimators for small regions, particularly for unweighted income data.
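
The shrinkage idea can be sketched with a generic textbook normal-normal empirical Bayes estimator. This is not a method developed or tested in this article: the function, the method-of-moments variance estimate, and all input values are assumptions chosen only to show how noisier tract estimates would be pulled further toward a regional mean.

```python
def eb_shrink(estimates, variances):
    """Shrink each estimate toward the group mean (normal-normal empirical Bayes).

    estimates -- tract-level inequality estimates (e.g., Gini coefficients)
    variances -- assumed sampling variance of each estimate
    """
    n = len(estimates)
    grand_mean = sum(estimates) / n
    total_var = sum((y - grand_mean) ** 2 for y in estimates) / (n - 1)
    # Method-of-moments estimate of the between-tract variance, floored at zero.
    tau2 = max(0.0, total_var - sum(variances) / n)
    shrunk = []
    for y, v in zip(estimates, variances):
        # Weight on the group mean grows with the estimate's sampling variance.
        weight = v / (v + tau2) if (v + tau2) > 0 else 0.0
        shrunk.append(weight * grand_mean + (1.0 - weight) * y)
    return shrunk

# Hypothetical tract estimates; the third tract is small and therefore noisy.
gini_hat = [0.38, 0.42, 0.55, 0.45]
sampling_var = [0.0004, 0.0004, 0.0040, 0.0004]

gini_eb = eb_shrink(gini_hat, sampling_var)
```

In this example the noisy third tract moves about halfway toward the group mean, whereas the precisely estimated tracts barely change.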

Researchers should also evaluate the utility of Lorenz interpolation for estimating statistics from other kinds of grouped data. This study looked exclusively at income data from the census. Grouped data from this source are lower bound inclusive and upper bound exclusive. As a result, incomes that fall directly on an income boundary are assigned to the higher of the two bins separated by that boundary. This lowers the true bin means associated with the income groups. For data that apply different rules for handling incomes falling on the income boundaries, the improvement of Lorenz interpolation may be smaller. Researchers could determine this by comparing the performance of Lorenz interpolation, MCIB, and CDF interpolation using data from other national censuses.
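
The boundary rule described above can be illustrated with a short sketch. The bin boundaries and incomes are hypothetical, and incomes are assumed to be non-negative; the point is only to show how a lower bound inclusive, upper bound exclusive rule places boundary incomes at the bottom of the higher bin.

```python
from bisect import bisect_right

# Hypothetical lower bounds of four bins; the last bin is open-ended.
LOWER_BOUNDS = [0, 10_000, 15_000, 20_000]

def assign_bin(income):
    """Index of the bin [lower, next lower) containing a non-negative income."""
    return bisect_right(LOWER_BOUNDS, income) - 1

incomes = [9_999, 10_000, 14_999, 15_000, 15_000, 18_000]
bins = [assign_bin(x) for x in incomes]

# Mean of the incomes that land in the bin [15,000, 20,000).
in_bin_2 = [x for x, b in zip(incomes, bins) if b == 2]
mean_bin_2 = sum(in_bin_2) / len(in_bin_2)
midpoint_bin_2 = (15_000 + 20_000) / 2
```

Because both incomes of exactly 15,000 fall in the higher bin, that bin's mean sits well below its midpoint in this example, which is the direction of the effect noted above.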

Conclusion

In this article, I proposed a new method, Lorenz interpolation, for estimating income inequality from grouped income data. I showed that this method produces significantly more accurate and reliable estimates of income inequality than MCIB and CDF interpolation at the PUMA, school district, and tract levels. I also showed that Lorenz interpolation produces slightly better estimates of income quantiles and income shares. Finally, I provided some scope conditions for the use of Lorenz interpolation. Although Lorenz interpolation produced more accurate inequality estimates at the tract level, these estimates are insufficiently reliable. Lorenz interpolation yielded more reliable estimates at the school district level. Lorenz interpolation especially outperforms the other methods at estimating income inequality measures such as the Theil coefficient that are sensitive to the upper tail of the income distribution. As the contribution of this upper tail to income inequality continues to grow (Piketty and Goldhammer 2014), methods for estimating these measures will become increasingly important.

Acknowledgements

I am grateful for the helpful comments of Martin Ruef, Colin Birkhead, William Grider, and members of the Duke Economic Sociology Workshop.

Author Biography

Andrew Carr holds a PhD in sociology and a master’s degree in statistics from Duke University. His research centers on the social and economic implications of income inequality and economic segregation.

References

Assunção, Renato M., Carl P. Schmertmann, Joseph E. Potter, and Suzana M. Cavenaghi. 2005. “Empirical Bayes Estimation of Demographic Schedules for Small Areas.” Demography 42(3):537–58.

Atkinson, Anthony B. 1970. “On the Measurement of Inequality.” Journal of Economic Theory 2(3):244–63.

Chetty, Raj, David Grusky, Maximilian Hell, Nathaniel Hendren, Robert Manduca, and Jimmy Narang. 2017. “The Fading American Dream: Trends in Absolute Income Mobility since 1940.” Science 356(6336):398–406.

Cowell, Frank A., and Fatemeh Mehta. 1982. “The Estimation and Interpolation of Inequality Measures.” Review of Economic Studies 49(2):273.

Galster, George, and Patrick Sharkey. 2017. “Spatial Foundations of Inequality: A Conceptual Model and Empirical Overview.” Russell Sage Foundation Journal 3(2):1–33.

Gastwirth, Joseph L. 1971. “A General Definition of the Lorenz Curve.” Econometrica 39(6):1037.

Gastwirth, Joseph L., and Marcia Glauberman. 1976. “The Interpolation of the Lorenz Curve and Gini Index from Grouped Data.” Econometrica 51(3):49–51.

Gastwirth, Joseph L., Tapan K. Nayak, and Abba M. Krieger. 1986. “Large Sample Theory for the Bounds on the Gini and Related Indices of Inequality Estimated from Grouped Data.” Journal of Business and Economic Statistics 4(2):269–73.

Geverdt, Douglas E. 2019. “Education Demographic and Geographic Estimates Program (EDGE): School District Geographic Relationship Files User’s Manual.” Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Grusky, David B., and Alair MacLean. 2016. “The Social Fallout of a High-Inequality Regime.” Annals of the American Academy of Political and Social Science 663(1):33–52.

Hajargasht, Gholamreza, William E. Griffiths, Joseph Brice, D. S. Prasada Rao, and Duangkamon Chotikapanich. 2012. “Inference for Income Distributions Using Grouped Data.” Journal of Business and Economic Statistics 30(4):563–75.

Heitjan, Daniel F. 1989. “Inference from Grouped Continuous Data: A Review.” Statistical Science 4:164–79.

Henson, Mary F. 1967. “Trends in the Income of Families and Persons in the United States, 1947–1964.” Washington, DC: U.S. Department of Commerce, Bureau of the Census.

Hunter, David J., and McKalie Drown. 2016. “binsmooth: Generate PDFs and CDFs from Binned Data.” R Package Version 0.2.2. Retrieved March 15, 2022. https://cran.r-project.org/web/packages/binsmooth/.

Jargowsky, Paul A. 1996. “Take the Money and Run: Economic Segregation in U.S. Metropolitan Areas.” American Sociological Review 61(6):984–98.

Jargowsky, Paul A. 2019. “MCIB: Stata Module to Estimate Income Distribution and Inequality Statistics from Grouped Data.” Statistical Software Components S458660, Boston College, Department of Economics.

Jargowsky, Paul A., and Christopher A. Wheeler. 2018. “Estimating Income Statistics from Grouped Data: Mean-Constrained Integration over Brackets.” Sociological Methodology 48(1):337–74.

Kakamu, Kazuhiko. 2016. “Simulation Studies Comparing Dagum and Singh–Maddala Income Distributions.” Computational Economics 48(4):593–605.

Kakwani, Nanak. 1976. “On the Estimation of Income Inequality Measures from Grouped Observations.” Review of Economic Studies 43(3):483–92.

Logan, John R., Andrew Foster, Jun Ke, and Fan Li. 2018. “The Uptick in Income Segregation: Real Trend or Random Sampling Variation?” American Journal of Sociology 124(1):185–222.

McDonald, James B., and Yexiao J. Xu. 1995. “A Generalization of the Beta Distribution with Applications.” Journal of Econometrics 66(1–2):133–52.

Minoiu, Camelia, and Sanjay G. Reddy. 2012. “Kernel Density Estimation on Grouped Data: The Case of Poverty Assessment.” Journal of Economic Inequality 12(2):163–89.

Nielsen, Francois, and Arthur S. Alderson. 1997. “The Kuznets Curve and the Great U-Turn: Income Inequality in U.S. Counties, 1970 to 1990.” American Sociological Review 62(1):12–33.

Owens, Ann. 2016. “Inequality in Children’s Contexts: Income Segregation of Households with and without Children.” American Sociological Review 81(3):549–74.

Owens, Ann. 2019. “Building Inequality: Housing Segregation and Income Segregation.” Sociological Science 6:497–525.

Pickett, Kate E., Shona Kelly, Eric Brunner, Tim Lobstein, and Richard G. Wilkinson. 2005. “Wider Income Gaps, Wider Waistbands? An Ecological Study of Obesity and Income Inequality.” Journal of Epidemiology and Community Health 59(8):670–74.

Pickett, Kate E., and Richard G. Wilkinson. 2007. “Child Wellbeing and Income Inequality in Rich Societies: Ecological Cross Sectional Study.” British Medical Journal 335(7629):1080–85.

Piketty, Thomas, and Arthur Goldhammer. 2014. Capital in the Twenty-First Century. Cambridge, MA: Belknap.

Quandt, Richard E. 1966. “Old and New Methods of Estimation and the Pareto Distribution.” Metrika 10(1):55–82.

Reardon, Sean F., and Kendra Bischoff. 2011. “Income Inequality and Income Segregation.” American Journal of Sociology 116(4):1092–1153.

Ruggles, Steven, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler, and Matthew Sobek. 2021. “IPUMS USA: Version 11.0.” Minneapolis, MN: IPUMS.

Schmertmann, Carl P., and Marcos R. Gonzaga. 2018. “Bayesian Estimation of Age-Specific Mortality and Life Expectancy for Small Areas with Defective Vital Records.” Demography 55:1363–88.

Tillé, Yves, and Matti Langel. 2012. “Histogram-Based Interpolation of the Lorenz Curve and Gini Index for Grouped Data.” American Statistician 66(4):225–31.

U.S. Census Bureau. 2015. “American Community Survey and Puerto Rico Community Survey 2015 Subject Definitions.” Retrieved March 15, 2022. https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2015_ACSSubjectDefinitions.pdf.

von Hippel, Paul T., David J. Hunter, and McKalie Drown. 2017. “Better Estimates from Binned Income Data: Interpolated CDFs and Mean-Matching.” Sociological Science 4:641–55.

von Hippel, Paul T., Samuel V. Scarpino, and Igor Holas. 2016. “Robust Estimation of Inequality from Binned Incomes.” Sociological Methodology 46(1):212–52.