Including non-binary gender in the calibration strategy for the Canadian long-form sample survey weights

Abstract

Due to global events impacting social and economic landscapes, the spotlight on inequalities endured by marginalized and vulnerable groups has intensified, necessitating action from policymakers to create a more equitable future for all. It is essential that National Statistics Offices (NSOs) provide detailed statistical data that highlight the experiences of these marginalized groups to ensure that fairness and inclusion are key components of evidence-based policy. Aligning with these principles, in 2021 Canada became the first country to collect and disseminate data on gender diversity in a national census, giving Canadians the option to select male, female, or non-binary. Because of their small size, non-binary population totals were not used in the 2021 Census long-form sample calibration, owing to the risk of increasing the variance of estimates. This paper presents an alternative long-form calibration strategy which allows for small populations, such as non-binary individuals, to be incorporated while mitigating methodological concerns. The strategy put forward can incorporate multiple small populations simultaneously while also being adaptable to the calibration systems of other NSOs. The results of a Monte Carlo simulation are presented showing improved data quality for the non-binary population under the alternative calibration strategy.

Keywords

linear calibration, disaggregated data, Canadian census of population, decomposed optimization

1 Introduction

A key pillar of the 2030 United Nations Sustainable Development Goals (UNSDG) is the principle of leaving no one behind. In recent years, UN Member-States have recognized that growing inequalities amongst marginalized groups are international challenges which must be addressed.¹ In response, Member-States have committed to reducing the inequalities faced by vulnerable groups in their country. In fact, the United Nations Statistical Commission (UNSC) developed a framework highlighting that “Sustainable Development Goal indicators should be disaggregated, where relevant, by income, sex, age, race, ethnicity, migratory status, disability and geographic location or other characteristics, in accordance with the Fundamental Principles of Official Statistics”.² To ensure that fairness and inclusion are a key part of evidence-based policy, it is essential that National Statistics Offices (NSOs) provide detailed statistical data that highlights the experiences of marginalized groups.

Aligning with these principles, in 2021 Canada became the first country to collect and disseminate data based on gender diversity in a national census.³ The Canadian census of population program provides a detailed statistical portrait of the Canadian population, playing an important role in data-driven policymaking. Conducted every 5 years, 2021 being the most recent, this program consists of 2 parts: a census which enumerates the entire population and collects basic demographic information and the long-form (LF) sample survey, which collects more detailed social and economic data on approximately 25% of the population. In 2021 the census portion included a new question measuring gender using the cisgender, transgender and non-binary categories. Cisgender refers to individuals who have reported that their sex assigned at birth is the same as their current gender whereas transgender refers to individuals whose sex assigned at birth does not align with their gender. Non-binary refers to individuals whose gender is not exclusively “man” or “woman”.³ The addition of this question created an avenue to publish data by both sex and gender crossed with intersectionality variables to better understand the inequalities faced by gender minorities.

As NSOs strive to fill data gaps with disaggregated data, it raises the question of sound methods to produce high quality disaggregated estimates.⁴ Calibration, introduced by Deville & Särndal,⁵ is a weighting procedure commonly used by survey practitioners to improve the quality of estimates derived from survey data. Its goals are to increase the precision of estimates, address unit non-response, correct coverage errors and force estimates to match known control totals. To perform calibration, NSOs leverage a rich set of auxiliary information obtained from administrative sources, censuses and other surveys. However, when disaggregating data on marginalized populations, resulting cells often have small population sizes especially at lower levels of geography. Calibrating to small population totals can be problematic since some samples may have pre-calibrated sample totals which are comparatively far from the controls. In this scenario, extreme calibration weights – which are known to increase the variance of estimates – may be required to achieve coherence with the controls. Because of this, census counts of non-binary persons were not used in the 2021 LF sample calibration. Instead, a 2-category gender variable with categories men + and women + was used.

This paper proposes an alternative calibration strategy to incorporate small populations into the census LF calibration process while mitigating the methodological concerns. The application of the proposed strategy to the non-binary population is demonstrated. However, the solution can be extended to incorporate multiple small populations simultaneously while being adaptable to the calibration systems of other NSOs. The remainder of the paper is organized as follows. Section 2 describes the usual LF calibration procedure. Section 3 presents the proposed alternative LF calibration strategy. Section 4 describes the results of a Monte Carlo (MC) simulation which studied the impact on data quality of calibration to non-binary population counts. The main conclusions are summarized in Section 5.

2 Long-form estimation strategy

The LF sample is proportionately distributed across the country to ensure a high reliability of estimates for all areas. To improve the efficiency of field operations, Canada is partitioned into 49,217 geographical areas called collection units (CUs). In each CU, a 25% systematic sample of dwellings is drawn to constitute the LF sample. An initial set of design weights are constructed at the household-level as the inverse of the sampling fraction. Then, these design weights undergo a series of adjustments to address coverage and unit non-response errors. After this, a final calibration procedure is applied which is the focus of this paper. After the final calibration procedure, all persons within a household receive a person-level weight equal to their final household weight.

During LF calibration, household weights are adjusted to reduce sampling variance and ensure numerical consistency of estimates with known census counts. Calibration is carried out independently and in parallel within Super Aggregate Dissemination Areas (SADAs). These are standard geographic areas created for the LF weighting process which partition the country. SADAs are contiguous areas respecting provincial/territorial boundaries whose sizes are between 50,000 and 150,000 persons. Also used in the weighting process are Aggregate Dissemination Areas (ADAs), which are partitions of SADAs whose sizes are between 5,000 and 15,000 persons. Figure 1 illustrates the hierarchical relationship between provinces/territories, SADAs and ADAs.

A selection process is carried out within each area to determine the set of population characteristics, called calibration constraints, to calibrate on. One set of constraints is selected for each SADA and each of the ADAs comprising each SADA. Due to the hierarchical nature of SADAs and ADAs, ADA-level constraints from a mathematical standpoint are analogous to SADA constraints. For the remainder of this paper, SADA-level constraints will refer to both SADA and ADA constraints. Constraint selection is an iterative procedure. The first step consists of defining a set of candidate constraints which is the same for each SADA. Then, constraints with a population count less than 200 are removed to protect against inflating the variance of estimates. Lastly, constraints are removed based on collinearity and explanatory redundancy. Once constraints are selected, calibrated weights for households in each SADA are obtained by solving the linear calibration problem stated by Deville & Särndal.⁵ This involves minimizing a Chi-squared type distance function subject to the constraints:

\begin{aligned} \text{minimize} \quad & \sum_{k \in s_{i}} \frac{(w_{k} - d_{k})^{2}}{2 d_{k}} \\ \text{subject to} \quad & \sum_{k \in s_{i}} w_{k} x_{k}^{(i)} = t_{x^{(i)}} \quad \text{and} \quad 1 \leq w_{k} \leq 20, \end{aligned}

(1)

where $s_{i}$ is the set of households undergoing calibration in SADA $i$ and, for household $k$, $w_{k}$ is the calibrated weight, $d_{k}$ is the pre-calibrated weight, $x_{k}^{(i)}$ is the vector of auxiliary variables corresponding to the selected constraints and $t_{x^{(i)}}$ is the vector of census counts for $x_{k}^{(i)}$. To add further protection against extreme calibration weights, each calibrated weight is constrained between $1$ and $20$, referred to as a range constraint. The resulting calibration weight can be written as the pre-calibrated weight times an adjustment factor relying on $x_{k}^{(i)}$ and estimated model parameters $0 < \hat{\Phi}_{k} \leq 1$ and $\hat{\lambda}_{k}$:

\begin{aligned} w_{k} = d_{k} (1 + \hat{\Phi}_{k} \hat{\lambda}_{k}^{T} x_{k}^{(i)}) . \end{aligned}

(2)

The algorithm described in⁶ can be used to estimate the parameters $\Phi_{k}$ and $\lambda_{k}$. Here $\Phi_{k}$ adjusts the weight to meet the range constraint and $\lambda_{k}$ measures sample imbalance. The LF calibration problem is solved using the Generalized Estimation System (G-EST), an internal tool developed by Statistics Canada.
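As an illustration of equations (1) and (2), the following is a minimal Python sketch of linear calibration when the range constraint is ignored, so that $\hat{\Phi}_{k} = 1$ and $\hat{\lambda}$ has the closed form given by Deville & Särndal.⁵ It is not the G-EST implementation. The function name `linear_calibration`, the toy households and the control totals are all assumptions made up for the example.

```python
import numpy as np

def linear_calibration(d, X, t):
    """Chi-squared-distance calibration WITHOUT the range constraint.

    d : (n,) pre-calibrated weights
    X : (n, p) auxiliary variables, one row per household
    t : (p,) known control totals
    Returns w_k = d_k * (1 + lambda^T x_k).
    """
    A = (X * d[:, None]).T @ X           # sum_k d_k x_k x_k^T
    t_hat = X.T @ d                      # totals estimated with pre-calibrated weights
    lam = np.linalg.solve(A, t - t_hat)  # sample imbalance parameter
    return d * (1.0 + X @ lam)

# Toy data (hypothetical): 8 households, each with pre-calibrated weight 4.
# Auxiliary variables: [1 (household count), number of persons in the household].
d = np.full(8, 4.0)
X = np.column_stack([np.ones(8), [1, 2, 2, 3, 1, 4, 2, 5]]).astype(float)
t = np.array([34.0, 86.0])               # hypothetical census counts

w = linear_calibration(d, X, t)
calibrated_totals = X.T @ w              # should reproduce t
in_range = bool(np.all((w >= 1) & (w <= 20)))
print(np.round(w, 4), calibrated_totals, in_range)
```

In this toy case all weights already fall inside $[1, 20]$, so the range constraint would not bind. When it does bind, an iterative algorithm such as the one referenced in⁶ is needed instead of the closed form.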

3 Calibrating long-form weights to non-binary population totals

In 2021, had non-binary population counts been evaluated by the constraint selection process, they would have been filtered out in most areas because of their small population sizes. Instead of calibrating non-binary persons to SADA and ADA population totals, the proposed strategy calibrates weights to provincial/territorial totals. This mitigates the risk of increasing the variance because provinces/territories are aggregates of SADAs and therefore have larger population sizes. For brevity, provinces and territories will be referred to as “provinces” going forward. The proposed strategy maintains the usual SADA-level constraints alongside the newly introduced provincial-level constraint, which shifts the calibration problem from the SADA to the provincial level. In other words, within a province, household weights are calibrated to the provincial and SADA constraints simultaneously. The resulting calibration will be referred to as the provincial problem, as stated in section 3.1. The provincial problem is solvable using standard optimization methods; however, for many provinces, the increased size of the calibration due to provincial-level processing requires computing power and resources which exceed those available in the LF calibration environment. For example, the province of Ontario, the largest in Canada, comprises 150 SADAs. Instead of solving 150 SADA problems in parallel, a single problem roughly 150 times larger must be solved. This issue is addressed in section 3.2, which presents the novel contribution of this paper: a decomposition scheme to partition the large provincial problem back into independent SADA problems of a manageable size.

The hierarchical relationship between provinces and SADAs allows for the aggregation of small population totals at the SADA level into a larger provincial constraint, facilitating the use of small populations, albeit at a higher geographical level. Although the relationship between a province and its SADAs is specific to the Canadian census, many NSOs maintain similar geographical hierarchies to facilitate sampling and estimation procedures, and these hierarchies can be leveraged in the same way. For example, the US Census Bureau maintains a hierarchical relationship between states, counties and census tracts.⁷

3.1 The provincial non-binary calibration problem

Let $s^{(p)}$ be the set of households undergoing calibration within a given province and $I$ be the number of SADAs in the province. The calibrated weights are the solution to the following optimization problem:

\begin{aligned} \text{minimize} \quad & \sum_{k \in s^{(p)}} \frac{(w_{k} - d_{k})^{2}}{2 d_{k}} \\ \text{subject to} \quad & \sum_{k \in s^{(p)}} w_{k} \tilde{x}_{k} = t_{\tilde{x}} \quad \text{and} \quad 1 \leq w_{k} \leq 20. \end{aligned}

(3)

For household $k$ in SADA $i$, $\tilde{x}_{k} = [x_{k *}, 0, \dots, 0, {x_{k}^{(i)}}^{T}, 0, \dots, 0]^{T}$ is a vector containing the non-binary status variable in the first entry and the vector of SADA-level auxiliary variables in the $(i + 1)^{\text{th}}$ entry. The total vector $t_{\tilde{x}} = [t_{*}, {t_{x^{(1)}}}^{T}, {t_{x^{(2)}}}^{T}, \dots, {t_{x^{(I)}}}^{T}]^{T}$ contains the provincial non-binary person total in the first entry and the SADA-level total vectors for each SADA in the province.
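The block structure of $\tilde{x}_{k}$ can be made concrete with a short Python sketch. It assumes, for simplicity, that every SADA uses the same number of SADA-level constraints, which need not hold in practice since constraints are selected separately in each area. The function name `stack_auxiliaries` and the four toy households are hypothetical.

```python
import numpy as np

def stack_auxiliaries(x_star, x_sada, sada, n_sada):
    """Build one row of x-tilde per household: [x_star, 0, ..., x^(i), ..., 0].

    x_star : (n,) non-binary status variable for each household
    x_sada : (n, p) SADA-level auxiliary variables
    sada   : (n,) zero-based SADA index of each household
    n_sada : number of SADAs in the province (I in the text)
    """
    n, p = x_sada.shape
    X = np.zeros((n, 1 + n_sada * p))
    X[:, 0] = x_star                                    # shared provincial entry
    for i in range(n_sada):
        rows = sada == i
        X[rows, 1 + i * p : 1 + (i + 1) * p] = x_sada[rows]  # block for SADA i
    return X

# Toy example (hypothetical): 4 households in 2 SADAs, 2 constraints per SADA.
x_star = np.array([0.0, 1.0, 0.0, 2.0])
x_sada = np.array([[1.0, 3.0], [1.0, 2.0], [1.0, 4.0], [1.0, 1.0]])
sada = np.array([0, 0, 1, 1])

X_tilde = stack_auxiliaries(x_star, x_sada, sada, n_sada=2)
print(X_tilde)
```

Only the first column is shared across SADAs. This single shared column is what couples the otherwise independent SADA problems into one provincial problem.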

3.2 Decomposing the provincial calibration problem

The decomposition scheme relies on first re-writing the provincial non-binary person constraint as a sum of SADA-level constraints:

\begin{aligned} \sum_{k \in s^{(p)}} w_{k} x_{k *} - t_{*} = \sum_{i = 1}^{I} \left( \sum_{k \in s_{i}} w_{k} x_{k *} - t_{i *} \right) = 0. \end{aligned}

(4)

The quantities $t_{i *}, i = 1, \dots, I$ are referred to as artificial SADA totals of non-binary people, or artificial totals for short. For equation (4) to be satisfied, the artificial totals must sum to the provincial total:

\begin{aligned} \sum_{i = 1}^{I} t_{i *} = t_{*} . \end{aligned}

(5)

The term “artificial totals” is used because the $t_{i *}$ are not meant to represent the true SADA-level totals. Rather, the artificial totals are a tool to partition the provincial constraint into independent SADA-level constraints, allowing for the decomposition of the calibration. In each SADA, the following constraint is introduced, which calibrates the estimated number of non-binary persons to the artificial SADA total:

\begin{matrix} \sum_{k \in s_{i}} w_{k} x_{k *} = t_{i *} \end{matrix}

(6)

The provincial problem is then decomposed into SADA-level problems by augmenting the SADA problem auxiliary and total vectors with the non-binary status variable and artificial total respectively. The resulting decomposed calibration problem for SADA i is:

\begin{aligned} \text{minimize} \quad & \sum_{k \in s_{i}} \frac{(w_{k} - d_{k})^{2}}{2 d_{k}} \\ \text{subject to} \quad & \sum_{k \in s_{i}} w_{k} x_{k *}^{(i)} = t_{*}^{(i)} \quad \text{and} \quad 1 \leq w_{k} \leq 20, \end{aligned}

(7)

where $x_{k *}^{(i)} = [x_{k *}, {x_{k}^{(i)}}^{T}]^{T}$ and $t_{*}^{(i)} = [t_{i *}, {t_{x^{(i)}}}^{T}]^{T}$.

If equation (5) is respected, then calibration to the artificial totals will ensure the provincial constraint is satisfied. However, for an arbitrary choice of $t_{i *}, i = 1, \dots, I$ satisfying (5), it is not guaranteed that the calibration weights obtained from the decomposed problems will be equal to the weights that would have been obtained from solving the provincial problem. In fact, our studies have shown that choices of $t_{i *}$ satisfying (5) which are far from the uncalibrated SADA-level estimated totals can lead to extreme calibration weights. In the following section, the artificial totals are mathematically derived to ensure the calibration weights obtained from the decomposed SADA problems (7) are equal to the calibration weights that would have been obtained from solving the large provincial problem (3).

Figure 1. Hierarchical representation of the standard geographies used in the Canadian census calibration process.

Figure 2. MCRB of point estimates: with non-binary calibration plotted against without non-binary calibration at the SADA-level by estimate type.

Figure 3. MCCV of point estimates: with non-binary calibration plotted against without non-binary calibration at the SADA-level by estimate type.

Figure 4. MCRB of estimated SEs: with non-binary calibration plotted against without non-binary calibration at the SADA-level by estimate type.

Figure 5. MCCV of estimated SEs: with non-binary calibration plotted against without non-binary calibration at the SADA-level by estimate type.

3.3 Deriving the artificial SADA-level totals of non-binary persons

The derivation is carried out independently in each SADA by setting the calibration weights obtained from the SADA and provincial problems equal and then solving for $t_{*}^{(i)}$, which contains the artificial total in the first entry. Note that finding $t_{i *}$ in this manner implicitly satisfies (5). To facilitate the derivation, the SADA problem auxiliary and total vectors are re-written such that their dimension matches their provincial problem counterparts. The vectors for unit $k$ in SADA $i$ are re-written as $\tilde{x}_{k} = [x_{k *}, 0, \dots, 0, {x_{k}^{(i)}}^{T}, 0, \dots, 0]^{T}$ and $t_{*}^{(i)} = [t_{i *}, 0, \dots, 0, {t_{x^{(i)}}}^{T}, 0, \dots, 0]^{T}$, where the first entry contains the artificial non-binary persons constraint and the SADA-level constraints are in the $(i + 1)^{\text{th}}$ entry. The remaining elements are 0. This corresponds to partitioning the provincial problem constraints into their SADA-specific components while using the artificial total in place of the provincial version. Note that the resulting constraint $\sum_{k \in s_{i}} w_{k} \tilde{x}_{k} = t_{*}^{(i)}$ is equivalent to the constraints in (7) but with 0s in place of the elements belonging to SADA $j$, $j \neq i$. Using equation (2), the formulas for the calibration weights are set equal to one another. Components belonging to the provincial problem are denoted using the superscript $p$ and components from the SADA-level problem using $s$. For household $k$ in SADA $i$ the following equality is obtained:

\begin{aligned} & w_{i, k}^{(p)} = w_{i, k}^{(s)} \\ \Leftrightarrow\ & d_{k} (1 + \hat{\Phi}_{i, k}^{(p)} \hat{\lambda}_{i, k}^{(p) T} \tilde{x}_{k}) = d_{k} (1 + \hat{\Phi}_{i, k}^{(s)} \hat{\lambda}_{i, k}^{(s) T} \tilde{x}_{k}) \\ \Leftrightarrow\ & \hat{\Phi}_{i, k}^{(p)} \hat{\lambda}_{i, k}^{(p)} = \hat{\Phi}_{i, k}^{(s)} \hat{\lambda}_{i, k}^{(s)} . \end{aligned}

(8)

Appendix A shows that the estimated parameters $\hat{\Phi}_{i, k}$ and $\hat{\lambda}_{i, k}$ for the SADA and provincial problems will be equal when the artificial totals ensure the calibrated weights from the problems without range constraints are equal. Let $w_{i, k}^{- (p)}$ and $w_{i, k}^{- (s)}$ be the calibrated weights for household $k$ in SADA $i$ obtained from the provincial and SADA-level problems without range constraints. Using the solution from⁵, the following equality is obtained:

\begin{aligned} & w_{i, k}^{- (p)} = w_{i, k}^{- (s)} \\ \Leftrightarrow\ & d_{k} (1 + \hat{\lambda}^{(p) T} \tilde{x}_{k}) = d_{k} (1 + \hat{\lambda}^{(s) T} \tilde{x}_{k}) \\ \Leftrightarrow\ & \hat{\lambda}^{(p)} = \hat{\lambda}^{(s)} \\ \Leftrightarrow\ & \left[ \sum_{k \in s^{(p)}} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right]^{- 1} (t_{\tilde{x}} - \hat{t}_{\tilde{x}}) = \left[ \sum_{k \in s_{i}} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right]^{- 1} (t_{*}^{(i)} - \hat{t}_{*}^{(i)}), \end{aligned}

(9)

where $\hat{t}_{\tilde{x}}$ and $\hat{t}_{*}^{(i)}$ are total vectors estimated using the pre-calibrated weights. Solving the above for $t_{*}^{(i)}$ yields

\begin{aligned} t_{*}^{(i)} = \left[ \sum_{k \in s_{i}} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right] \left[ \sum_{k \in s^{(p)}} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right]^{- 1} (t_{\tilde{x}} - \hat{t}_{\tilde{x}}) + \hat{t}_{*}^{(i)}, \end{aligned}

(10)

which contains the artificial total in the first entry. The artificial total in SADA i is a function of the auxiliary variables in the province, the provincial problem total vector and the pre-calibrated estimated totals for the provincial and SADA i problems.
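The derivation can be checked numerically. The Python sketch below ignores the range constraints, so it covers only the closed-form case behind equation (9), and it runs on synthetic data: three SADAs, two SADA-level constraints each, and made-up control totals. All names and values are assumptions for illustration. It evaluates equation (10) through $\hat{\lambda}^{(p)}$, solves each augmented SADA problem (7), and compares the weights with those from the provincial problem (3).

```python
import numpy as np

def solve_lambda(d, X, t):
    """lambda-hat = [sum d_k x_k x_k^T]^{-1} (t - t_hat), no range constraint."""
    A = (X * d[:, None]).T @ X
    return np.linalg.solve(A, t - X.T @ d)

rng = np.random.default_rng(2021)
I, n, p = 3, 300, 2                      # SADAs, households per SADA, constraints per SADA
N = I * n
sada = np.repeat(np.arange(I), n)
d = np.full(N, 4.0)                      # pre-calibrated weights (25% sample)

# SADA-level auxiliaries: household indicator and household size (synthetic).
x_sada = np.column_stack([np.ones(N), rng.integers(1, 6, size=N)]).astype(float)
# Non-binary status: 10 households per SADA flagged, so every SADA has some.
x_star = np.zeros(N)
for i in range(I):
    x_star[rng.choice(np.flatnonzero(sada == i), size=10, replace=False)] = 1.0

# Provincial auxiliary matrix x-tilde and hypothetical control totals.
X_tilde = np.zeros((N, 1 + I * p))
X_tilde[:, 0] = x_star
for i in range(I):
    X_tilde[sada == i, 1 + i * p : 1 + (i + 1) * p] = x_sada[sada == i]
t_tilde = (X_tilde.T @ d) * (1.0 + rng.uniform(-0.03, 0.03, size=1 + I * p))

# Provincial problem (3) without range constraints.
lam_p = solve_lambda(d, X_tilde, t_tilde)
w_prov = d * (1.0 + X_tilde @ lam_p)

artificial = np.zeros(I)
max_diff = 0.0
for i in range(I):
    rows = sada == i
    Xi, di = X_tilde[rows], d[rows]
    # Equation (10): [sum_{s_i} d x x^T] lam_p + t_hat^(i); first entry is t_{i*}.
    t_i = (Xi * di[:, None]).T @ Xi @ lam_p + Xi.T @ di
    artificial[i] = t_i[0]
    # Decomposed SADA problem (7) with the augmented vectors.
    X_aug = np.column_stack([x_star[rows], x_sada[rows]])
    t_aug = np.concatenate([[artificial[i]], t_tilde[1 + i * p : 1 + (i + 1) * p]])
    w_sada = di * (1.0 + X_aug @ solve_lambda(di, X_aug, t_aug))
    max_diff = max(max_diff, float(np.abs(w_sada - w_prov[rows]).max()))

print(artificial, artificial.sum(), t_tilde[0], max_diff)
```

In this unconstrained setting, each $t_{*}^{(i)}$ from equation (10) equals the SADA $i$ totals computed with the provincial calibrated weights. This is why the artificial totals sum to $t_{*}$, as required by (5), and why the decomposed weights match the provincial ones up to floating-point error.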

3.4 Procedure to calibrate LF weights to non-binary population counts

The alternative calibration strategy is characterized by the following steps.

1. Carry out the constraint selection process.

2. Construct the vectors $\tilde{x}_{k}$, $t_{\tilde{x}}$, $\hat{t}_{\tilde{x}}$ and $\hat{t}_{*}^{(i)}, i = 1, \dots, I$ needed for equation (10).

3. Compute $t_{*}^{(i)}, i = 1, \dots, I$ and extract $t_{i *}$ from the first entry.

4. For each SADA, augment the auxiliary and total vectors with the non-binary status variable and artificial total respectively.

5. Use G-EST to solve the decomposed SADA calibration problems in parallel.

6. Repeat steps 2–5 for each replicate weight used for variance estimation.

Steps 2–4 constitute the changes made to the usual calibration procedure and consist of some data pre-processing and a matrix algebra step. Step 5 involves solving for the calibrated weights and is the most time-consuming step. This step remains unchanged other than the addition of the artificial non-binary constraint, highlighting the smooth integration of the solution into the existing calibration procedure.

The alternative calibration procedure can be extended to use multiple small population constraints aggregated at the provincial level. In this scenario, the scalar non-binary status variable $x_{k *}$ is replaced by a vector of small-population constraint variables $\mathbf{x}_{k *}$, and the scalar provincial total $t_{*}$ is replaced by a provincial total vector $\mathbf{t}_{*}$. Equation (10) remains unchanged and yields a vector of artificial totals $\mathbf{t}_{i *}, i = 1, \dots, I$.

4 Monte Carlo simulation

A Monte Carlo (MC) simulation was carried out to study the impact on data quality of calibrating LF weights to non-binary population totals using the proposed method. The study focused on data quality in the non-binary person domain since the proposed calibration procedure will mostly affect the weights of these units. The impact outside the non-binary person domain was also studied to ensure the method did not interfere with the high degree of data quality produced from the usual LF calibration. These results are not presented; however, only minimal impacts were observed.

4.1 Setup

The simulation study was based on a pseudo-province created using the responses to the 2021 LF sample obtained in the province of British Columbia. The construction of the pseudo-province involved the creation of pseudo-SADA, pseudo-ADA and pseudo-DA areas which were made to be representative of true SADAs, ADAs and DAs in the Canadian population. DAs are partitions of ADAs and are close in size and concept to CUs. For efficiency reasons, pseudo-DAs were created instead of pseudo-CUs. The pseudo-province contained five pseudo-SADAs totaling 472,034 people in 190,909 households, making it roughly the size of the province of Newfoundland and Labrador. A total of 500 MC samples were drawn from the pseudo-province using a stratified simple random sample of households without replacement (SRSWOR). The sample was stratified by pseudo-DA and a sampling fraction of 25% was used. This sample design is very close to the design of the LF sample; however, non-response was not simulated since the goal of the final calibration step is to reduce variance due to sampling and ensure consistency with census counts. In each MC sample, two sets of calibrated weights were produced. One set was constructed by carrying out the usual LF calibration process. A second set of weights was produced by calibrating to the set of constraints selected by the usual procedure plus an artificial non-binary SADA constraint derived for each sample.
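The sampling step of one MC replicate can be sketched as follows. This is a minimal Python illustration of stratified SRSWOR with a 25% sampling fraction, not the code used in the study. The function name `stratified_srswor`, the rounding rule for the stratum sample size and the five stratum sizes are assumptions.

```python
import numpy as np

def stratified_srswor(strata, fraction, rng):
    """Draw an SRSWOR of households within each stratum.

    strata   : (N,) stratum label (e.g. pseudo-DA) of every household
    fraction : sampling fraction applied in each stratum
    Returns the sampled household indices and design weights N_h / n_h.
    """
    sample, weights = [], []
    for h in np.unique(strata):
        units = np.flatnonzero(strata == h)
        n_h = max(1, int(round(fraction * units.size)))  # assumed rounding rule
        sample.append(rng.choice(units, size=n_h, replace=False))
        weights.append(np.full(n_h, units.size / n_h))
    return np.concatenate(sample), np.concatenate(weights)

rng = np.random.default_rng(1)
# Hypothetical frame: 5 strata containing 40, 60, 80, 100 and 120 households.
strata = np.repeat(np.arange(5), [40, 60, 80, 100, 120])
idx, w = stratified_srswor(strata, 0.25, rng)
print(idx.size, w.sum())
```

With these stratum sizes every sampled household receives a design weight of 4, the inverse of the sampling fraction, and the weights sum to the 400 households on the frame.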

Each MC sample also underwent the creation of replicate weights corresponding to the Partially Balanced Repeated Replication-$\varepsilon$ variance estimation methodology, which is used to estimate the variance of LF sample estimates.⁸ Two sets of replicate weights were formed using 32 replicates and $\varepsilon = \sqrt{0.375}$. The replicate weights underwent the same calibrations as the main survey weight, which provided a set of replicate weights calibrated with and without non-binary gender. This allowed for the impact of non-binary calibration on both the point estimates and their estimated standard error (SE) to be studied.

4.1.1 Analysis

Various statistics of interest were computed using both sets of calibrated weights. The properties of counts were the focus of the study as these make up most of the statistics published from the LF sample. However, some means were included in the analysis. The following statistics were studied:

Count of people.

Count of marital status (6 categories).

Count of highest education level attained (6 categories).

Mean income.

Mean age (results not included).

The above statistics and their estimated SEs were computed for the non-binary person-level domain at the pseudo-province, pseudo-SADA and pseudo-ADA levels. To study the statistical properties of the point and SE estimates, the Monte Carlo Relative Bias (MCRB) and Monte Carlo Coefficient of Variation (MCCV) were used with results expressed as a percentage.
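The two summary measures can be written down as a short Python sketch. The paper does not give explicit formulas, so the definitions below are assumptions: the MCRB is taken as the mean of the estimates across MC samples minus the reference value, divided by the reference value, and the MCCV as the standard deviation of the estimates across MC samples divided by their mean, both in percent. For SE estimates, the reference value would be the MC standard deviation of the corresponding point estimates. The five estimates are made-up numbers.

```python
import numpy as np

def mcrb(estimates, reference):
    """Monte Carlo relative bias in percent (assumed definition)."""
    return 100.0 * (np.mean(estimates) - reference) / reference

def mccv(estimates):
    """Monte Carlo coefficient of variation in percent (assumed definition)."""
    return 100.0 * np.std(estimates, ddof=1) / np.mean(estimates)

# Hypothetical estimates of one count across five MC samples; true count is 100.
estimates = np.array([98.0, 102.0, 100.0, 104.0, 96.0])
rb = mcrb(estimates, 100.0)
cv = mccv(estimates)
print(rb, round(cv, 4))
```

Under these definitions, a smaller MCCV for the same statistic indicates estimates that vary less from sample to sample.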

4.2 Results

Figures 2–5 display the MCRB and MCCV for the point and SE estimates. Only results at the pseudo-SADA level are presented; however, similar results were observed at the pseudo-province and pseudo-ADA levels. Categories of marital status and education with a population count less than 30 were excluded from the analysis. The x-axis of the plots corresponds to the alternative calibration strategy (with non-binary) and the y-axis corresponds to the usual LF calibration (without non-binary). A reference line corresponding to $y = x$ was overlaid to assist the reader. The figures are summarized as follows:

Both calibration procedures produced approximately unbiased point estimates.

The alternative calibration procedure produced smaller MCCVs for counts and their estimated SEs. MCCVs of the mean (and its SE) were comparable between procedures.

$\circ$ Almost all points are above the reference line, meaning the MCCVs associated with non-binary calibration are smaller than those without non-binary calibration.

Both calibration procedures produce estimated SEs which have a small (negative) MCRB. The two procedures are comparable.

5 Conclusion

This paper presented an alternative LF calibration strategy which allows for small populations, such as the non-binary group, to be incorporated while mitigating the methodological concerns. The proposed strategy derives artificial SADA totals to facilitate the calibration of LF weights to provincial counts of non-binary persons, which adds protection against variance inflation. The artificial totals allow for the decomposition of the resulting large provincial-level problem back into computationally feasible SADA-level problems, resulting in minimal changes to the current calibration system. The use of standardized geographies in official statistics allows the proposed calibration strategy to be adapted to large-scale establishment surveys outside the context of the Canadian LF sample. NSOs can leverage hierarchical relationships between geographies to aggregate small population constraints to more highly populated levels while implementing artificial totals, under the Chi-squared distance function, to decompose the problem back to the original operational geography. The proposed calibration strategy continues to be evaluated for implementation in the 2026 LF sample and has received interest from the Agency's Demographic Microsimulation project.

The simulation study demonstrated the positive impact that non-binary calibration had on data quality for the non-binary person domain. Non-binary calibration resulted in increased precision of point and SE estimates for characteristics of the non-binary person domain compared to calibration without the non-binary constraint. We have shown that including such sub-populations in calibration can improve data quality for marginalized and vulnerable groups. This can directly aid decision makers when drafting data-driven policy to address known inequalities and create a more equitable future for all.

Footnotes

ORCID iD

Alexander Imbrogno

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix

A) Showing $\{\hat{\Phi}_{i, k}^{(p)}, \hat{\lambda}_{i, k}^{(p)}\} = \{\hat{\Phi}_{i, k}^{(s)}, \hat{\lambda}_{i, k}^{(s)}\}$ when $w_{i, k}^{- (p)} = w_{i, k}^{- (s)}$.

From equations (9) and (10), for household $k$ in SADA $i$,

\begin{aligned} w_{i, k}^{- (p)} = w_{i, k}^{- (s)} & \Leftrightarrow \hat{\lambda}^{(p)} = \hat{\lambda}^{(s)} \\ & \Leftrightarrow t_{*}^{(i)} = \left[ \sum_{k \in s_{i}} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right] \left[ \sum_{k \in s^{(p)}} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right]^{- 1} (t_{\tilde{x}} - \hat{t}_{\tilde{x}}) + \hat{t}_{*}^{(i)} . \end{aligned}

Using the above $t_{*}^{(i)}$ ensures the starting values of the provincial and SADA algorithms are equal, as this guarantees $\hat{\lambda}^{(p)} = \hat{\lambda}^{(s)}$, used to initialize $\hat{\lambda}_{i, k}^{(p)}$ and $\hat{\lambda}_{i, k}^{(s)}$. From step 1 of the algorithm, it then follows that

\begin{aligned} \tilde{w}_{p, i, k}^{(1)} = \tilde{w}_{s, i, k}^{(1)} & = d_{k} (1 + {\hat{\Phi}_{i, k}^{(p)}}^{(0)} {\hat{\lambda}_{i, k}^{(p) T}}^{(0)} \tilde{x}_{k}) \\ & = d_{k} (1 + {\hat{\Phi}_{i, k}^{(s)}}^{(0)} {\hat{\lambda}_{i, k}^{(s) T}}^{(0)} \tilde{x}_{k}), \end{aligned}

since ${\hat{\lambda}_{i, k}^{(p)}}^{(0)} = {\hat{\lambda}_{i, k}^{(s)}}^{(0)} = \hat{\lambda}^{(p)} = \hat{\lambda}^{(s)}$ and ${\hat{\Phi}_{i, k}^{(p)}}^{(0)} = {\hat{\Phi}_{i, k}^{(s)}}^{(0)} = 1$. If $\tilde{w}_{p, i, k}^{(1)} = \tilde{w}_{s, i, k}^{(1)}$ falls outside the range restriction, both weights are adjusted in the next step to meet the constraint; therefore ${\hat{\Phi}_{i, k}^{(p)}}^{[1]} = {\hat{\Phi}_{i, k}^{(s)}}^{[1]}$ and

\begin{aligned} w_{p, i, k}^{(1)} = w_{s, i, k}^{(1)} & = d_{k} (1 + {\hat{\Phi}_{i, k}^{(p)}}^{[1]} {\hat{\lambda}_{i, k}^{(p) T}}^{(0)} \tilde{x}_{k}) \\ & = d_{k} (1 + {\hat{\Phi}_{i, k}^{(s)}}^{[1]} {\hat{\lambda}_{i, k}^{(s) T}}^{(0)} \tilde{x}_{k}) . \end{aligned}

Next, set ${\hat{\lambda}_{i, k}^{(p)}}^{(1)} = {\hat{\lambda}_{i, k}^{(s)}}^{(1)}$ from step 3:

\begin{aligned} & {\hat{\lambda}_{i, k}^{(p)}}^{(1)} = {\hat{\lambda}_{i, k}^{(s)}}^{(1)} \\ \Leftrightarrow\ & {\hat{\lambda}_{i, k}^{(p)}}^{(0)} + \frac{1}{{\hat{\Phi}_{i, k}^{(p)}}^{[1]}} \left[ \sum_{k \in s^{(p)}} \delta_{k} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right]^{- 1} \left( t_{\tilde{x}} - \sum_{k \in s^{(p)}} \delta_{k} d_{k} \tilde{x}_{k} \right) \\ & = {\hat{\lambda}_{i, k}^{(s)}}^{(0)} + \frac{1}{{\hat{\Phi}_{i, k}^{(s)}}^{[1]}} \left[ \sum_{k \in s_{i}} \delta_{k} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right]^{- 1} \left( t_{*}^{(i)} - \sum_{k \in s_{i}} \delta_{k} d_{k} \tilde{x}_{k} \right) \\ \Leftrightarrow\ & \left[ \sum_{k \in s^{(p)}} \delta_{k} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right]^{- 1} \left( t_{\tilde{x}} - \sum_{k \in s^{(p)}} \delta_{k} d_{k} \tilde{x}_{k} \right) \\ & = \left[ \sum_{k \in s_{i}} \delta_{k} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right]^{- 1} \left( t_{*}^{(i)} - \sum_{k \in s_{i}} \delta_{k} d_{k} \tilde{x}_{k} \right), \end{aligned}

where $\delta_{k} = 0$ if $\tilde{w}_{p, i, k}^{(1)} = \tilde{w}_{s, i, k}^{(1)}$ was truncated in step 1 and $\delta_{k} = 1$ otherwise. The second equivalence uses the equalities of ${\hat{\lambda}_{i, k}}^{(0)}$ and ${\hat{\Phi}_{i, k}}^{[1]}$ between the two problems established above. The final equality holds if and only if

\begin{aligned} t_{*}^{(i)} = \left[ \sum_{k \in s_{i}} \delta_{k} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right] \left[ \sum_{k \in s^{(p)}} \delta_{k} d_{k} \tilde{x}_{k} \tilde{x}_{k}^{T} \right]^{- 1} \left( t_{\tilde{x}} - \sum_{k \in s^{(p)}} \delta_{k} d_{k} \tilde{x}_{k} \right) + \sum_{k \in s_{i}} \delta_{k} d_{k} \tilde{x}_{k}, \end{aligned}

which is the same functional form given in equation (10). The above arguments are repeated for the subsequent iterations, illustrating that the SADA and provincial algorithms will converge to the same values of $\{\hat{\Phi}_{i, k}^{(p)}, \hat{\lambda}_{i, k}^{(p)}\} = \{\hat{\Phi}_{i, k}^{(s)}, \hat{\lambda}_{i, k}^{(s)}\}$ when $t_{*}^{(i)}$ is derived to ensure the solutions to the calibration problems without range constraints are equal.

References

1. UNCEB. Leaving no one behind: Equality and non-discrimination at the heart of sustainable development. New York: United Nations, 2017.

2. ADB. Practical guidebook on data disaggregation for the sustainable development goals. Mandaluyong City, National Capital Region, Philippines: Asian Development Bank, 2021.

3. Statistics Canada. Canada is the first country to provide census data on transgender and non-binary people, www150.statcan.gc.ca/n1/daily-quotidien/220427/dq220427b-eng.htm (2022, accessed 20 January 2023).

4. Falorsi, Donmez, Khalil, et al. Alternative methods for disaggregating sustainable development goal indicators using survey data. Stat J IAOS 2022; 38: 611–623.

5. Deville, Särndal. Calibration estimators in survey sampling. J Am Stat Assoc 1992; 87: 376–382.

6. Singh, Mohl. Understanding calibration estimators in survey sampling. Surv Methodol 1996; 22: 107–116.

7. US Census Bureau. Guidance for economic census geographies users, www.census.gov/programs-surveys/economic-census/guidance-geographies/levels.html (2021, accessed 20 January 2023).

8. Devin, Verret. The development of a variance estimation methodology for large-scale dissemination of quality indicators for the 2016 Canadian census long form sample. In: JSM Proceedings. Chicago, United States: Survey Research Methods Section, American Statistical Association, 2016, pp.1977–1991.
