Fortified, assured, streamlined & trusted (FAST) services: Streamlined privacy protection in DataLabs

Abstract

The Australian Bureau of Statistics (ABS) is committed to improving microdata access while maintaining privacy and confidentiality through its virtual DataLab, which enables researchers to conduct complex analyses. Currently, DataLab research outputs must comply with strict disclosure rules before clearance, but the manual vetting process is costly and error-prone. As output volumes grow across diverse projects, so too does the risk of differencing, even when individual outputs meet disclosure requirements. To address this, the ABS has been developing streamlined output protection by equipping safe users with protection tools and implementing automated vetting systems. These tools use an enhanced cellkey methodology, assigning unique random keys to each contributing record and applying protection based on aggregated keys within each table cell. This ensures consistent protection across projects sharing the same contributors. The “same contributors, same noise” feature mitigates differencing risks and reduces protection costs when applied universally, while vetting systems verify that outputs are generated using approved tools before dissemination. Our first contribution is a prototype “Fortified, Assured, Streamlined, Trusted (FAST)” output protection toolkit built in R and R Shiny to streamline DataLab vetting processes. We also developed a sequential descent optimisation algorithm supporting both asymmetric and symmetric perturbation distributions. Our method integrates $(\epsilon, \delta)$ differential privacy parameters directly into the noise distribution design used in the ABS perturbation methodology.

Keywords

perturbation, differential privacy, data access, cellkey

1. Introduction

National statistics offices (NSOs) offer research data centre services to balance better data access against confidentiality protection. For example, the United Kingdom’s Office for National Statistics offers the Secure Research Service, Statistics Canada offers Research Data Centres, the United States has Federal Statistical Research Data Centres, and Statistics Netherlands offers the Data & Development Lab. In Australia, the ABS DataLab provides a safe environment for researchers to access a variety of individual, household and business microdata. The ABS DataLab is a data analysis solution for high-end data users who want to extract full value from ABS microdata,¹ and it supports researchers in undertaking complex research. DataLab sessions have grown significantly since 2020, rising from 13,279 to a peak of 38,286 in 2023. While 2024 saw a slight dip to 34,120 sessions, usage remains well above pre-2023 levels, reflecting continued strong demand. The current ABS output clearance process relies heavily on manual checking, making it neither scalable nor cost-effective, and leaving it vulnerable to human error. A further concern is that the growing volume of outputs across multiple projects may introduce compounding disclosure risks, even where each output has individually satisfied the strict clearance criteria. This risk is amplified by integrated datasets, which can reveal richer details about individuals’ and businesses’ activities and associations than outputs from the non-integrated datasets would suggest.²

There are various approaches and methodologies for developing effective confidentiality tools to ensure safe outputs. This work builds on the ABS perturbation methodology, which comprises two components: the cellkey methodology, ensuring consistency in output protection, and the entropy maximisation methodology, which maximises uncertainty, making it more difficult for an attacker to “invert” the noise, while maintaining the utility of the statistical outputs.^3–5 We use the improved ABS perturbation methodology to develop these tools because it is widely accepted in Australia, and other national statistical offices also apply this approach to safeguard outputs (see Dove et al.⁶ for the United Kingdom and de Vries et al.⁷ for Portugal and Germany).

Our first contribution is the development of a prototype suite of output protection tools, termed “Fortified, Assured, Streamlined, Trusted (FAST)”, implemented in R. These tools build on prior research on output protection methods,^4,8–14 and are developed within an R Shiny framework. R Shiny was selected as the implementation platform due to the increasing adoption of R by NSOs for statistical production. As an open-source programming language, R facilitates transparency, collaboration, and the industrialisation of official statistics.^15,16

The FAST output protection prototype has been trialled for usability within the ABS DataLab and is expected to be rolled out more broadly later this year. Users can interact with the tools either programmatically via R or through a graphical user interface (GUI) built in R Shiny. This dual-interface design accommodates both advanced users and those with more limited programming expertise. The present version of FAST focuses primarily on count outputs, reflecting the fact that most outputs produced within the ABS DataLab are counts. Initial feedback indicates strong user interest in adopting the FAST output protection tools. Evidence of efficiency gains, including reductions in vetting time and manual checking, will be systematically assessed once the tools are deployed across all projects.

To further support NSOs in managing disclosure control workflows, the FAST suite includes a complementary system intended for NSO staff, FASTmanager, which streamlines the output vetting process to ensure the release of safe outputs — a key component for upholding the Five Safes principles.¹⁷ FASTmanager also provides mechanisms for the generation of random record keys required for the disclosure control method, and for the generation of the noise distributions used by FAST.

Our final contribution is the development of a sequential descent optimisation algorithm used to generate noise distributions for the FAST suite. This method accommodates both asymmetric and symmetric supports, while integrating $(ϵ, δ)$ differential privacy parameters in the noise generation process. Incorporating asymmetric supports is particularly important when released values are bounded below by zero. In such cases, the noise distribution must be asymmetric to preserve the validity of perturbed outputs while maintaining appropriate privacy protection. We build on the work of Bailie and Chien,¹³ Sadeghi and Chien¹⁴ by incorporating $(ϵ, δ)$ -DP parameters in the design of the noise distribution in the ABS perturbation methodology. This paper does not seek to determine the optimal choice of parameters for designing noise distributions, as NSOs operate under differing legislative and policy requirements for confidentiality. Instead, the approach provides flexibility to control the distribution according to specific confidentiality and analytical needs. The ABS’s interest in differential privacy lies in its potential to provide a principled and objective framework for balancing the trade-off between disclosure risk and data utility, thereby improving the accessibility of statistical information to users.

The paper is organised as follows: Section 2.1 provides a summary of the prototype tools, Section 2.2 describes the prototype output vetting tools, Section 3.1 describes our proposed improvements to the ABS perturbation methodology, and Section 4 provides a conclusion and proposes future research directions.

2. FAST

Figure 1 provides an overview of the workflow and relationships between the different FAST tools. Three internal R packages have been developed as part of this project. The FAST tools are designed to operate within secure research environments, enabling researchers to generate outputs that are disclosure-protected prior to release, consistent with the workflows of research data centres at NSOs. In this respect, the tools are transferable to other NSOs with similar trusted access infrastructures.

Figure 1.

Proposed workflow.

NSO staff use the sequentialdescent package to design the noise distribution used by FAST. The FASTmanager package is used by NSO staff to manage parts of the process such as generating random record keys (Rkeys) and vetting the output received from users. Trusted users in the DataLab environment use the FAST package itself for producing confidentialised tables which are then submitted for clearance.

The sequential descent mechanism for designing the noise distribution, discussed further in 3.1, represents a general methodological contribution and could be adopted by other NSOs as a principled approach to noise generation under constrained output domains. However, key parameters within the FAST framework — such as noise distributions and protection thresholds — are inherently context-dependent. These should be calibrated by individual NSOs in accordance with their legislative requirements, confidentiality policies, and risk tolerances, reflecting differing trade-offs between data utility and disclosure risk across jurisdictions.

2.1. FAST package

FAST output protection tools allow users to produce confidentialised tables in DataLab which are automatically cleared for output, rather than going through the manual output clearance process. There is still an oversight process to ensure the tool has been applied correctly using the FASTmanager package, but this is a significant reduction in effort compared to the existing system of manual output clearance.

FAST implements an improved version of the well-established ABS perturbation methodology,¹⁴ which has been used in the ABS Census TableBuilder since 2006.⁸ The methodology comprises two main components. First, the cellkey method ensures a “same contributor, same noise” property: if the same individual appears in multiple tables, their contribution is perturbed consistently, preventing users from reconstructing confidential information through differencing attacks. Second, the magnitude of this noise is derived via an entropy maximisation procedure,^4,8–10 and the resulting distribution is encoded in a perturbation table with 256 rows. Example 2.1.1 shows a comparison between the improved version of the cellkey method and the original cellkey method.

The method uses four pseudo-random record keys ${Rkey}_{i}^{j}$ , $j = 1, \dots, 4$ , attached to each record of the input file, and combines them in a consistent way to produce a CellKey for each cell in the table. This CellKey is then used as a seed for choosing the perturbation to apply to that cell. The CellKey is calculated as

\begin{aligned} CellKey = {CellKey}^{1} \oplus {CellKey}^{2} \oplus {CellKey}^{3} \oplus {CellKey}^{4}, \end{aligned}

where $\oplus$ is the bitwise exclusive-or operation (XOR), and the four components are calculated as

\begin{aligned} {CellKey}^{j} = \sum_{i \in Cell} {Rkey}_{i}^{j} \bmod bigN . \end{aligned}

The number bigN is a parameter of the method. In addition, we implement a memory-efficient quantile sampling approach, described in Section 5 in Sadeghi and Chien,¹⁴ to make full use of the noise distribution. This is critical for accurately capturing the tail behaviour of the noise distribution, which is essential for differential privacy guarantees. The noise distribution that is used is discussed in more detail in Section 3.1.
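The CellKey calculation above can be illustrated with a short sketch. FAST itself is implemented in R; the Python below is not the package's interface, and the function name `cell_key`, the tuple layout of the record keys and the value chosen for `bigN` are assumptions made purely for illustration.

```python
from functools import reduce
from operator import xor

BIG_N = 2**31  # assumed value; the text only states that bigN is a parameter


def cell_key(record_keys, big_n=BIG_N):
    """Combine the four record keys of every contributor to one table cell.

    record_keys: iterable of 4-tuples (Rkey^1, ..., Rkey^4), one per record.
    """
    record_keys = list(record_keys)
    # CellKey^j = (sum over contributors of Rkey_i^j) mod bigN, for j = 1..4
    components = [sum(rec[j] for rec in record_keys) % big_n for j in range(4)]
    # CellKey = CellKey^1 XOR CellKey^2 XOR CellKey^3 XOR CellKey^4
    return reduce(xor, components)


# Hypothetical record keys for two contributors, listed in two different orders.
cell_a = [(1, 2, 3, 4), (5, 6, 7, 8)]
cell_b = [(5, 6, 7, 8), (1, 2, 3, 4)]
print(cell_key(cell_a) == cell_key(cell_b))
```

Because neither the sums nor the XOR depend on the order of the records, a cell with the same contributors receives the same CellKey in every table, which is what delivers the “same contributors, same noise” property.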

These tools are intended to cater for researchers with different preferences, i.e. those who prefer to interact with the tools programmatically and those who prefer a user interface. One important consideration is that not all NSOs have strong capabilities in developing user interfaces. There is a good argument for exploring collaboration between NSOs and open-source or commercial communities to continue enhancing the prototype’s capabilities and to provide ongoing development support. Currently, FAST can only cater for outputs of counts, but there is ongoing work to extend it to other statistical outputs, e.g. magnitudes.

It is important to note that FAST tools are provided to trusted users who have been vetted through a governance process, enabling them to access the original microdata within DataLab. Consequently, there are no strict protection mechanisms to prevent these trusted users from viewing the parameters or noise distributions used in FAST. If user trust cannot be assumed, a web-based solution like ABS TableBuilder would be more appropriate, as it prevents users from accessing the underlying microdata or protection parameters.

Figure 2 shows the user interface for FAST, allowing users to produce confidentialised tables.

Figure 2.

FAST user-interface.

R Shiny also supports the development of a metadata tab that allows users to explore the data assets (see Figure 3).

Figure 3.

FAST user-interface metadata tab.

Figure 4 shows how researchers can use an R program to confidentialise outputs.

Figure 4.

FAST R program.

The software demonstration is available online at https://github.com/joseph-chien/FAST-demo.

2.1.1. Example

Figure 5, adapted from Dove et al.,⁶ illustrates the current single-key perturbation methodology. Panel (a) shows that four contributors fall into the 0–19 male category, and each contributor is assigned a 32-bit record key $rkey_{i}$. These four keys are summed (with modulo arithmetic to avoid overflow), and their four 8-bit bytes are combined using a bitwise XOR to produce $cellKey = 45$. Jones¹⁸ discusses how the XOR step prevents linear additivity between interior and margin cells. Panel (b) shows the perturbation table derived via the entropy maximisation algorithm, indexed by row and cell value. In this example, row $45$ and cell value $4$ yield $Z = -1$, so the published count equals $q(x') - 1$.
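The single-key calculation in this example can be sketched as follows. This is a Python illustration rather than ABS production code: it reads the description as XOR-ing the four bytes of the 32-bit sum, assumes wrap-around at $2^{32}$, and uses made-up record keys chosen so that the row index matches the value 45 in the worked example. The one-entry perturbation table is likewise a toy stand-in for the 256-row table.

```python
def single_key_cellkey(record_keys):
    """Sum 32-bit record keys and fold the four bytes of the sum with XOR."""
    total = sum(record_keys) % 2**32          # assumed 32-bit wrap-around
    key = 0
    for byte in total.to_bytes(4, "big"):     # the four 8-bit bytes of the sum
        key ^= byte
    return key                                # a row index between 0 and 255


# Hypothetical record keys for the four contributors in the cell.
rkeys = [10, 20, 5, 10]
row = single_key_cellkey(rkeys)

# Toy perturbation table holding only the entry used in the worked example:
# row 45 and cell value 4 give a perturbation of -1.
toy_table = {(45, 4): -1}
true_count = len(rkeys)
published = true_count + toy_table[(row, true_count)]
print(row, published)
```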

Figure 5.

ABS perturbation method.

Figure 6 presents the FAST perturbation methodology using the same example and demonstrates how these methods achieve consistent results in this example. In general, we do not expect the two methods to return exactly the same results. Panel (a) shows how four record keys $Rkey^{1}$–$Rkey^{4}$ are summed to produce intermediate keys ${CellKey}^{1}$–${CellKey}^{4}$, which are then combined via bitwise XOR, with modulo reduction, to obtain the final $CellKey$. Using four 32-bit keys expands the possible values to approximately $2$ billion. Panel (b) shows that the FAST perturbation distribution is derived using a sequential descent algorithm, which allows users to incorporate $(ϵ, δ)$-DP parameters directly into the noise construction. This provides a robust framework for NSOs to evaluate the utility–protection trade-off across different parameter choices.

Figure 6.

FAST perturbation method.

2.2. FASTmanager tools

To support the management of disclosure control workflows, we develop a complementary prototype system, FASTmanager, which streamlines the output vetting process to ensure the release of safe outputs. The system also provides a mechanism for generating random record keys based on unique identifiers, thereby minimising duplication.

While the output protection tools are deployed within a trusted user paradigm, the ABS retains responsibility for minimising processing errors and ensuring procedural compliance. For example, the ABS must verify that statistical outputs have been generated using the FAST output protection tools and that appropriate protection mechanisms have been applied prior to public release. To address this requirement, this research implements a hash-based validation system using digital signatures for output comma-separated value (CSV) files.

Figure 7 illustrates how the SHA-256 algorithm generates unique signatures for each CSV file by combining the filename, timestamp, and a cryptographic salt. The output vetting team can then use the FAST vetting tool to verify these signatures prior to approving outputs for release. This approach substantially reduces manual intervention in the vetting process while maintaining assurance that outputs have passed through the required protection workflow. There remains scope to further refine the system towards full automation. Other NSOs facing similar output vetting challenges may benefit from adopting hash-based validation approaches to streamline their release processes.
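A minimal sketch of such a signature scheme is given below, using Python's standard `hashlib` module. It follows the description above (filename, timestamp and salt hashed with SHA-256), but the function names, the `|` separator and the example values are assumptions; the actual FASTmanager implementation is in R and may differ in how the inputs are combined.

```python
import hashlib
import hmac


def sign_output(filename, timestamp, salt):
    """Return a hex SHA-256 signature over filename, timestamp and salt."""
    message = "|".join([filename, timestamp, salt]).encode("utf-8")
    return hashlib.sha256(message).hexdigest()


def verify_output(filename, timestamp, salt, signature):
    """Recompute the signature and compare it in constant time."""
    expected = sign_output(filename, timestamp, salt)
    return hmac.compare_digest(expected, signature)


# Hypothetical values: the salt would be held by the NSO, not by the user.
sig = sign_output("table_01.csv", "2025-01-01T09:00:00Z", "nso-secret-salt")
print(verify_output("table_01.csv", "2025-01-01T09:00:00Z", "nso-secret-salt", sig))
```

As written, the signature covers only the metadata named in the text. Whether the file contents are also bound to the signature is not stated, so that choice is left open here.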

Figure 7.

A signature approach.

Consistency in generating record keys is required to uphold the “same contributors, same noise” principle. FASTmanager therefore incorporates a method for generating reproducible record keys based on a deterministic hash-based message authentication code (HMAC) with adaptive salting. This approach ensures reproducibility while minimising the risk of collisions (duplication) in large administrative datasets. The improved cellkey method substantially reduces the risk of duplication and is designed to make such occurrences negligible in practice.
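The idea of deterministic, HMAC-based record keys can be sketched with Python's standard `hmac` module. The function name, the secret key, the way the key index is appended to the identifier and the reduction modulo an assumed `bigN` are all illustrative assumptions, and the adaptive salting step mentioned above is not reproduced.

```python
import hashlib
import hmac


def record_keys(unique_id, secret, big_n=2**31, n_keys=4):
    """Derive n_keys reproducible record keys from a unique identifier."""
    keys = []
    for j in range(1, n_keys + 1):
        message = f"{unique_id}|{j}".encode("utf-8")
        digest = hmac.new(secret, message, digestmod=hashlib.sha256).digest()
        # Use the first 8 bytes of the digest as an integer, reduced to [0, bigN).
        keys.append(int.from_bytes(digest[:8], "big") % big_n)
    return keys


secret = b"hypothetical-nso-secret"
print(record_keys("person-000123", secret) == record_keys("person-000123", secret))
```

The same identifier always maps to the same four keys, so a record keeps its keys across datasets and projects, while anyone without the secret cannot reproduce them.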

The risk of such collisions is well illustrated by the classical birthday paradox from probability theory.¹⁹ The birthday paradox demonstrates that even when millions of distinct key values are available, random assignment can lead to duplicated values after only a few thousand draws. This highlights why naive random key generation may result in collisions in large administrative datasets, and why collision-resistant, deterministic hashing methods are required to ensure consistent and reproducible record keys. The proposed approach mitigates birthday-problem collisions in large administrative datasets (see Menezes et al.²⁰ and further details in Appendix E).
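The scale of the problem can be checked with the standard birthday approximation, $P(\text{collision}) \approx 1 - \exp(-m(m-1)/(2N))$ for $m$ draws from $N$ equally likely values. The snippet below is a small Python illustration of that formula only; the key-space sizes are example values, not FAST parameters.

```python
import math


def collision_probability(n_draws, n_values):
    """Approximate probability of at least one duplicate among n_draws."""
    return -math.expm1(-n_draws * (n_draws - 1) / (2 * n_values))


# Classic check: 23 people and 365 birthdays give roughly an even chance.
print(round(collision_probability(23, 365), 2))

# A few thousand draws from one million possible key values.
print(collision_probability(2_000, 1_000_000) > 0.8)
```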

3. Confidentiality methodology improvements

3.1. Perturbation with asymmetric supports under $(ϵ, δ)$ -DP

The previous sections describe the prototype FAST output protection and the FASTmanager tools that are used to streamline the process. This section describes the methodology implemented in the sequentialdescent package to create the noise distribution that is a key component of the FAST tools.

Adding consistent statistical noise to published outputs provides mathematically rigorous protection against re-identification in today’s data-saturated environment. Ideally, the noise distribution will be unbiased and produce non-negative perturbed counts to maximise the analytical utility of the perturbed outputs, and to meet the expectations of users. The ABS employs a perturbation technique that randomly adjusts all cell values by small amounts, introducing negligible bias while preserving the analytical utility of tables and safeguarding individual confidentiality in aggregate statistics.²¹ In recent years, the U.S. Census Bureau adopted differential privacy after internal research revealed that earlier disclosure avoidance methods were susceptible to database reconstruction attacks, enabling adversaries to link approximately half of the 2010 Census population to commercial datasets using publicly available information.²² The ABS’s interest in exploring differential privacy lies in balancing the trade-off between risk and utility in a more objective manner while making information more accessible.

Incorporating differential privacy into the entropy maximisation framework provides a rigorous analytical method to quantify and compare privacy-utility trade-offs through the $(ϵ, δ)$-DP parameters. As shown in Proposition 2 in Sadeghi and Chien,¹⁴ this analytical characterisation enables the optimal selection of perturbation variance to achieve a desired $δ$ for a given $ϵ$ and perturbation support. This guidance in designing the noise distribution allows practitioners to make informed choices about noise parameters rather than relying solely on empirical evaluation. Under the $(ϵ, δ)$-DP framework, the confidentiality protection is quantified through two parameters: $ϵ$ and $δ$. The privacy budget parameter $ϵ$ controls the strength of confidentiality protection, where smaller $ϵ$ values provide stronger privacy guarantees by limiting how much the perturbed outputs can differ between neighbouring datasets (those differing by a single record). The failure probability parameter $δ$ represents the probability that the $ϵ$-differential privacy guarantee does not hold, so smaller $δ$ values reduce the likelihood of confidentiality breaches. Together, these parameters allow data custodians to make explicit trade-offs between privacy protection and data utility.

In the context of designing a noise distribution, a symmetric support would be one where the possible noise values are all integers bounded between $- D$ and $D$ for some value $D$ . Asymmetric supports are essential in practice to avoid producing negative perturbed outputs, or perturbed outputs that appear to pose disclosure risks. For instance, if the true count is $4$ , allowing a perturbation of $- 3$ would produce an output of $1$ , which could mislead researchers into believing they are viewing genuine small cells with potential identification concerns, when in fact these low values are artefacts of the perturbation. By restricting the support to exclude such problematic small counts in the output, the noise mechanism prevents this confusion, but this asymmetry fundamentally changes the optimisation problem used to design the noise distribution.

Our notation and exposition follow Sadeghi and Chien.¹⁴ We use the natural logarithm throughout the paper. We consider a single counting query function $q$ from a dataset $x$, where $x$ is drawn from a universe of datasets $X$. The true count is $q (x) = n$. A discrete-valued random variable $Z$, drawn independently with probability mass function (pmf) $p_{Z}$, is added to the true count to give the random query response $M (x)$:

\begin{aligned} M (x) = q (x) + Z . \end{aligned}

(1)

The parameters of the noise pmf are assumed to be independent of the dataset $x$. We add the constraints that the outputs are non-negative, i.e. $M (x) \geq 0$, and that the noise is bounded by a maximum value $D$. This leads us to define the support of the noise distribution for a given true count $n$ as

\begin{aligned} Z_{n} = \{z_{i} \in [- D, D] \mid n + z_{i} = 0 \text{ or } n + z_{i} \geq k\}, \end{aligned}

where $k$ is a threshold such that non-zero outputs below $k$ may lead to a high perceived disclosure risk. If the value of $n$ is irrelevant, we may omit the subscript and write $Z$.
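The support definition above translates directly into a few lines of code. The Python sketch below is for illustration only (the sequentialdescent package itself is written in R), and the function name and example parameter values are assumptions.

```python
def support(n, D, k):
    """Noise values z in [-D, D] whose output n + z is either 0 or at least k."""
    return [z for z in range(-D, D + 1) if n + z == 0 or n + z >= k]


# With D = 3 and k = 3, a true count of 4 cannot be perturbed down to 1 or 2.
print(support(4, D=3, k=3))
# A large true count keeps the full symmetric support.
print(support(10, D=3, k=3))
```

For small true counts the support loses some of its negative values, which is the asymmetry that the remainder of this section has to accommodate.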

The ABS entropy maximisation method applies Jaynes’s³ principle by maximising uncertainty about individual values while preserving statistical integrity through an unbiased noise distribution, thereby minimising exploitable information while maintaining aggregate accuracy for valid inference. Marley and Leaver⁹ describe the entropy maximisation constraints proposed by Fraser and Wooton⁸ as:

\begin{aligned} \max_{p_{Z}} H (Z) = \max_{p_{Z}} \sum_{z_{i} \in Z} p (z_{i}) \log \frac{1}{p (z_{i})}, \end{aligned}

(2)

\begin{aligned} \text{s.t.} \begin{cases} E [Z] = 0, & \text{zero bias}, \\ E [Z^{2}] \leq V, & \text{variance constraint}, \\ \sum_{z_{i} \in Z} p (z_{i}) = 1, & \text{valid pmf}, \\ p (z_{i}) \geq 0, \forall z_{i} \in Z, & \text{valid pmf}. \end{cases} \end{aligned}

(3)

Sadeghi and Chien¹⁴ make two assumptions in order to analytically characterise and optimise the differential privacy performance of the entropy maximisation method: (1) the support is the symmetric set $Z = [- D, D]$ for some $D \in \mathbb{N}$, and (2) the true count $q (x) = n$ satisfies $n \geq D$, which ensures that the random query output $M (x)$ remains non-negative.

Sadeghi and Chien¹⁴ specify the entropy maximisation optimisation problem in (2) as

\begin{aligned} \max_{p_{Z}} H (Z) = \max_{p_{Z}} \sum_{z_{i} = - D}^{D} p (z_{i}) \log \frac{1}{p (z_{i})}, \end{aligned}

(4)

\begin{aligned} \text{s.t.} \begin{cases} \sum_{z_{i} = - D}^{D} z_{i} p (z_{i}) = 0, & \text{zero bias}, \\ \sum_{z_{i} = - D}^{D} z_{i}^{2} p (z_{i}) \leq V, & \text{variance constraint}, \\ \sum_{z_{i} = - D}^{D} p (z_{i}) = 1, & \text{valid pmf}, \\ p (z_{i}) \geq 0, z_{i} \in [- D, D], & \text{valid pmf}. \end{cases} \end{aligned}

(5)

The solution to the entropy maximisation problem with symmetric supports is

\begin{aligned} p (z_{i}) = C \exp (- γ z_{i}^{2}), z_{i} \in [- D, D], \end{aligned}

(6)

where $C$ is the normalisation constant that satisfies

\begin{aligned} \sum_{z_{i} = - D}^{D} C \exp (- γ z_{i}^{2}) = 1 \end{aligned}

and is given by

\begin{aligned} C = \frac{1}{\sum_{z_{i} = - D}^{D} \exp (- γ z_{i}^{2})} . \end{aligned}

(7)
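Equations (6) and (7) are straightforward to evaluate numerically. The following Python sketch, with illustrative function names and parameter values, builds the symmetric pmf and shows how $γ$ controls the variance that enters the constraint in Equation (5).

```python
import math


def symmetric_pmf(D, gamma):
    """p(z) = C * exp(-gamma * z^2) on the integers -D..D, Equations (6)-(7)."""
    weights = {z: math.exp(-gamma * z * z) for z in range(-D, D + 1)}
    C = 1.0 / sum(weights.values())          # normalisation constant
    return {z: C * w for z, w in weights.items()}


def mean(pmf):
    return sum(z * p for z, p in pmf.items())


def variance(pmf):
    m = mean(pmf)
    return sum((z - m) ** 2 * p for z, p in pmf.items())


for gamma in (0.1, 0.3, 1.0):
    pmf = symmetric_pmf(3, gamma)
    print(gamma, round(mean(pmf), 12), round(variance(pmf), 3))
```

Larger values of $γ$ concentrate the mass around zero and reduce the variance, while $γ = 0$ gives the uniform distribution on the support.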

Under the formulation in Equation (6), the noise design has no bias, i.e. $E [Z] = 0$ , and this is achieved because the supports are symmetric. However, depending on the design, it becomes much more difficult (sometimes impossible) to achieve unbiasedness when the supports are asymmetric.

Under the more general formulation in Equation (2), with an asymmetric support $Z$ , the noise distribution has the following form:

\begin{aligned} p (z_{i}) = C \exp (- γ z_{i}^{2} - β z_{i}), z_{i} \in Z, \end{aligned}

(8)

where $C$ is the normalisation constant given by

\begin{aligned} C = \frac{1}{\sum_{z_{i} \in Z} \exp (- γ z_{i}^{2} - β z_{i})} . \end{aligned}

(9)

Equation (8) now has two parameters: $γ$ and $β$ . We develop a sequential descent algorithm that searches for the best $γ$ and $β$ that meet the desired $ϵ$ (the privacy budget parameter that quantifies the privacy loss, where smaller $ϵ$ provides stronger confidentiality protection).¹ The algorithm accommodates both symmetric and asymmetric supports, providing flexibility to control perturbed outputs according to specific confidentiality and analytical needs.

The sequential descent algorithm uses the method of Sadeghi and Chien¹⁴ to create a symmetric noise distribution for a given true count $n$ as a starting point.² The starting value of $n$ in our current method is based on the maximum noise value $D$ . Since the perturbed count is restricted to be non-negative, at the true count $n = D$ this is equivalent to having no restriction, and the support of the noise distribution is symmetric. The algorithm then creates noise distributions for the rest of the true counts, i.e. $n = D - 1, D - 2, \dots, 2, 1, 0$ . For $n \geq D$ all true counts have the same noise distribution.

The sequential descent algorithm provides the flexibility to use asymmetric supports while ensuring unbiasedness (i.e., $\sum_{z_{i} \in Z} z_{i} p (z_{i}) = 0$ ), which is difficult to achieve analytically for imbalanced supports. The algorithm has two key steps. First, it uses a bisection method to search for the corresponding value of $β$ that satisfies the unbiasedness constraint for a given $γ$ and asymmetric support.

To justify using the bisection method to find a value of $β$ that makes the distribution unbiased (i.e., $\sum_{z_{i} \in Z} z_{i} p (z_{i}) = 0$ ), we can demonstrate that there always exist starting points for the bisection method that have a positive bias and a negative bias.

In our framework, achieving a desired $(ϵ, δ)$ pair while maintaining unbiasedness requires careful selection of the noise distribution parameters $γ$ and $β$, particularly when supports are asymmetric. The following proposition establishes a key property that enables our sequential descent algorithm to find appropriate parameter values. We introduce the following notation: $ϵ_{\text{target}}$ and $δ_{\text{target}}$ denote the target privacy parameters specified by users, while $ϵ_{\text{measured}}$ and $δ_{\text{measured}}$ denote the corresponding values calculated by the algorithm for given $γ$ and $β$.

Proposition 1

If the support $Z \subset [- D, D]$ contains at least one negative value $z_{*} \leq - 1$, and if $γ \geq 0$, then there exists some $β$ such that $\mathrm{bias} (β, γ) < 0$, where

\begin{aligned} \mathrm{bias} (β, γ) = \sum_{z_{i} \in Z} z_{i} C \exp (- γ z_{i}^{2} - β z_{i}) . \end{aligned}

The proof of this statement can be found in Appendix B.
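The bisection step can be sketched as follows. This is an illustrative Python version, not the sequentialdescent code: the function names, the bracket-doubling strategy and the tolerance are assumptions. It relies on the fact that, for the form in Equation (8), the bias decreases as $β$ increases, so a bracket with a positive bias at one end and a negative bias at the other contains the unbiased $β$.

```python
import math


def maxent_pmf(support, gamma, beta):
    """p(z) proportional to exp(-gamma*z^2 - beta*z) on the support, Equation (8)."""
    support = list(support)
    exponents = [-gamma * z * z - beta * z for z in support]
    shift = max(exponents)                     # avoids overflow for large |beta|
    weights = [math.exp(e - shift) for e in exponents]
    total = sum(weights)
    return {z: w / total for z, w in zip(support, weights)}


def bias(support, gamma, beta):
    return sum(z * p for z, p in maxent_pmf(support, gamma, beta).items())


def solve_beta(support, gamma, tol=1e-12, max_iter=200):
    """Bisection on beta so that the noise distribution has zero mean."""
    support = list(support)
    if min(support) >= 0 or max(support) <= 0:
        raise ValueError("support needs a negative and a positive value")
    lo, hi = -1.0, 1.0
    while bias(support, gamma, lo) <= 0:       # widen until the bias is positive
        lo *= 2
    while bias(support, gamma, hi) >= 0:       # widen until the bias is negative
        hi *= 2
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        b = bias(support, gamma, mid)
        if abs(b) < tol:
            return mid
        if b > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


asymmetric = [-1, 0, 1, 2, 3]                  # e.g. true count 1 with D = 3
beta = solve_beta(asymmetric, gamma=0.1)
print(beta > 0, abs(bias(asymmetric, 0.1, beta)) < 1e-9)
```

For a support that extends further to the right than to the left, the solution has $β > 0$, which tilts probability mass towards the negative values until the mean is zero.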

Second, the algorithm uses a line search to find a value of $γ$ such that the noise distribution for $n$ is sufficiently close to the noise distribution for $n + 1$ , then continues sequentially for $n - 1$ . To decide whether the distributions are sufficiently close, we use a modified formula from Bailie and Chien¹³ to calculate

\begin{aligned} ϵ_{\text{measured}} = \max_{z_{i}} \left| \log \frac{p (z_{i} - 1 \mid n + 1)}{p (z_{i} \mid n)} \right| \end{aligned}

where the maximum is taken over the set $\{z_{i} \in Z_{n} : z_{i} - 1 \in Z_{n + 1}\}$. If $ϵ_{\text{measured}}$ is less than the target value of $ϵ_{\text{target}}$ then the noise distributions are sufficiently close.
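The measured $ϵ$ can be computed directly from two pmfs stored as dictionaries. The Python sketch below uses assumed helper names and, as a check, compares two identical symmetric distributions: in that case the log-ratio at $z_{i}$ is $γ (2 z_{i} - 1)$, so the maximum over the overlapping values is $γ (2 D - 1)$.

```python
import math


def pmf(support, gamma, beta=0.0):
    """p(z) proportional to exp(-gamma*z^2 - beta*z) on the given support."""
    support = list(support)
    weights = [math.exp(-gamma * z * z - beta * z) for z in support]
    total = sum(weights)
    return {z: w / total for z, w in zip(support, weights)}


def epsilon_measured(p_n, p_next):
    """Largest |log p(z-1 | n+1) / p(z | n)| over z with both terms in support."""
    overlap = [z for z in p_n if (z - 1) in p_next]
    return max(abs(math.log(p_next[z - 1] / p_n[z])) for z in overlap)


D, gamma = 2, 0.5
p = pmf(range(-D, D + 1), gamma)
# Same symmetric distribution for n and n + 1, as is the case when n >= D.
print(epsilon_measured(p, p), gamma * (2 * D - 1))
```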

This only guarantees the $ϵ$-DP bound when the perturbed count is in the range $n - D + 1$ to $n + D$, but the probabilities at the extremes, $p (D \mid n + 1)$ and $p (- D \mid n)$, might be positive. If we calculate $δ_{\text{measured}}$ as

\begin{aligned} δ_{\text{measured}} = \max (p (D \mid n + 1), p (- D \mid n)), \end{aligned}

and it is less than the target value of $δ$, then

\begin{aligned} p (z_{i} \mid n) & \leq e^{ϵ} p (z_{i} - 1 \mid n + 1) + δ, \text{ and} \end{aligned}

(10)

\begin{aligned} p (z_{i} - 1 \mid n + 1) & \leq e^{ϵ} p (z_{i} \mid n) + δ \end{aligned}

(11)

for all values of $z_{i}$, even outside of the supports $Z_{n}$ and $Z_{n + 1}$.
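Inequalities (10) and (11) can be verified numerically for a given pair of distributions. The Python sketch below is illustrative, with assumed function names; probabilities outside a support are treated as zero, and the example uses the same symmetric distribution for $n$ and $n + 1$ with $D = 2$ and $γ = 0.5$, for which the measured $ϵ$ is $γ (2 D - 1) = 1.5$.

```python
import math


def pmf(support, gamma, beta=0.0):
    """p(z) proportional to exp(-gamma*z^2 - beta*z) on the given support."""
    support = list(support)
    weights = [math.exp(-gamma * z * z - beta * z) for z in support]
    total = sum(weights)
    return {z: w / total for z, w in zip(support, weights)}


def delta_measured(p_n, p_next, D):
    """max(p(D | n+1), p(-D | n)), with zero probability outside the support."""
    return max(p_next.get(D, 0.0), p_n.get(-D, 0.0))


def dp_inequalities_hold(p_n, p_next, eps, delta):
    """Check (10) and (11) for every z, including values outside the supports."""
    candidates = set(p_n) | {z + 1 for z in p_next}
    for z in candidates:
        a = p_n.get(z, 0.0)             # p(z | n)
        b = p_next.get(z - 1, 0.0)      # p(z - 1 | n + 1)
        if a > math.exp(eps) * b + delta or b > math.exp(eps) * a + delta:
            return False
    return True


D = 2
p = pmf(range(-D, D + 1), gamma=0.5)
delta = delta_measured(p, p, D)
print(dp_inequalities_hold(p, p, eps=1.5, delta=delta))
print(dp_inequalities_hold(p, p, eps=1.5, delta=0.0))
```

With $δ = 0$ the check fails at the extreme values, where one of the two probabilities is zero and the other is positive; this is the gap that $δ_{\text{measured}}$ accounts for.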

In practice, we do not set a target $δ_{\text{target}}$; we select the $γ$ with the smallest $δ_{\text{measured}}$ subject to $ϵ_{\text{measured}} < ϵ_{\text{target}}$. It is nonetheless important to consider the corresponding $δ_{\text{target}}$ in the design (see Appendix A for a discussion).

The step down process in the sequential descent algorithm is described below:
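As a rough illustration of the step-down process, the following Python sketch strings together the pieces described in this section: a symmetric starting distribution at $n = D$, a bisection for $β$, and a search over candidate $γ$ values. It is a simplified reading of the text and not the sequentialdescent implementation. In particular, it assumes $k = 1$ (non-negativity only), takes the starting $γ$ as given, searches a fixed grid of candidates, breaks ties in $δ_{\text{measured}}$ by grid order, and stops at $n = 1$.

```python
import math


def pmf(support, gamma, beta):
    exps = [-gamma * z * z - beta * z for z in support]
    shift = max(exps)                           # numerical stability
    w = [math.exp(e - shift) for e in exps]
    total = sum(w)
    return {z: wi / total for z, wi in zip(support, w)}


def bias(p):
    return sum(z * q for z, q in p.items())


def solve_beta(support, gamma, iters=200):
    """Bisection for the beta that gives zero bias (bias decreases in beta)."""
    lo, hi = -1.0, 1.0
    while bias(pmf(support, gamma, lo)) <= 0:
        lo *= 2
    while bias(pmf(support, gamma, hi)) >= 0:
        hi *= 2
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bias(pmf(support, gamma, mid)) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def eps_measured(p_n, p_next):
    return max(abs(math.log(p_next[z - 1] / p_n[z])) for z in p_n if z - 1 in p_next)


def delta_measured(p_n, p_next, D):
    return max(p_next.get(D, 0.0), p_n.get(-D, 0.0))


def step_down(D, gamma0, eps_target, gamma_grid):
    """Return {n: pmf} for n = D, D-1, ..., 1 (assumes k = 1)."""
    dists = {D: pmf(list(range(-D, D + 1)), gamma0, 0.0)}   # symmetric start
    for n in range(D - 1, 0, -1):
        support = list(range(-n, D + 1))        # outputs n + z stay non-negative
        feasible = []
        for g in gamma_grid:
            p = pmf(support, g, solve_beta(support, g))
            if eps_measured(p, dists[n + 1]) < eps_target:
                feasible.append((delta_measured(p, dists[n + 1], D), p))
        if not feasible:
            raise ValueError(f"no candidate gamma meets eps_target at n={n}")
        dists[n] = min(feasible, key=lambda item: item[0])[1]
    return dists


grid = [0.1 * i for i in range(1, 11)]          # hypothetical candidate values
dists = step_down(D=3, gamma0=0.3, eps_target=3.0, gamma_grid=grid)
for n in sorted(dists):
    print(n, sorted(dists[n]), abs(bias(dists[n])) < 1e-9)
```

Each distribution is matched against the one just above it, so the privacy check is always between neighbouring true counts $n$ and $n + 1$.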

Note that under the constraints of the noise being unbiased and the perturbed counts non-negative, a true count of $n = 0$ cannot have any noise added to it. This means it is impossible to achieve a small $δ_{\text{measured}}$ between $n = 0$ and $n = 1$. In addition, perturbing $0$ makes it difficult to minimise the release of nonsensical statistical information if the total query space is large.

We also implement an alternative version of the algorithm using parallel computation to speed up the search for $γ$ (see Algorithm 1 in Appendix D for a detailed description).

4. Conclusion

Research data centre services, such as the ABS DataLab, provide secure environments that enable trusted researchers to extract value from rich microdata while maintaining confidentiality. This research contributes to strengthening NSOs’ capabilities by supporting more timely, consistent, and cost-effective output clearance processes. In particular, the extended use of the cellkey method promotes consistent protection across outputs and reduces differencing risks arising from similar queries across projects using the same underlying data.

This paper makes two key contributions to disclosure control for official statistics. First, we develop a practical framework for output protection, including the FAST and FASTmanager prototypes, which together support researchers and output vetting teams. These tools facilitate the generation and verification of protected outputs through programmatic and graphical interfaces, while streamlining compliance and reducing manual vetting effort within NSOs. While FAST is currently limited to confidentialising tables of counts, these represent a majority of the outputs produced by researchers in the ABS DataLab.

Second, we propose a sequential descent algorithm for constructing symmetric and asymmetric noise distributions that satisfy $(ϵ, δ)$ -differential privacy. This new methodology is used by the FASTmanager package. This provides a flexible and principled approach for designing perturbation mechanisms that better reflect real-world constraints, such as bounded outputs, while maintaining an appropriate balance between confidentiality protection and analytical utility.

Taken together, these contributions demonstrate how modern confidentiality protection methods can be operationalised within existing NSO infrastructures. They provide a pathway towards more scalable, transparent, and efficient output release processes, while maintaining the trust and confidentiality guarantees that underpin official statistics.

Several promising directions for future research include:

Extending the protection to different statistical outputs beyond counts, such as magnitude or regression outputs.

Exploring alternative assumptions in the sequential descent algorithm for noise generation. For example, relaxing the zero bias constraint in 4 may facilitate achieving target $ϵ$ values more efficiently.

Investigating high-performance database management systems such as DuckDB²³ to accelerate processing of large-scale data in FAST tools.

Extending beyond the trusted user paradigm by developing web-based service implementations that restrict access to underlying microdata and protection parameters.

Integrating knowledge graphs, large language models, and on-the-fly confidentiality mechanisms to create novel capabilities for data access and protection.

Acknowledgements

We gratefully acknowledge our colleagues at the Australian Bureau of Statistics, including Anders Holmberg, Andrew McMahon, Kristen Stone, Kelly Chiu, Helen Teasdale, Humaira Khan, Wolfgang Hertel, Isaac Norden, Sam Lu, Marcus Robertson-Wall, Cedric Wong, Vi Vu, and Chris Mann, as well as colleagues from the Data Access and Confidentiality Methodology Unit, for their valuable comments and their support of the FAST project. We also thank the anonymous reviewers for their constructive feedback, which has contributed to improving the manuscript.

ORCID iDs

Chien-Hung Chien

Aymon Wuolanne

Ethical Approval and Informed Consent

Ethical approval and informed consent were not required for this research.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

Data is not used for this research.

Authorship statement

All persons who meet authorship criteria are listed as authors, and all authors certify that they have participated sufficiently in the work to take public responsibility for the content.

Notes

5.

We use the following example to illustrate a worst-case (maximum perturbation) attack from known support bounds to provide context on why it is important to choose a low $δ$ in the design of the noise distribution. The attackers can use logical relationships between released perturbed aggregates to substantially reduce uncertainty and, in this case, fully reconstruct a sensitive subgroup count. Notably, the probability of success of this maximum perturbation attack becomes lower when the noise distributions have a low $δ$ .

We consider a simple counting query with additive noise bounded within a known support. Let the perturbation mechanism add an integer-valued noise variable $D \in \{-3, -2, -1, 0, 1, 2, 3\}$. We consider two cases: a uniform distribution and a low-$δ$ (concentrated) distribution.
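To make the contrast concrete, the following Python sketch compares how often a released count rules out a neighbouring true count under the two distributions. It is an illustrative setup rather than the paper's own worked example: the candidate true counts (10 and 11), the concentrated form proportional to $\exp(-γ z^{2})$, and the value $γ = 1$ are all assumptions chosen for this sketch.

```python
import math


def noise_pmf(bound, gamma):
    """pmf on {-bound, ..., bound} proportional to exp(-gamma * z^2).

    gamma = 0 gives the uniform distribution.
    """
    weights = {z: math.exp(-gamma * z * z) for z in range(-bound, bound + 1)}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}


def revealing_probability(pmf, true_count, alternative):
    """Probability that the released value could NOT have come from `alternative`.

    A release true_count + z is revealing when the noise needed to reach it
    from `alternative` lies outside the known support.
    """
    support = set(pmf)
    return sum(p for z, p in pmf.items()
               if (true_count + z - alternative) not in support)


bound = 3
uniform = noise_pmf(bound, gamma=0.0)
concentrated = noise_pmf(bound, gamma=1.0)  # gamma = 1.0 is an arbitrary illustrative value

# The attacker knows the true count is either 10 or 11; suppose it is 11.
p_uniform = revealing_probability(uniform, true_count=11, alternative=10)
p_concentrated = revealing_probability(concentrated, true_count=11, alternative=10)
print(p_uniform, p_concentrated)
```

In this setup only the maximum perturbation ($+3$ applied to a true count of 11) excludes the alternative, so the attack succeeds with the probability mass placed on the edge of the support. Concentrating the distribution shrinks that mass.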

Appendix B. Proof of Proposition 1

A similar argument shows that there exists a $β$ such that $\mathrm{bias}(β, γ) > 0$ as long as there is some positive value $z_{*}$ in the support $Z$. In that case the relevant bound for $β$ is $β < -γ z_{*} - \frac{1}{z_{*}} \log\left(\frac{1}{2} D (D + 1)\right)$.

Appendix C. Proof that variances are equivalent between sequential descent and entropy maximisation

To compute the variance, we need the expected values $E [z_{i}]$ and $E [z_{i}^{2}]$. The expected value $E [z_{i}]$ is derived as:

\begin{aligned} E [z_{i}] = \sum_{z_{i} = - D}^{D} z_{i} p (z_{i}) = \sum_{z_{i} = - D}^{D} z_{i} C \exp (- γ z_{i}^{2} - β z_{i}) . \end{aligned}

Substituting the normalisation constant $C$ gives:

\begin{aligned} E [z_{i}] = \sum_{z_{i} = - D}^{D} z_{i} (\frac{\exp (- γ z_{i}^{2} - β z_{i})}{\sum_{z_{i} = - D}^{D} \exp (- γ z_{i}^{2} - β z_{i})}) . \end{aligned}

The expected value of $z_{i}^{2}$, $E [z_{i}^{2}]$, can be derived as:

\begin{aligned} E [z_{i}^{2}] = \sum_{z_{i} = - D}^{D} z_{i}^{2} p (z_{i}) = \sum_{z_{i} = - D}^{D} z_{i}^{2} C \exp (- γ z_{i}^{2} - β z_{i}) . \end{aligned}

Substituting the normalisation constant $C$ gives:

\begin{aligned} E [z_{i}^{2}] = \sum_{z_{i} = - D}^{D} z_{i}^{2} (\frac{\exp (- γ z_{i}^{2} - β z_{i})}{\sum_{z_{i} = - D}^{D} \exp (- γ z_{i}^{2} - β z_{i})}) . \end{aligned}

The variance $σ^{2}$ is given by:

\begin{aligned} σ^{2} = E [z_{i}^{2}] - (E [z_{i}])^{2} . \end{aligned}
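The three formulas above can be evaluated numerically. The following Python sketch computes $E[z_{i}]$, $E[z_{i}^{2}]$ and $σ^{2}$ for the distribution $p(z_{i}) = C \exp(-γ z_{i}^{2} - β z_{i})$ on $\{-D, \ldots, D\}$. The function names and the parameter values in the example are hypothetical and chosen only for illustration.

```python
import math


def pmf(D, gamma, beta):
    """p(z) = C * exp(-gamma * z^2 - beta * z) on {-D, ..., D}."""
    zs = list(range(-D, D + 1))
    weights = [math.exp(-gamma * z * z - beta * z) for z in zs]
    inv_C = sum(weights)  # the normalisation constant C is 1 / inv_C
    return {z: w / inv_C for z, w in zip(zs, weights)}


def moments(D, gamma, beta):
    """Return (E[z], E[z^2], variance) for the distribution above."""
    p = pmf(D, gamma, beta)
    mean = sum(z * pz for z, pz in p.items())
    second = sum(z * z * pz for z, pz in p.items())
    return mean, second, second - mean ** 2


# Illustrative parameter values only.
mean_sym, second_sym, var_sym = moments(D=3, gamma=0.5, beta=0.0)
mean_asym, second_asym, var_asym = moments(D=3, gamma=0.5, beta=0.3)
print(mean_sym, var_sym)
print(mean_asym, var_asym)
```

With $β = 0$ the distribution is symmetric, so the mean is zero and the variance equals $E[z_{i}^{2}]$. A positive $β$ shifts probability mass towards negative values of $z_{i}$.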

The probability distribution from the entropy maximisation method proposed by Fraser and Wooton⁸ is:

\begin{aligned} p (z_{i}) = Z \exp (- λ_{1} z_{i}^{2} - λ_{2} z_{i}), \end{aligned}

where $Z$ is the normalisation constant:

\begin{aligned} Z = \frac{1}{\sum_{z_{i} = - D}^{D} \exp (- λ_{1} z_{i}^{2} - λ_{2} z_{i})} . \end{aligned}
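One way to read the two forms side by side is to match the parameters as $λ_{1} = γ$ and $λ_{2} = β$. This identification is an assumption made here for illustration. Under it, the normalisation constants coincide and the two probability mass functions agree term by term, so their first and second moments, and hence their variances, are equal:

\begin{aligned} λ_{1} = γ, \; λ_{2} = β \;\Rightarrow\; Z = \frac{1}{\sum_{z_{i} = - D}^{D} \exp (- γ z_{i}^{2} - β z_{i})} = C \;\Rightarrow\; Z \exp (- λ_{1} z_{i}^{2} - λ_{2} z_{i}) = C \exp (- γ z_{i}^{2} - β z_{i}) . \end{aligned}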

Appendix D. Sequential descent algorithm in parallel computation

Appendix E. Record key generation based on unique identifiers

References

1. Parker. The DataLab of the Australian Bureau of Statistics. Aust Econ Rev 2017; 50: 478–483.

2. Productivity Commission. Data availability and use, 2017. https://www.pc.gov.au/inquiries/completed/data-access/report/data-access.pdf.

3. Jaynes. Information theory and statistical mechanics. Phys Rev 1957; 106: 620.

4. Leaver. Implementing a method for automatically protecting user-defined census tables. Joint ECE/Eurostat Workses Stat Confident Bilbao (December 2009) 2009; WP.22: 1–8.

5. Dwork, Roth. The algorithmic foundations of differential privacy. Foundat Trend Theoret Comput Sci 2014; 9: 211–407.

6. Dove, Ntoumos, Spicer. Protecting census 2021 origin-destination data using a combination of cell-key perturbation and suppression. In: International conference on privacy in statistical databases, 2018, pp.43–55. Springer.

7. de Vries Marieke, de Wolf Peter-Paul, Golmajer Manca, et al. An overview of used methods to protect the European Census 2021 tables. Joint UNECE/Eurostat Work Ses Stat Data Confid, 26–28 September 2023 Wiesbaden, Germany 2023: 1–15.

8. Fraser, Wooton. A proposed method for confidentialising tabular output to protect against differencing. Monograp Official Stat: Work Ses Stat Data Confident 2005; WP. 35: 1–6.

9. Marley, Leaver. A method for confidentialising user-defined tables: statistical properties and a risk-utility analysis. In: Proceedings of the 58th congress of the international statistical institute, ISI, 2011, pp.21–26.

10. Thompson, Broadfoot, Elazar. Methodology for the automatic confidentialisation of statistical outputs from remote servers at the Australian Bureau of Statistics. Joint UNECE/Eurostat work Ses Stat Data Confident 2013; Working Paper: 1–37.

11. O’Keefe, Ayre, Lucie, et al. Perturbed robust linear estimating equations for confidentiality protection in remote analysis. Stat Comput 2017; 27: 775–787.

12. Khan, O’Keefe. Disclosure risk reduction for generalized linear model output in a remote analysis system. Data Knowl Eng 2017; 111: 90–102.

13. Bailie, Chien. ABS perturbation methodology through the lens of differential privacy. Joint UNECE/Eurostat Work Ses Stat Data Confident The Hague, Netherlands. 2019: 1–13.

14. Sadeghi, Chien. On the connection between the ABS perturbation methodology and differential privacy. J Privacy Confident 2024; 14: 1–19.

15. Templ, Todorov. The software environment R for official statistics and survey methodology. Aust J Stat 2016; 45: 97–124.

16. ten Bosch, de Jonge. Access to official statistics from R: an overview. In: The R Project - The Use of R in Official Statistics - uRos2023. 2023.

17. Desai, Ritchie, Welpton. Five safes: designing data access for research. Econom Work Paper Ser 2016; 1601: 28.

18. Jones. Statistical disclosure control for Caribbean census tables: A proposal to expand the availability of disaggregated census data. Stud Persp–ECLAC Subreg Headquart The Caribbean 2021.

19. Goldwasser, Bellare. Lecture notes on cryptography, 2008. https://cseweb.ucsd.edu/mihir/papers/gb.pdf. Summer course “Cryptography and computer security” at MIT, pp.138–148. (1996, accessed: 23 October 2025).

20. Menezes, van Oorschot, Vanstone. Handbook of Applied Cryptography. Boca Raton: CRC Press, 1996.

21. ABS. Confidentiality and relative standard error. https://www.abs.gov.au/statistics/microdata-tablebuilder/tablebuilder/confidentiality-and-relative-standard-error, (2021, accessed: 20 October 2025).

22. US Census Bureau. Census Bureau adopts cutting-edge disclosure avoidance technique for 2020 census. https://www.census.gov/newsroom/blogs/random-samplings/2019/02/census_bureau_adopts.html, (2019, accessed: 20 October 2025).

23. Raasveldt, Mühleisen. DuckDB: an embeddable analytical database. In: Proceedings of the 2019 international conference on management of data. 2019, pp.1981–1984.

24. Asghar, Kaafar. Averaging attacks on bounded noise-based disclosure control algorithms. Proc Privacy Enhan Technol 2020; 2: 358–378.


3.1. Perturbation with asymmetric supports under $(ϵ, δ)$ -DP