LLM-assisted record linkage: A framework for official statistics

Abstract

National statistical offices (NSOs) increasingly rely on record linkage to link census data, administrative sources, and survey responses. However, conventional string-similarity methods often struggle with free-text fields. To address these challenges, this paper systematically benchmarks modern open-source large language models (LLMs) against classic string-based comparators for record linkage. Building on these findings, this paper introduces a hybrid approach that retains well-established probabilistic frameworks yet integrates an LLM-based classifier for ambiguous record pairs. A Bayesian update is applied to combine the LLM's output with the prior probability, with the aim of reducing the burden on manual clerical review. The experiments show that selectively deploying open-source LLMs for the most uncertain pairs can significantly reduce manual effort by refining decisions through Bayesian updating. As NSOs must ensure transparency, explainability, and adherence to official statistical standards, this paper systematically addresses these concerns while evaluating the potential of LLMs for record linkage. Practical considerations including secure on-premises deployment, computational cost, human-in-the-loop review, and calibration are discussed to support responsible adoption in official statistics.

Keywords

uncertainty quantification, data privacy, large language model, national statistical offices, record linkage

1 Introduction

Record linkage (also called entity resolution) is the process of determining whether two or more records refer to the same underlying real-world entity, often in the absence of unique identifiers, by combining evidence across multiple fields. For national statistical offices (NSOs), linkage underpins the production of integrated statistics: NSOs routinely link census, survey, and administrative records to improve data quality, reduce respondent burden, and lower data-collection costs. Growing reliance on administrative registers means more free-text fields and more datasets that lack unique identifiers, so accurate record linkage is more critical than ever. The Fellegi–Sunter (FS) framework¹ is a long-standing, trusted method in many NSOs. In this framework, comparison outcomes across fields are converted into weights using m- and u-probabilities, where $m$ is the probability of observing a comparison outcome given a true match and $u$ is the probability of observing that outcome given a true non-match. These weights can be combined to produce an overall match score or calibrated match probability for each record pair. Comparators such as Jaro–Winkler² produce a similarity score that is converted into a weight using these precomputed probabilities. However, string similarity measures capture only surface-level character similarity and cannot easily interpret acronyms, abbreviations, or variations in free-text fields.

Recently, LLMs have outperformed the average human on several widely recognized standardized assessments. GPT-4 surpassed average human performance on the Scholastic Assessment Test (SAT), the Law School Admission Test (LSAT), and math competitions, scoring 95% accuracy on the SAT math exam and 92.5% accuracy on the English test of the Chinese national college entrance exam.³,⁴ Despite this impressive performance (and considerable hype), LLMs are resource-intensive and can produce incorrect outputs. Furthermore, LLMs raise data privacy and safety concerns, especially for NSOs handling sensitive data.

This paper takes a pragmatic view, treating LLMs as tools that can be valuable in certain scenarios but may be unfeasible or error-prone in others. Table 1 illustrates examples where standard string-similarity scores (fuzzy matching) can struggle, while an LLM-based classifier captures the underlying semantic match more effectively. In the LLM column, ‘1’ indicates that the model considers the two entries to be the same entity, whereas ‘0’ indicates they refer to different entities. The similarity scores in Table 1 are descriptive (not calibrated match probabilities) and are intended to build intuition rather than serve as a formal evaluation; calibrated match probabilities and threshold-based routing are introduced in Section 2.

Table 1.
Illustrative examples: abbreviations and free-text variants.

Dataset A	Dataset B	Levenshtein similarity	Jaro–Winkler similarity	LLM decision
GC	Government of Canada	0.18	0.51	1
RCMP	Royal Canadian Mounted Police	0.24	0.61	1
CRA	Canada Revenue Agency	0.25	0.58	1
RCM	Royal Canadian Mint	0.27	0.59	1
RCMP	RCM	0.85	0.94	0

Section 1.1 reviews how LLMs work and where they are most useful, followed by data privacy and safety concerns in Section 1.2. Throughout this paper, modern open-source LLMs that can be deployed in a secure NSO environment are used. Section 2 proposes a hybrid solution (Figure 1) that incorporates estimates from an LLM-based classifier for the most ambiguous record pairs while explicitly accounting for the probability that LLM outputs can be wrong. The first layer uses Fellegi–Sunter to compute a baseline probability of match/non-match for every record pair. For ambiguous cases, an LLM-based classifier is invoked. The LLM's outputs are fused with the Fellegi–Sunter prior probability via Bayesian updating. Treating the LLM's classification as a new piece of (imperfect) evidence reduces manual clerical review while maintaining transparency and statistical rigour.

1.1 Introduction to large language models (LLMs)

LLMs are neural networks with hundreds of millions to billions of parameters, trained to predict the next token in a sequence given the preceding context. They rely on the transformer architecture, which uses ‘self-attention’ to compute discrete probability distributions over possible tokens. During pre-training, an LLM ingests massive text corpora such as web pages, books, and news articles and learns statistical patterns of language. This process adjusts the model's internal weights to maximize the probability of correct next-token predictions. Modern LLMs often undergo supervised fine-tuning and reinforcement learning from human feedback. Due to the scale of training data (often terabytes of text) and model size, LLMs like GPT-3 (175 billion parameters) or Google's PaLM (540 billion) can capture subtle linguistic patterns and factual associations.⁵ In practical terms, an LLM can take a prompt (some input text) and continue it or respond in a contextually appropriate way. Modern LLMs have proven to be versatile: the same model can answer questions, translate languages, summarize documents, or engage in dialogue, simply by being given different prompts.⁶ This is a key difference from traditional statistical machine learning models which are typically trained to perform just one specialized task.

Figure 1.

Overview of the hybrid record linkage framework integrating probabilistic matching with LLM-assisted classification for uncertain cases.

Under the hood, an LLM represents text as sequences of tokens (words or subword pieces) and produces a conditional probability distribution for the next token at each step, given the immediately preceding text. For instance, when asked, “Do these two records refer to the same entity?” an LLM assigns logits to “Yes” vs. “No”. While this ability to generate a probability distribution is powerful, it must be approached with caution and rigour. LLMs are not infallible oracles, and they can produce confident yet incorrect outcomes. Consequently, these predictions must be calibrated and verified to account for potential errors. This is especially important in the context of NSOs, where strict data governance constraints are paramount. The Bayesian framework in Section 2 addresses precisely this need.
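As a minimal illustration of the step from logits to a probability, the sketch below applies a two-way softmax to logits for the "Yes" and "No" tokens. The function name and the logit values are assumptions made for illustration; they are not taken from the experiments, and the framework in Section 2 works with the resulting binary decision rather than with this raw score.

```python
import math

def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax: probability mass on "Yes" relative to "No"."""
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

# Hypothetical logits, for illustration only.
p_yes = yes_probability(2.0, 0.5)
decision = 1 if p_yes >= 0.5 else 0  # binary decision derived from the score
```

Such a score reflects the model's own confidence and is not, by itself, a calibrated match probability.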

1.2 Transparency, explainability and official statistical standards

National statistical offices operate under strict mandates of accuracy, reproducibility, and confidentiality. Although many people associate LLMs with cloud-based chatbots that could pose data-security risks, these concerns are mitigated by using open-source LLMs that run entirely on-premises. Conceptually, an open-source LLM is simply a large file of parameter matrices, and deploying it is similar to installing an open-source software library. All of the model's matrix operations and data flows therefore remain under the NSO's direct control, no external servers are involved, and no confidential information leaves the agency's secure environment. This approach not only preserves confidentiality but also enables precise audit trails by logging every classification request. Outputs are repeatable because the model's parameters and random seeds remain fixed within the agency's secure environment. (Section 3 provides the specific model variants used in the experiments.)

2 Methodology

This paper presents a multi-step approach that integrates a traditional probabilistic framework (Fellegi–Sunter) with selective LLM usage. First, FS (or any linkage method that outputs calibrated match probabilities) produces a prior match probability, $P_{prior}(x, y)$, for each record pair based on string-comparison features (e.g., Jaro–Winkler similarity).

Two thresholds are then specified: a lower threshold $T_{L}$ and an upper threshold $T_{U}$.

Record pairs with a probability $P_{prior}(x, y)$ below $T_{L}$ can be classified as non-matches, while those above $T_{U}$ can be deemed matches. Any pair with a probability in $(T_{L}, T_{U})$ is flagged as ambiguous (the gray zone) and normally requires human review. Generally, in practice, a small value (e.g., 0.35) is chosen for $T_{L}$, and a large one (e.g., 0.85) for $T_{U}$. In this framework, only gray-zone pairs are routed to the LLM. Calibrated probabilities below $T_{L}$ represent high-confidence non-matches; routing those pairs would add computational cost with limited expected benefit and could increase the risk of false matches.
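The routing rule can be sketched in a few lines of Python. The function name `route` and the handling of values exactly on a threshold are assumptions: here a probability equal to $T_{U}$ counts as a match and one equal to $T_{L}$ stays in the gray zone, which follows the bands reported in Table 4 (Low < 0.35, High ≥ 0.85).

```python
def route(p_prior: float, t_low: float = 0.35, t_high: float = 0.85) -> str:
    """Assign a pair to 'non-match', 'gray-zone', or 'match' by its prior."""
    if not 0.0 <= p_prior <= 1.0:
        raise ValueError("p_prior must be a probability in [0, 1]")
    if p_prior < t_low:
        return "non-match"
    if p_prior >= t_high:
        return "match"
    return "gray-zone"  # only these pairs are sent to the LLM

# Illustrative prior probabilities (not study data).
priors = [0.10, 0.45, 0.60, 0.92]
routes = [route(p) for p in priors]
```

Keeping the thresholds as parameters makes it straightforward to examine how the size of the gray zone, and hence the LLM workload, changes with $T_{L}$ and $T_{U}$.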

2.1 Mathematical framework

Let $H \in \{0, 1\}$ be the latent variable indicating whether two records $(x, y)$ refer to the same underlying entity. $H = 1$ (match) means they refer to the same entity and $H = 0$ (non-match) means they do not.

2.1.1 Baseline match probability

String-comparison features (e.g., Levenshtein distance⁷ and Jaro–Winkler similarity) are used within the Fellegi–Sunter (FS) framework to compute an initial match probability:

\begin{aligned} P_{prior} (x, y) = P (H = 1 | x, y) \end{aligned}

(1)

Although FS is used here, any linkage method that outputs calibrated match probabilities could be substituted. The value $P_{prior}(x, y)$ captures the initial belief that $(x, y)$ refers to the same entity, based on the chosen string comparison weights.
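To make the baseline concrete, the following sketch shows one standard way an FS-style match probability can be obtained from field-level agreement patterns under the usual conditional-independence assumption. The function name, the binary agree/disagree simplification, the m- and u-probabilities, and the prior match rate are all hypothetical choices for illustration, not the configuration used in this paper; similarity scores such as Jaro–Winkler would typically be binned into agreement levels first.

```python
import math

def fs_match_probability(agreements, m, u, prior_match_rate):
    """Match probability from per-field agreement under conditional independence.

    agreements: list of bools, one per compared field
    m[i]: P(field i agrees | match); u[i]: P(field i agrees | non-match)
    prior_match_rate: assumed share of true matches among candidate pairs
    """
    # Start from the prior log-odds and add one log-likelihood ratio per field.
    log_odds = math.log(prior_match_rate / (1.0 - prior_match_rate))
    for agrees, m_i, u_i in zip(agreements, m, u):
        if agrees:
            log_odds += math.log(m_i / u_i)
        else:
            log_odds += math.log((1.0 - m_i) / (1.0 - u_i))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical parameters for two fields (e.g., name and postal code).
m = [0.90, 0.95]
u = [0.10, 0.05]
p_both = fs_match_probability([True, True], m, u, prior_match_rate=0.10)
p_one = fs_match_probability([True, False], m, u, prior_match_rate=0.10)
```

With these illustrative values, agreement on both fields yields a probability of about 0.95 and agreement on the first field only yields about 0.05.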

2.1.2 Identifying high-uncertainty pairs

Any pair whose probability $P_{prior}(x, y)$ falls within $(T_{L}, T_{U})$ is flagged as ambiguous (the gray zone) and routed to the LLM. This approach is inspired by active learning, which prioritizes “informative” cases.⁸ By focusing only on these uncertain pairs, computational costs are controlled and the LLM's impact is maximized.

2.1.3 LLM output

For pairs selected for LLM processing, open-source Llama 3 models are used as the LLM classifier.⁶ The model is prompted with ‘Do these two records refer to the same entity? Answer YES or NO’ and the response is mapped to 1 or 0. Let $D \in \{0, 1\}$ denote the model's binary decision on whether $(x, y)$ represents the same entity. A non-symmetric error model is used:

\begin{aligned} θ_{1} & = P (D = 1 | H = 1) \quad \text{(true positive rate or TPR)} \\ θ_{0} & = P (D = 0 | H = 0) \quad \text{(true negative rate or TNR)} \end{aligned}

Hence, if $H = 1$ , there is a $1 - θ_{1}$ chance the LLM incorrectly outputs 0; likewise, when $H = 0$ , there is a $1 - θ_{0}$ chance the LLM incorrectly outputs 1.
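The prompting and mapping step might look like the sketch below. The question wording comes from the text above; the "Record A / Record B" layout, the helper names, and the choice to return `None` for replies that are neither YES nor NO are assumptions made for illustration. No model call is shown, since the inference interface depends on the local deployment.

```python
from typing import Optional

PROMPT_TEMPLATE = (
    "Record A: {a}\n"
    "Record B: {b}\n"
    "Do these two records refer to the same entity? Answer YES or NO."
)

def build_prompt(record_a: str, record_b: str) -> str:
    """Fill the (hypothetical) template with the two records to compare."""
    return PROMPT_TEMPLATE.format(a=record_a, b=record_b)

def parse_decision(response: str) -> Optional[int]:
    """Map a free-text reply to D: 1 for YES, 0 for NO, None if unclear."""
    # Normalize case and strip simple punctuation before reading the first word.
    words = response.strip().upper().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "YES":
        return 1
    if words[0] == "NO":
        return 0
    return None  # unparseable replies can be left for clerical review

prompt = build_prompt("RCMP", "Royal Canadian Mounted Police")
d = parse_decision("Yes.")
```

Returning `None` rather than guessing keeps malformed replies out of the Bayesian update.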

2.1.4 Bayesian probability updating

After observing the LLM's binary output $D$ for the pair $(x, y)$, a posterior match probability is computed via Bayes’ rule.⁹

Case 1: $D = 1$

\begin{aligned} P (H = 1 | D = 1, x, y) \\ = \frac{P (D = 1 | H = 1, x, y) P (H = 1 | x, y)}{P (D = 1 | x, y)} \end{aligned}

(2)

Since the LLM's error rates are assumed not to depend on the specific pair once $H$ is given,

\begin{aligned} P (D = 1 | H = 1, x, y) = P (D = 1 | H = 1) = θ_{1} \end{aligned}

and

\begin{aligned} P (H = 1 | x, y) = P_{prior} (x, y), \end{aligned}

the marginal probability is:

\begin{aligned} P (D = 1 | x, y) = θ_{1} P_{prior} (x, y) + (1 - θ_{0}) [1 - P_{prior} (x, y)] \end{aligned}

Putting these together:

\begin{aligned} P (H = 1 | D = 1, x, y) \\ = \frac{θ_{1} \cdot P_{prior} (x, y)}{θ_{1} \cdot P_{prior} (x, y) + (1 - θ_{0}) [1 - P_{prior} (x, y)]} \end{aligned}

(3)

Case 2: $D = 0$

The marginal probability is:

\begin{aligned} P (D = 0 | x, y) = (1 - θ_{1}) P_{prior} (x, y) + θ_{0} [1 - P_{prior} (x, y)] \end{aligned}

The posterior probability is:

\begin{aligned} P (H = 1 | D = 0, x, y) \\ = \frac{(1 - θ_{1}) P_{prior} (x, y)}{(1 - θ_{1}) P_{prior} (x, y) + θ_{0} [1 - P_{prior} (x, y)]} \end{aligned}

(4)
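Equations (3) and (4) translate directly into a short function. The sketch below is one possible implementation; the function name and the example rates of 0.90 are assumptions chosen for illustration rather than values from the experiments.

```python
def posterior_match_probability(p_prior, d, theta1, theta0):
    """Update P(H = 1 | x, y) after observing the LLM decision d.

    theta1 = P(D = 1 | H = 1) and theta0 = P(D = 0 | H = 0).
    """
    if d == 1:
        # Equation (3): the LLM says "match".
        num = theta1 * p_prior
        den = num + (1.0 - theta0) * (1.0 - p_prior)
    elif d == 0:
        # Equation (4): the LLM says "non-match".
        num = (1.0 - theta1) * p_prior
        den = num + theta0 * (1.0 - p_prior)
    else:
        raise ValueError("d must be 0 or 1")
    return num / den

# Illustrative values: a prior of 0.60 and hypothetical rates of 0.90.
p_yes = posterior_match_probability(0.60, 1, theta1=0.90, theta0=0.90)
p_no = posterior_match_probability(0.60, 0, theta1=0.90, theta0=0.90)
```

With these inputs a YES decision raises the probability to roughly 0.93 and a NO decision lowers it to roughly 0.14, so the same prior can leave the gray zone in either direction.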

Here, $θ_{1}$ and $θ_{0}$ represent the LLM's accuracy on true matches and true non-matches, respectively; they are determined empirically in the next section by testing the model on a labeled dataset with known ground truth.

In principle, these parameters can also be set based on prior beliefs about the model's performance. However, their values depend on factors such as the LLM's size, the complexity of the data, the quality of prompts, and whether the model was fine-tuned.

It is therefore recommended to estimate $θ_{1}$ and $θ_{0}$ by testing the model on a small dataset with known ground-truth labels. In this way, LLM outputs are treated as imperfect but informative evidence and are combined with the prior probability $P_{prior}(x, y)$ to compute an updated posterior match probability.
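Estimating the two rates from a labeled sample amounts to counting outcomes. The following is a minimal sketch; the function name and the eight-pair sample are hypothetical and are not the study data.

```python
def estimate_rates(truth, decisions):
    """Estimate theta1 (TPR) and theta0 (TNR) from labeled pairs.

    truth: iterable of 0/1 ground-truth labels H
    decisions: iterable of 0/1 LLM decisions D, in the same order
    """
    tp = fn = tn = fp = 0
    for h, d in zip(truth, decisions):
        if h == 1 and d == 1:
            tp += 1
        elif h == 1:
            fn += 1
        elif d == 0:
            tn += 1
        else:
            fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one labeled match and one non-match")
    return tp / (tp + fn), tn / (tn + fp)

# Tiny hypothetical labeled sample (not the study data).
truth = [1, 1, 1, 1, 0, 0, 0, 0]
decisions = [1, 1, 1, 0, 0, 0, 0, 1]
theta1, theta0 = estimate_rates(truth, decisions)
```

In a real application the labeled sample would need to be large enough, and similar enough to the gray-zone pairs, for these estimates to be stable.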

2.2 Human-in-the-loop review

Certain pairs will remain inherently ambiguous, particularly where minimal contextual data exist or where text entries are outright contradictory. In official statistics, human-in-the-loop (HITL) workflows provide a crucial safety net. The proposed framework does not aim to eliminate or replace human judgment; instead, it aims to reduce manual workload so that human experts can focus on the most complex edge cases. The LLM serves as an automated triage step prior to human intervention, identifying cases that can be resolved automatically and flagging residual uncertainty. The remaining hardest cases are then escalated to clerical review for final quality control. This ensures accuracy, maintains transparency, and aligns with longstanding practices within national statistical offices.

3 Experimental results and discussion

3.1 Dataset

In official statistics, record linkage typically involves large datasets with no definitive “ground truth,” requiring agencies to rely on threshold-based decisions and selective clerical review. However, to rigorously evaluate new linkage methods, researchers often create curated datasets with verified matches and non-matches, enabling precise measurements of precision, recall, and related metrics. Following that approach, a set of 1000 real pairs of Canadian business records was compiled, balanced to include 500 confirmed matches and 500 verified non-matches. To ensure the non-matches were challenging, for each real business, the single most similar record that was confirmed as a distinct entity was identified. Although this controlled design does not capture every operational constraint that national statistical offices face, it provides clear empirical evidence of how an LLM-based approach can enhance record linkage on messy, free-text data, and suggests a viable route for selective large-scale deployment with minimal human oversight.

The experiments in this paper use the open-weight Llama 3 family of models (1B, 3B, and 8B parameter variants) deployed on-premises within a secure environment. Inference is performed locally so that record data do not leave the NSO network, and deterministic decoding settings (e.g., temperature 0) can be used to support repeatable outputs for audit purposes. The availability of open weights enables transparency, inspection, and local adaptation (e.g., fine-tuning) within the agency's environment. These models have also undergone assessment by Statistics Canada's IT security team, supporting their suitability for this application.

3.2 Results

Having established a balanced dataset, this section evaluates how different matching techniques, ranging from simple string comparators to LLMs, perform on this set. Table 2 shows that an 8B-parameter Llama model achieves an F1-score of 0.87, higher than Jaro–Winkler and Levenshtein. However, this improvement comes at a clear computational cost: on CPU, the 8B Llama model required around 179.5 min for 1000 comparisons, compared to just seconds for basic string comparators. Using a GPU cuts LLM inference time by roughly 70%, but it remains substantial compared to classic similarity measures.

Table 2.
Performance and approximate runtimes for 1000 pairs.

Method	Precision	Recall	TNR	F1-score	Approx. runtime (CPU)	Approx. runtime (GPU)
Jaro–Winkler	0.78	0.69	0.81	0.73	< 1 s	< 1 s
Levenshtein	0.68	0.64	0.70	0.66	< 1 s	< 1 s
LLM (1B params)	0.75	0.76	0.75	0.76	∼31.4 min	∼9.2 min
LLM (3B params)	0.79	0.82	0.78	0.80	∼61.7 min	∼17.6 min
LLM (8B params)	0.88	0.86	0.88	0.87	∼179.5 min	∼52.3 min

Performance is reported using precision, recall, true negative rate (TNR), and the F1-score. Precision and recall summarize match detection quality, TNR summarizes non-match detection quality, and the F1-score balances precision and recall in a single metric. Runtime is reported for inference on a central processing unit (CPU) and on a graphics processing unit (GPU), reflecting common infrastructure choices for NSOs. Table 2 highlights two major observations. First, LLMs demonstrate higher F1-scores than classic string metrics, underscoring their ability to recognize nuanced textual patterns (e.g., acronyms vs. full names). Second, computational costs grow notably with model size, even when using a GPU. This is an important factor for large-scale deployments in official statistics.
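For reference, the four reported metrics can be computed from confusion-matrix counts as sketched below. The function name and the counts are hypothetical (a balanced set of 500 matches and 500 non-matches is assumed purely for illustration). Recall and TNR are the same quantities as $θ_{1}$ and $θ_{0}$ in the error model of Section 2.1.3.

```python
def linkage_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, true negative rate, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    tnr = tn / (tn + fp)             # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "tnr": tnr, "f1": f1}

# Hypothetical counts: 500 true matches and 500 true non-matches.
metrics = linkage_metrics(tp=450, fp=100, tn=400, fn=50)
```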

During prompt development, zero-shot, few-shot, and chain-of-thought prompting strategies¹⁰,¹¹ were explored to encourage consistent entity-matching decisions. In practice, the primary objective of these experiments was to stabilize the output format (e.g., to avoid inconsistent responses) rather than to report a separate prompting ablation study. Table 2 reports results using the final prompt configuration.

Although LLM inference is slower than classic string comparators, the relevant comparison for operational use is often against manual clerical review rather than against GPU runtime. Clerical review requires trained staff time and consistent procedures, and it becomes costly when large fractions of pairs fall into the gray zone. By restricting LLM use to the ambiguous subset, the approach aims to trade additional compute for reduced human review effort, while still preserving auditability and reproducibility.¹²

Before proceeding, Table 3 demonstrates the Bayesian updating step used in the proposed framework. Six record pairs with different prior probabilities $P_{prior}$, all within the clerical review region, are considered. After processing each pair with the LLM and obtaining $D \in \{0, 1\}$, the posterior $P_{post}(x, y)$ is calculated using the formulas in Section 2. Finally, the thresholds $T_{L} = 0.35$ and $T_{U} = 0.85$ are applied to determine the classification outcome. The parameters were set to $θ_{1} = 0.86$ and $θ_{0} = 0.88$ based on the model's performance on the 1000 labeled pairs (the true positive and true negative rates of the 8B model in Table 2). In practice this can be done with any smaller labeled subset or targeted sampling strategy.

Table 3.

Example of Bayesian updating for six gray-zone pairs ( $T_{L} = 0.35$ , $T_{U} = 0.85$ ).

Pair	$P_{prior}(x, y)$	LLM decision $D$	Posterior	Outcome
1	0.45	1	0.85	Auto match
2	0.52	0	0.15	Auto non-match
3	0.60	1	0.91	Auto match
4	0.55	1	0.90	Auto match
5	0.37	1	0.81	Manual review
6	0.75	0	0.32	Auto non-match
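As a check on two of the rows, substituting $θ_{1} = 0.86$ and $θ_{0} = 0.88$ into Equation (3) for pair 1 ($P_{prior} = 0.45$, $D = 1$) and into Equation (4) for pair 6 ($P_{prior} = 0.75$, $D = 0$) gives the following worked arithmetic:

```latex
\begin{aligned}
\text{Pair 1:}\quad P(H = 1 \mid D = 1, x, y)
  &= \frac{0.86 \times 0.45}{0.86 \times 0.45 + (1 - 0.88)(1 - 0.45)}
   = \frac{0.387}{0.387 + 0.066} \approx 0.85 \\
\text{Pair 6:}\quad P(H = 1 \mid D = 0, x, y)
  &= \frac{(1 - 0.86) \times 0.75}{(1 - 0.86) \times 0.75 + 0.88 \times (1 - 0.75)}
   = \frac{0.105}{0.105 + 0.220} \approx 0.32
\end{aligned}
```

The first value reaches $T_{U} = 0.85$ and the second falls below $T_{L} = 0.35$, which is why both pairs are resolved automatically.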

There are a few things to notice in the above table. First, in the Bayesian framework, when two probabilities are combined, the resulting posterior depends on how strongly each input leans toward or away from a match. When both probabilities fall below 0.5, they reinforce a “non-match” conclusion and drive the posterior even lower. Conversely, if both exceed 0.5, they reinforce a “match” verdict, pushing the posterior higher. In other words, whenever two independent pieces of evidence point in the same direction, whether match or non-match, Bayesian updating strengthens the conclusion by combining independent evidence, yielding higher confidence than either source alone. This principle lies at the core of the Bayesian framework. Note that not all cases are resolved. For instance, pair 5 remains uncertain because its initial probability was 0.37 yet the LLM output was 1 (match). Although the LLM decision was correct here, it might have been wrong. In that case, the updating mechanism provides a safeguard against misclassification. Next, the distribution of match probabilities is compared before and after the LLM-based Bayesian update.

Table 4 shows how Bayesian updating pushes many pairs out of the uncertain 0.35–0.85 range. Initially, 44.8% of pairs had probabilities in the gray zone. After incorporating LLM outputs, that proportion decreased to 21.5%. In practice, this means fewer borderline decisions for human adjudicators to handle, ultimately streamlining the linkage workflow. These findings show that selectively applying LLM outputs moves many borderline cases into clearer categories, reducing manual reviews.

Table 4.

Distribution of match probabilities before vs. after Bayesian updating.

Probability range	% of pairs (prior)	% of pairs (posterior)	F1 (prior)	F1 (posterior)
Low (< 0.35)	24.9%	37.2%	0.77	0.88
Uncertain (0.35–0.85)	44.8%	21.5%	0.60	0.80
High (≥ 0.85)	30.3%	41.3%	0.78	0.88
Total	100.0%	100.0%	0.74	0.87

A more granular analysis of the score distribution shift revealed a clear pattern across the gray-zone sub-bands (Table 5). The hardest cases, those with prior probabilities closest to 0.50, had a higher error rate. These cases are fundamentally difficult to resolve and often lack sufficient information for a definitive decision. They were deliberately included, because in realistic modeling situations the data often fundamentally lack predictive power (epistemic uncertainty).

In the current study, only gray-zone pairs are routed to the LLM because these are the cases where the baseline model is least certain and where LLM inference yields the highest expected value per unit of computation. Pairs near $T_{L}$ or $T_{U}$ (yet still in the gray zone) had lower error rates than mid-band pairs and were often pushed decisively below $T_{L}$ or above $T_{U}$ by the LLM's semantic understanding, eliminating clerical review. These pairs contained more predictive information, which the LLM could leverage (at the cost of greater computation) but which was not fully captured by standard string-matching scores. The results suggest that gray-zone cases are not uniformly resolved by the LLM. In this dataset, yield was greater near the edges because the LLM was more accurate there, whereas the mid-band had higher error rates and therefore required more manual review. This triage acknowledges that LLMs provide a valid but imperfect signal and directs clerical effort to the residual hard cases that remain after Bayesian updating.

3.3 Strategies for improvement

Techniques such as quantization and batching can significantly reduce inference time and costs. Fine-tuning was considered as an extension, but this study focuses on zero-shot/few-shot prompting and Bayesian updating; fine-tuning is left to future work. Other directions for future work include experimenting with calibration techniques and using the LLM's internal probability scores, rather than its binary decision, in the Bayesian update. Methods such as Platt scaling or isotonic regression could be used to calibrate the LLM's probabilities. This could further improve the overall quality of LLM outputs and reduce the fraction of pairs sent to manual review.

Table 5.
LLM performance on sub-bands within the gray zone ( $T_{L}$ , $T_{U}$ ).

Band # pairs LLM F1 # auto resolved # manual

0.35–0.45 91 0.78 65 26

0.45–0.55 130 0.58 30 100

0.55–0.65 90 0.60 40 50

0.65–0.75 75 0.66 50 25

0.75–0.85 62 0.80 48 14

Total 448 — 233 215

Band	# pairs	LLM F1	# auto resolved	# manual
0.35–0.45	91	0.78	65	26
0.45–0.55	130	0.58	30	100
0.55–0.65	90	0.60	40	50
0.65–0.75	75	0.66	50	25
0.75–0.85	62	0.80	48	14
Total	448	—	233	215

However, even the best automated methods will encounter ambiguous cases. Enhanced clerical review tools that highlight areas of uncertainty or conflicting evidence could help experts make faster and more accurate final judgments. Additionally, labeled data from human experts can be used to fine-tune the LLM, creating a reinforcing feedback loop (a “data flywheel”) that could be a valuable long-term investment for NSOs.

Next steps involve conducting larger trials and rolling out a prototype on tens of thousands, or even millions, of records to identify bottlenecks and integration issues not evident in smaller datasets. This includes securing on-premises infrastructure, ensuring data governance, and integrating the approach seamlessly into existing record linkage pipelines. By systematically exploring these paths to improvement—particularly in cost management, privacy compliance, and model calibration—statistical agencies can more confidently incorporate LLMs into their existing linkage workflows.

4 Conclusion

This paper has demonstrated how LLMs, when used selectively within a traditional probabilistic record linkage framework, can substantially reduce the clerical review burden for ambiguous record pairs and can improve overall linkage quality where clerical review is not feasible. In these experiments, Bayesian updating that combines Fellegi–Sunter scores with LLM-based classification produced posterior match probabilities with stronger performance than string-only comparators, most notably in difficult free-text scenarios.

The aim is not to replace long-standing methodologies but to augment them with the nuanced textual understanding that modern language models offer. Through targeted prompting, selective invocation, and improved calibration, LLM-assisted record linkage can become a more efficient, accurate, and transparent approach for national statistical offices. LLMs are tools that rely on statistical patterns in training data rather than “knowing” ground truth and must be integrated with care, particularly under tight data governance constraints.

These findings suggest that the hybrid approach may improve data quality in official statistics while reducing manual burden. By continuing to refine model calibration, optimize runtime, and develop better human-in-the-loop processes, agencies can chart a realistic path forward that preserves the proven strengths of probabilistic linkage while harnessing the ever-improving capabilities of large language models.

Acknowledgements

The author thanks colleagues at Statistics Canada for helpful discussions on record linkage practice and for feedback on secure deployment considerations. Any errors or omissions remain the author's responsibility.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Fellegi, Sunter. A theory for record linkage. J Am Stat Assoc 1969; 64: 1183–1210.
2. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc 1989; 84: 414–420.
3. Zhong, Cui, Guo, et al. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint. 2023.
4. Katz, Bommarito, Gao, et al. GPT-4 passes the bar exam. Philosophical Trans Royal Soc A 2024; 382: 20230254.
5. Brown, Mann, Ryder, et al. Language models are few-shot learners. Adv Neural Inf Process Syst 2020; 33: 1877–1901.
6. Touvron, Lavril, Izacard, et al. LLaMA: Open and efficient foundation language models. arXiv preprint. 2023.
7. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 1966; 10: 707–710.
8. Sarawagi, Bhamidipaty. Interactive deduplication using active learning. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), 2002, pp. 269–278.
9. Gelman, Carlin, Stern, et al. Bayesian data analysis. 3rd ed. Boca Raton (FL): Chapman & Hall/CRC, 2013.
10. Wei, Wang, Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint. 2022.
11. Kojima, Reid, et al. Large language models are zero-shot reasoners. arXiv preprint. 2022.
12. Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Berlin: Springer, 2012.