Privacy and security.

Privacy preserving record linkage (PPRL).

Last updated January 30, 2026

1. Beginner topics:

  • In modern database construction, there are many scenarios in which multiple institutions work together to provide as much information as possible for each data point.  For example, a patient may receive a blood test at one hospital system and imaging data at another hospital system; similar issues may arise in a variety of other fields as well, including government, finance, and education.  In order to build a database that can utilize both the blood test data and imaging data in a more complete manner, we need methods that can identify when data from these multiple institutions belongs to the same individual or entity.  This is called record linkage.   

    Traditionally, this can be done simply by linking certain data characteristics that, when combined, are likely to be unique.  Examples include a name, date of birth, or address, among others.  However, there are more complicated scenarios that can arise, such as if a person’s name was misspelled in one system or if there are rules that limit the amount of information that can be shared between institutions.  Because of this, there are many different categories and types of record linkage that are important to consider, including deterministic vs. probabilistic linkage, centralized vs. distributed systems, and privacy-preserving linkage (the main focus of this material).  In this learning module, we aim to provide learning materials that can help readers understand fundamental aspects of record linkage, provide technical foundations, and ultimately serve as a guide for a practical implementation of privacy-preserving record linkage. 
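
    As a rough illustration of the traditional, deterministic approach described above, the sketch below links two small record sets on an exact match of normalized name plus date of birth.  The field names (`id`, `name`, `dob`) and the sample records are hypothetical and chosen only for this example; real systems typically combine more fields and more careful cleaning.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences compare equal."""
    return " ".join(value.lower().split())


def link_key(record: dict) -> tuple:
    """Combine fields that, taken together, are likely to be unique."""
    return (normalize(record["name"]), record["dob"])


def deterministic_link(records_a: list, records_b: list) -> list:
    """Return (id_a, id_b) pairs whose link keys match exactly."""
    index = {link_key(r): r for r in records_a}
    matches = []
    for r in records_b:
        hit = index.get(link_key(r))
        if hit is not None:
            matches.append((hit["id"], r["id"]))
    return matches


# Hypothetical sample data: blood tests at one hospital, imaging at another.
hospital_a = [
    {"id": "A1", "name": "John Smith", "dob": "1980-04-12", "blood_test": "normal"},
    {"id": "A2", "name": "Maria Lopez", "dob": "1975-11-02", "blood_test": "elevated"},
]
hospital_b = [
    {"id": "B1", "name": "john  smith", "dob": "1980-04-12", "imaging": "chest X-ray"},
    {"id": "B2", "name": "Wei Chen", "dob": "1990-07-30", "imaging": "CT"},
]

matches = deterministic_link(hospital_a, hospital_b)
print(matches)
```

    Here only the pair (A1, B1) is linked.  Exact matching of this kind is simple and fast, but it fails as soon as one of the fields is misspelled or missing.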

    Acronyms and key definitions in this section: 

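    Record linkage – the process of identifying when data from multiple institutions belong to the same individual or entity.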

  • Before we dive into record linkage and the main topic of this module, privacy-preserving record linkage, we first need to demonstrate why linkage techniques more complicated than directly comparing common identifying variables are needed.  In general, the reasons can be broadly grouped into practical considerations and regulatory or ethical considerations, both of which are discussed in more detail in this section. 

    Practical considerations

    While many real-world aspects of record linkage can cause problems, one of the most fundamental is noisy data.  Noisy data refers to information that is, for some reason, not entirely accurate or consistent.  For example, suppose “John Smith” needs both a blood test at Hospital A and a chest X-ray at Hospital B.  When he checked in at Hospital A, his name was misspelled on entry and is stored in the record system as “Jon Smith.”  This would be a noisy data point caused by human error.  In addition to human error, there are a variety of other causes of noisy data, including changes to a person’s identifying details over time (e.g., a new address or a changed last name), missing or incomplete data (e.g., one institution includes a middle initial and another does not), and inconsistent formatting (e.g., saving dates as MM/DD/YYYY vs. DD/MM/YYYY).  While these could potentially all be corrected by hand on a small scale, in large-scale databases spanning multiple institutions the practical solution is to use linkage techniques that can account for noisy data. 
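
    To make the “John Smith” example concrete, the sketch below compares an exact string match with an approximate similarity score from Python's standard `difflib` module.  The 0.85 threshold is an assumption picked for illustration, not a recommended setting.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1; higher means the strings are more alike."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


THRESHOLD = 0.85  # assumed cut-off for calling two names "the same"

# Exact comparison fails on the misspelled record from Hospital A.
exact = "John Smith" == "Jon Smith"

# Approximate comparison tolerates the single missing letter.
score_typo = similarity("John Smith", "Jon Smith")
score_other = similarity("John Smith", "Maria Lopez")

print("exact match:", exact)
print("typo pair above threshold:", score_typo >= THRESHOLD)
print("unrelated pair above threshold:", score_other >= THRESHOLD)
```

    Approximate comparison of this kind is one simple way to tolerate noise; the choice of similarity measure and threshold depends on the data at hand.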

    The increasingly large databases of modern times cause additional practical issues related to scalability.  Specifically, directly comparing every record in one database with every record in another is very computationally expensive.  For example, if two databases have 10 million records each (which is quite reasonable by modern standards), then comparing every record in one with every record in the other would require

    10,000,000 × 10,000,000 = 100,000,000,000,000 = 100 trillion

    comparisons.  Including any additional databases would quickly make direct comparison too time-consuming to be practical.  Thus, on top of needing to account for noisy data, any record linkage solution needs to work effectively at large scales.   
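
    The arithmetic above, and one common way of reducing it, can be sketched as follows.  The example uses blocking, meaning that only records sharing a cheap key are compared; the choice of birth year as the blocking key and the tiny sample data are assumptions made for illustration.

```python
from collections import defaultdict


def naive_comparisons(n_a: int, n_b: int) -> int:
    """Number of pairs when every record in A is compared with every record in B."""
    return n_a * n_b


def blocked_pairs(records_a, records_b, block_key):
    """Yield only the pairs whose blocking keys agree."""
    blocks = defaultdict(list)
    for record in records_a:
        blocks[block_key(record)].append(record)
    for record in records_b:
        for candidate in blocks.get(block_key(record), []):
            yield candidate, record


# Two databases of 10 million records each: 10**14 pairwise comparisons.
print(naive_comparisons(10_000_000, 10_000_000))

# Hypothetical (id, birth_year) records.
records_a = [("A1", 1980), ("A2", 1980), ("A3", 1975), ("A4", 1990)]
records_b = [("B1", 1980), ("B2", 1975), ("B3", 1962), ("B4", 1990)]

pairs = list(blocked_pairs(records_a, records_b, block_key=lambda r: r[1]))
print(len(pairs), "candidate pairs instead of",
      naive_comparisons(len(records_a), len(records_b)))
```

    Blocking trades completeness for speed: if the blocking key itself is noisy (for example, a mistyped birth year), a true match can land in different blocks and be missed.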

    Finally, while direct comparison can be effective for well-structured data such as name, date of birth, and age, among others, there are more complex forms of data that need to be considered in modern computing.  One key example is imaging data, where we want to identify whether an image (e.g., a CT scan) belongs to a certain person in the record system; an image cannot be matched by simply comparing field values, so linking it requires techniques beyond direct comparison.   

    The practical considerations listed here are not exhaustive, and they are only part of the reason why we need more advanced record linkage techniques.  The other primary motivation is regulatory concerns regarding the ethical handling of data, which often need to be addressed simultaneously with the practical restrictions. 

    Regulatory and ethical considerations

    In recent years, many laws and regulations have been formulated regarding the appropriate use and management of potentially sensitive information.  In healthcare, a key example is the management of personally identifiable information (PII), which is regulated in the United States by the Health Insurance Portability and Accountability Act (HIPAA) and in Europe by the General Data Protection Regulation (GDPR).  These laws provide guidance on when and how certain forms of data can be shared and what level of consent or justification is needed prior to transferring data between institutions.  

    This is applicable to record linkage in that many variables that would be useful in matching data from multiple institutions fall into the PII category (e.g., names), and while it may be possible to match data without using PII, it will often be significantly more difficult and inconsistent.  Thus, we need to develop techniques that disguise data such that it no longer contains sensitive information and can legally be transferred while maintaining the ability to match records between institutions under systematic constraints.   

    Acronyms and key definitions in this section: 

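    HIPAA – Health Insurance Portability and Accountability Act (United States).

    GDPR – General Data Protection Regulation (Europe).

    Noisy data – information that is, for some reason, not entirely accurate or consistent.

    PII – personally identifiable information.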

  • Privacy-preserving record linkage (PPRL) is a solution to many of the problems discussed in the prior section.  Specifically, PPRL methods allow for record linkage across datasets without sharing protected information (e.g., names, dates of birth); instead, the identifying information is transformed into a meaningful but obscured representation that is designed to protect the sensitive information from malicious use.  This can be accomplished using various techniques, including hashed tokens, Bloom filters, and embedded representations, among others, each of which is discussed in more detail in a later module.   
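
    As a minimal sketch of the simplest of these techniques, hashed tokens, the example below turns a name and date of birth into a keyed hash (HMAC-SHA256 from Python's standard library).  The shared key, the choice of fields, and the `|` separator are all assumptions made for illustration; a real deployment would agree on these, and on key management, between institutions.

```python
import hashlib
import hmac

# Hypothetical secret agreed on by the collaborating institutions in advance.
SHARED_KEY = b"example-key-agreed-out-of-band"


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace before hashing."""
    return " ".join(value.lower().split())


def hashed_token(name: str, dob: str, key: bytes = SHARED_KEY) -> str:
    """Return a keyed hash of the identifying fields instead of the fields themselves."""
    message = f"{normalize(name)}|{dob}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


token_a = hashed_token("John Smith", "1980-04-12")      # computed at Hospital A
token_b = hashed_token("john  smith", "1980-04-12")     # computed at Hospital B
token_typo = hashed_token("Jon Smith", "1980-04-12")    # misspelled record

print(token_a == token_b)     # same person, same token
print(token_a == token_typo)  # a single typo yields an unrelated token
```

    Because any change to the input produces an unrelated hash, tokens of this kind support only exact matching.  Encodings such as Bloom filters are designed to keep some notion of similarity between obscured values so that noisy records can still be matched.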

    The inclusion of PPRL technology allows institutional collaboration to be more scalable and trustworthy, addressing many of the issues presented in the previous section.  In particular, common techniques such as blocking and indexing can improve computational efficiency, institutions reduce the risk of sharing and potentially exposing sensitive data, and entities such as an honest broker can simplify multi-institutional collaboration.  These, among other reasons, enable PPRL to serve as a foundational element of modern database construction. 

    Below, we provide a high-level, general workflow describing the necessary steps for PPRL.  Note that some of these topics are discussed with more detail in later modules for more advanced readers. 

    • Each collaborating institution collects and standardizes their data.  This could include identifying potential linkage data fields, ensuring formatting is consistent, and addressing missing data if applicable. 

    • Each institution transforms the appropriate data into encoded, protected representations for linkage.  This could be through hashed tokens, Bloom filters, or other more advanced techniques.  When a trusted third party (e.g., an honest broker) performs the linkage, data encoding may potentially be bypassed. 

    • Linkage data are transmitted to the appropriate source for linkage.   

    • Linkage is performed, including data blocking and other techniques that reduce the computational workload, similarity metric calculation between representations, and determination of matched data across institutions. 

    In this manner, PPRL preserves the ability to link data while respecting privacy and practical constraints.  In the following sections, we discuss each of these steps in more detail, including different algorithmic and mechanistic design choices that influence PPRL systems. 
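
    The four workflow steps above can be sketched end to end as follows.  This is an illustrative toy rather than a production design: it assumes names are split into two-character pieces (bigrams), encoded into a Bloom-filter-style set of bit positions using keyed hashes, and compared with the Dice coefficient.  The key, filter length, number of hashes, threshold, and sample records are all hypothetical, and blocking is omitted for brevity.

```python
import hashlib
import hmac

SHARED_KEY = b"example-key-agreed-out-of-band"  # hypothetical shared secret
NUM_BITS = 512    # assumed filter length
NUM_HASHES = 4    # assumed number of bit positions set per bigram
THRESHOLD = 0.8   # assumed similarity cut-off for declaring a match


def standardize(name: str) -> str:
    """Step 1: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def bigrams(text: str) -> set:
    """Split padded text into overlapping two-character pieces."""
    padded = f"_{text}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(name: str) -> frozenset:
    """Step 2: map each bigram to bit positions; the result is the set of bits that are on."""
    bits = set()
    for gram in bigrams(standardize(name)):
        for seed in range(NUM_HASHES):
            message = f"{seed}:{gram}".encode("utf-8")
            digest = hmac.new(SHARED_KEY, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % NUM_BITS)
    return frozenset(bits)


def dice(bits_a: frozenset, bits_b: frozenset) -> float:
    """Similarity between two encodings: 1.0 means identical bit sets."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


def link(encoded_a: dict, encoded_b: dict, threshold: float = THRESHOLD) -> list:
    """Step 4: compare encodings and keep the pairs at or above the threshold."""
    return [
        (id_a, id_b)
        for id_a, bits_a in encoded_a.items()
        for id_b, bits_b in encoded_b.items()
        if dice(bits_a, bits_b) >= threshold
    ]


# Each institution encodes locally; only the encodings are transmitted (step 3).
hospital_a = {"A1": "Jon Smith", "A2": "Maria Lopez"}
hospital_b = {"B1": "John Smith", "B2": "Wei Chen"}
encoded_a = {rid: encode(name) for rid, name in hospital_a.items()}
encoded_b = {rid: encode(name) for rid, name in hospital_b.items()}

matches = link(encoded_a, encoded_b)
print(matches)
```

    In this sketch the misspelled “Jon Smith” still links to “John Smith” because most of their bigrams, and therefore most of their bit positions, are shared, while the party performing the comparison never handles the names themselves.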

    Acronyms and key definitions in this section: 

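    Honest broker – a third party, trusted by all members of a collaboration, that receives data from the institutions and performs the linkage.

    PPRL – privacy-preserving record linkage.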

  • In practice, record linkage is not typically performed by a single institution but as part of a collaboration between multiple institutions, each of which may store data about the same individuals or entities.  Often, these institutions use different technical systems, have different rules and restrictions regarding the use of their data, and may or may not fully trust one another; these factors make direct data sharing impractical at best.  Thus, while the technical aspects of data management and sharing in PPRL are critically important, the formation of a transparent, trustworthy collaboration structure is just as crucial.   

    A common solution to this problem is the establishment of an honest broker.  This is a third party that is trusted by all other members of the multiparty collaboration and, when utilized, plays a central role in the data transmission and linkage process.  When an honest broker is used, institutions in the collaboration transmit identifying data to the honest broker, either in an encoded representation or under data use and data transfer agreements; the honest broker then performs the linkage process and returns protected linkage results to each institution.  In this setup, the key point is that collaborating institutions do not receive each other’s raw data; only the honest broker is allowed to receive data and evaluate it following previously agreed-on processes (i.e., regarding governance and security).  This can simplify the PPRL process because of the institutional trust placed in the honest broker: when raw or minimally transformed data are transferred, more traditional linkage techniques can be utilized, typically resulting in improved linkage performance.  However, this collaboration structure relies on identifying a trustworthy honest broker, which may not always be suitable (e.g., if data are considered particularly sensitive).   

    Alternatively, PPRL may take advantage of some other entity (e.g., a secure, centralized server acting as a linkage unit) that receives only fully transformed data and performs the linkage process without ever seeing the raw identifiers.  In contrast to the honest broker structure, the use of such linkage units means that raw identifiers from different institutions are never held together by a single entity, and privacy protection is achieved through technical design rather than trust in a third party.  While this may be advantageous in some scenarios, it can be technically complicated to ensure that workflows and encoded representations are standardized, since raw data are never shared.  In practice, a combination of linkage units and an honest broker may be used to maximize privacy protection while still obtaining strong linkage performance. 

    Acronyms and key definitions in this section: 

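    Honest broker – a third party, trusted by all members of a collaboration, that receives data from the institutions and performs the linkage.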

More advanced topics coming soon!
