
Proposal: Identity Management Table for Standardized Linkage & De-Identification

(Daniella Meeker) #1

It seems like this should exist unless it fundamentally breaks core principles.

  1. Merge geocoded data on social and environmental determinants, keyed on address information
  2. Prepare data for privacy-preserving record linkage algorithms/Global GUIDs before loading the CDM
  3. Merge identifiable data from public records (e.g. death)
  4. Standardize lookup procedures for re-identification
  5. Standardize de-duplication methods

FHIR Patient
or FHIR Person
US Core


  1. Allow for many-to-1?
  2. Include nested content? What should be denormalized?

This is part of a few other projects, so I started extending FHIR.

(Guan) #2

Hi All,

Reviving this old thread, as we also found it would be important to have a table to store linkage and de-identification data. I’m not sure whether there is a way to standardize linkage, as the linkage method is restricted by the type and content of the source data.

We are working with EHR data extracted from GPs and hospitals. Post-extraction, we run the data through a de-duplication and linkage process. Currently, there is no table in the CDM where this linkage data can be stored to reflect when a number of patient records have been linked, either through de-duplication or through linkage across different data sources.

For example:
One patient with three records from three different locations.

1 00DD82F2-F5AF-4347-877B-DFC6C4361ADC NSW_1
1 0016778C-8BC6-4ABC-BE0F-0683F27EB0F7 NSW_2
1 49B1CA71-9BF8-4E7B-A2E1-CDDF1B311100 NSW_3
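
To make the example concrete, here is a minimal Python sketch of how such linkage rows could be held and grouped outside the CDM. The column meanings are an assumption on my reading of the rows above (linked patient identifier, source record identifier, source location), and the names `LinkageRow` and `group_by_person` are hypothetical, not part of any OMOP specification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkageRow:
    person_id: int         # assumed: identifier shared by all linked records
    source_record_id: str  # assumed: identifier of the record in the source extract
    source_name: str       # assumed: label of the originating location


rows = [
    LinkageRow(1, "00DD82F2-F5AF-4347-877B-DFC6C4361ADC", "NSW_1"),
    LinkageRow(1, "0016778C-8BC6-4ABC-BE0F-0683F27EB0F7", "NSW_2"),
    LinkageRow(1, "49B1CA71-9BF8-4E7B-A2E1-CDDF1B311100", "NSW_3"),
]


def group_by_person(linkage_rows):
    """Return {person_id: [(source_name, source_record_id), ...]}."""
    grouped = defaultdict(list)
    for row in linkage_rows:
        grouped[row.person_id].append((row.source_name, row.source_record_id))
    return dict(grouped)


linked = group_by_person(rows)
for source_name, source_record_id in linked[1]:
    print(source_name, source_record_id)
```

A lookup in this direction (from the linked patient back to the source records) is what a re-identification or audit step would need, which is why the sketch keeps the source location alongside each record identifier.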

(Christian Reich) #3


You are correct. It doesn’t exist. A database in OMOP CDM only expects one identifier per patient, the person_id in the PERSON table. All other tables depend on this identifier. Splitting it into many with linking tables would torture the model and kill the performance.

However, many people engage in linking of patients between data sources, to combine the information from each and to create a richer asset. That does require a process of linking, using identifiable information (often performed by a third party), and de-duplication of information available redundantly in the sources. Typically, people create a combined database for all patients that can be linked between the sources, and the person_id is created de novo to reduce possible re-identification of the source databases.

From an OMOP CDM perspective none of this is relevant to the model. All these transformations are part of the ETL of the database instance, i.e. before birth. When the analytic applications get to see the combined data asset there is no trace of the origins.
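
As a rough illustration of that ETL-time step, the sketch below merges pairwise matches into clusters with a small union-find and then hands out fresh person_id values in shuffled order, so the new identifier carries no trace of the source ordering. Everything here is an assumption for illustration: the function names, the `(source, record)` tuples, and the idea that a linkage step has already produced the list of matching pairs.

```python
import random


def find(parent, x):
    """Return the cluster root of x, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def assign_person_ids(records, matches, seed=None):
    """Map every source record to a de novo person_id (hypothetical helper).

    records: list of hashable source record keys
    matches: list of (record_a, record_b) pairs judged to be the same patient
    """
    parent = {r: r for r in records}
    for a, b in matches:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a

    roots = sorted({find(parent, r) for r in records})
    # Shuffle so the new ids do not follow source order. A real pipeline
    # would likely want a stronger source of randomness than random.Random.
    random.Random(seed).shuffle(roots)
    new_id = {root: i + 1 for i, root in enumerate(roots)}
    return {r: new_id[find(parent, r)] for r in records}


records = [("NSW_1", "a"), ("NSW_2", "b"), ("NSW_3", "c"), ("NSW_1", "d")]
matches = [
    (("NSW_1", "a"), ("NSW_2", "b")),
    (("NSW_2", "b"), ("NSW_3", "c")),
]
person_ids = assign_person_ids(records, matches, seed=0)
```

Only the resulting person_id would be loaded into the PERSON table; the mapping itself would stay with whoever runs the ETL.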

@Daniella_Meeker: Have you managed to create a standard for this?

(Clark C. Evans) #4

There is a GUID system used by NIMH that originated as part of the SFARI project. The method is described in the paper "Using global unique identifiers to link autism collections". Basically, a set of hashes based upon PII is created, and those are used to obtain a GUID. This GUID is then unique in most cases, but importantly, collisions can be detected, and the participating organizations can then work together to determine whether a hash collision is a coincidence or actually the same subject enrolled in different studies. Something like this could be applied at an even broader scale for OHDSI. Even NIMH doesn’t get the PII, just the hashes.
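
To give a feel for the idea, here is a loose Python sketch of a hash-then-GUID flow. It is not the NIMH implementation: the choice of PII fields, the field combinations, the keyed SHA-256 hashing, and the `GuidRegistry` class are all assumptions made up for illustration. The point is only that the registry sees hash codes rather than PII, and that a match against an existing GUID gets flagged for the sites to review.

```python
import hashlib
import hmac
import uuid

# Hypothetical shared secret; a keyed hash makes dictionary attacks on the
# hash codes harder than a plain unkeyed hash would.
SHARED_KEY = b"example-key-not-for-production"


def normalise(value):
    """Lower-case and strip all whitespace so trivial variants hash alike."""
    return "".join(value.lower().split())


def hash_codes(first, last, dob, city):
    """Return a set of hash codes over assumed combinations of PII fields."""
    combos = [(first, last, dob), (first, dob, city), (last, dob, city)]
    codes = set()
    for index, combo in enumerate(combos):
        message = str(index) + "|" + "|".join(normalise(v) for v in combo)
        digest = hmac.new(SHARED_KEY, message.encode("utf-8"), hashlib.sha256)
        codes.add(digest.hexdigest())
    return frozenset(codes)


class GuidRegistry:
    """Central registry that only ever receives hash codes (hypothetical)."""

    def __init__(self):
        self._by_code = {}  # hash code -> GUID
        self.review = []    # (guid, study) pairs where a hash matched an existing GUID

    def register(self, codes, study):
        matched = {self._by_code[c] for c in codes if c in self._by_code}
        if matched:
            guid = sorted(matched)[0]
            self.review.append((guid, study))  # sites decide: same subject or coincidence
        else:
            guid = str(uuid.uuid4())
        for code in codes:
            self._by_code.setdefault(code, guid)
        return guid


registry = GuidRegistry()
g1 = registry.register(hash_codes("Jane", "Doe", "1980-01-01", "Sydney"), "study_A")
g2 = registry.register(hash_codes(" jane", "DOE", "1980-01-01", "sydney"), "study_B")
g3 = registry.register(hash_codes("John", "Roe", "1975-05-05", "Sydney"), "study_A")
```

In this sketch the second registration returns the same GUID as the first and lands in `registry.review`, while the third subject gets a new GUID.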