Proposal: Identity Management Table for Standardized Linkage & De-Identification

Daniella_Meeker · May 11, 2017, 3:46am

It seems like this should exist unless it fundamentally breaks core principles.
Uses:

Merge geocoded data on social and environmental determinants using address information
Prepare data for privacy-preserving record linkage algorithms or global GUIDs before loading the CDM
Merge identifiable data from public records (e.g. death records)
Standardize lookup procedures for re-identification
Standardize de-duplication methods

FHIR Patient
http://hl7.org/fhir/patient.html
or FHIR Person
http://hl7.org/fhir/person.html
US Core
http://hl7.org/fhir/us/core/StructureDefinition-us-core-patient.html

Questions

Allow for many-to-1?
Include nested content? What should be denormalized?

This is part of a few other projects so I started extending FHIR
https://docs.google.com/spreadsheets/d/1PXM2NnF6EGkO9STv2Q14BA1btm_a9iby2Bqj0s4oSzk/edit?usp=sharing

guanguo · June 17, 2019, 4:47am

Hi All,

Reviving this old thread, as we also found it would be important to have a table to store linkage and de-identification data. I’m not sure whether there is a way to standardize linkage, since the linkage method is restricted by the type and content of the source data.

We are working with EHR data extracted from GPs and hospitals. Post-extraction, we run the data through a de-duplication and linkage process. Currently, there is no table in the CDM where this linkage data can be stored to reflect that a number of patient records have been linked, whether through de-duplication or through linkage across different data sources.

For example:
One patient with three records from three different locations.

LINK_ID	UID	Site
1	00DD82F2-F5AF-4347-877B-DFC6C4361ADC	NSW_1
1	0016778C-8BC6-4ABC-BE0F-0683F27EB0F7	NSW_2
1	49B1CA71-9BF8-4E7B-A2E1-CDDF1B311100	NSW_3
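As a rough illustration of how a table like the one above might be used, here is a minimal Python sketch. The rows are copied from the example; the function names `records_by_link` and `link_for_uid` are hypothetical and not part of the OMOP CDM or any existing tool.

```python
from collections import defaultdict

# Rows copied from the example above: (LINK_ID, UID, Site).
LINK_ROWS = [
    (1, "00DD82F2-F5AF-4347-877B-DFC6C4361ADC", "NSW_1"),
    (1, "0016778C-8BC6-4ABC-BE0F-0683F27EB0F7", "NSW_2"),
    (1, "49B1CA71-9BF8-4E7B-A2E1-CDDF1B311100", "NSW_3"),
]


def records_by_link(rows):
    """Group (uid, site) pairs under the LINK_ID they were linked to."""
    grouped = defaultdict(list)
    for link_id, uid, site in rows:
        grouped[link_id].append((uid, site))
    return dict(grouped)


def link_for_uid(rows, uid):
    """Return the LINK_ID a source UID resolves to, or None if it is unlinked."""
    for link_id, row_uid, _site in rows:
        if row_uid == uid:
            return link_id
    return None
```

Looking up any of the three UIDs returns the same LINK_ID, which is the information we currently have nowhere to keep in the CDM.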

Christian_Reich · June 17, 2019, 5:33am

@guanguo:

You are correct. It doesn’t exist. A database in OMOP CDM only expects one identifier per patient, the person_id in the PERSON table. All other tables depend on this identifier. Splitting it into many with linking tables would torture the model and kill the performance.

However, many people engage in linking of patients between data sources, to combine the information from each and to create a richer asset. That does require a process of linking, using identifiable information (often performed by a 3rd party) and de-duplication of information available redundantly in the sources. Typically, people create a combo database for all patients that can be linked between the sources, and the person_id is created de novo to reduce possible re-identification of the source databases.

From an OMOP CDM perspective none of this is relevant to the model. All these transformations are part of the ETL of the database instance, i.e. before birth. When the analytic applications get to see the combined data asset there is no trace of the origins.
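To make the ETL step concrete, here is a minimal Python sketch of assigning a de novo person_id per linked patient. It assumes linkage has already produced rows of (LINK_ID, UID, Site); the function name `assign_person_ids`, the shuffling approach, and the last sample row are illustrative assumptions, not an OHDSI convention.

```python
import random

# (LINK_ID, UID, Site). The first three rows come from the example earlier in
# the thread; the last row is made up so that a second person exists.
LINKED_ROWS = [
    (1, "00DD82F2-F5AF-4347-877B-DFC6C4361ADC", "NSW_1"),
    (1, "0016778C-8BC6-4ABC-BE0F-0683F27EB0F7", "NSW_2"),
    (1, "49B1CA71-9BF8-4E7B-A2E1-CDDF1B311100", "NSW_3"),
    (2, "HYPOTHETICAL-UID-0001", "NSW_1"),
]


def assign_person_ids(rows, rng=None):
    """Map each (site, uid) to a freshly generated person_id, one per LINK_ID."""
    rng = rng or random.SystemRandom()
    link_ids = sorted({link_id for link_id, _uid, _site in rows})
    rng.shuffle(link_ids)  # new ids carry no ordering from the sources
    person_id_of = {link_id: n for n, link_id in enumerate(link_ids, start=1)}
    return {(site, uid): person_id_of[link_id] for link_id, uid, site in rows}


# Crosswalk used only during ETL; the CDM tables would carry person_id alone.
crosswalk = assign_person_ids(LINKED_ROWS)
```

In this sketch the crosswalk is an ETL artifact: every source record for the same patient lands on one person_id, and the loaded CDM instance keeps no reference to the source UIDs.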

@Daniella_Meeker: Have you managed to create a standard for this?

cce · June 17, 2019, 9:27am

There is a GUID system used by NIMH that originated as part of the SFARI project. The method is described in the paper "Using global unique identifiers to link autism collections". Basically, a set of hashes based upon PII is created, and those are used to obtain a GUID. This GUID is then unique in most cases, but importantly, collisions can be detected, and the participating organizations can then work together to determine whether a hash collision is a coincidence or whether it’s actually the same subject enrolled in different studies. Something like this could be scaled even more broadly for OHDSI. Even NIMH doesn’t get the PII, just the hashes.
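The general idea can be sketched in Python as follows. This is not the NIMH algorithm: the field combinations in `HASH_RECIPES`, the normalization, the `GuidRegistry` class, and the review rule are all assumptions made for illustration, and plain SHA-256 stands in for whatever keyed or salted hashing a real deployment would need.

```python
import hashlib
import uuid

# Hypothetical field combinations; a real system defines its own.
HASH_RECIPES = (
    ("first_name", "last_name", "birth_date"),
    ("last_name", "birth_date", "birth_city"),
)


def normalize(value):
    """Strip all whitespace and upper-case, so trivial variants hash alike."""
    return "".join(value.split()).upper()


def pii_hashes(person):
    """Compute one SHA-256 hash per recipe; only these leave the site."""
    hashes = []
    for fields in HASH_RECIPES:
        joined = "|".join(normalize(person[f]) for f in fields)
        hashes.append(hashlib.sha256(joined.encode("utf-8")).hexdigest())
    return tuple(hashes)


class GuidRegistry:
    """Toy central registry that sees hashes only, never the PII."""

    def __init__(self):
        self._guid_by_hash = {}

    def resolve(self, hashes):
        """Return (guid, needs_review) for a submitted set of hashes."""
        matched = {self._guid_by_hash[h] for h in hashes if h in self._guid_by_hash}
        if not matched:
            guid = str(uuid.uuid4())  # nothing seen before: issue a new GUID
        else:
            guid = sorted(matched)[0]
            all_known = all(h in self._guid_by_hash for h in hashes)
            if len(matched) > 1 or not all_known:
                # Partial or conflicting match: leave it for the sites to review.
                return guid, True
        for h in hashes:
            self._guid_by_hash.setdefault(h, guid)
        return guid, False
```

Two sites submitting the same person would get the same GUID back, while a partial match is flagged so the organizations can work out whether it is a coincidence or the same subject.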