Assigning integer primary keys

acoltri · June 28, 2020, 7:10pm

I am looking for best practice in assigning integer primary keys.

Most (all?) CDM tables require integer primary keys. Many of our local transaction systems use non-integer keys, and the natural keys (like Medical Record Number) are not appropriate for sharing even when they are integers.

We are discussing several strategies for generating the primary keys.

For tables frequently used as foreign keys in other tables (PERSON, PROVIDER, VISIT_OCCURRENCE, VISIT_DETAIL, and perhaps LOCATION AND CARE_SITE) – construct mapping tables in the data staging area from the natural key of the record to an appropriate integer. These would likely use the auto-key assignment facility of the database on row creation. These table would provide a lightweight way to perform foreign key lookups throughout the ETL cycle. After the ETL cycle the tables could be deleted.

Potential negatives: The PKs assigned in each ETL cycle would change, but would be consistent during the cycle. This is fine for flush and fill (complete replacement) batch processing. But would not support incremental (changes only) ETLs. Support for incremental changes might be accomplished if these staging tables were preserved and referenced during subsequent ETL

For most other tables (which are not used as foreign key references) the PK integers could be assigned programmatically during the ETL, or an auto increment constraint can be added to the CDM schema. The auto-increment sounds most attractive in a flush and fill (complete replacement) batch process, but it does require adding a constraint to the standard schema.

Thoughts and advice welcome.

mkwong · June 28, 2020, 9:56pm

There is value in using the default - AUTO_INCREMENT in all the primary keys within the OMOP CDM as it essentially de-identifies the records for retrospective and data sharing use. It also simplifies the ETL design and implementation regardless of whether you do a full refresh or incrementally add records.

In our veterinary OMOP warehouse we do incremental addition to the OMOP CDM and in the near future will participate/share in collaborative and pooled OneHealth research use cases. However, there is a need for local - within the firewall - use of the OMOP CDM by authorized investigators. This is handled by implementing a local-only master_index table that retains the EHR’s medical record number, encounter number, and other patient identifiers along with the patient’s automatically assigned Integer person_id value. Therefore each day the ETL runs, it looks up the person_id from master_index table and then adds records to the OMOP CDM. If I need to share the data to authorized outside collaborators, I can dump the OMOP CDM excluding the master_index table without further de-identification steps or processes.

The master_index can also retain the randomly assigned time-shift value as well if you want to hide both the MRN/Encounter number and all date/time values assigned in the visit_occurrence, observations, procedure_occurrence, etc.

There are other methods and approaches, but the above seem to work for us in an automatically daily upload/update implementation of an OMOP 5.2 CDM.