I am looking for best practice in assigning integer primary keys.
Most (all?) CDM tables require integer primary keys. Many of our local transaction systems use non-integer keys, and the natural keys (like Medical Record Number) are not appropriate for sharing even when they are integers.
We are discussing several strategies for generating the primary keys.
- For tables frequently referenced by foreign keys in other tables (PERSON, PROVIDER, VISIT_OCCURRENCE, VISIT_DETAIL, and perhaps LOCATION and CARE_SITE): construct mapping tables in the data staging area from the natural key of the record to an appropriate integer. These would likely use the auto-key assignment facility of the database on row creation. These tables would provide a lightweight way to perform foreign key lookups throughout the ETL cycle. After the ETL cycle the tables could be deleted.
Potential negatives: The PKs assigned would change from one ETL cycle to the next, but would be consistent within a cycle. This is fine for flush and fill (complete replacement) batch processing, but would not support incremental (changes only) ETLs. Incremental changes might be supported if these staging tables were preserved and referenced during subsequent ETL cycles.
- For most other tables (which are not used as foreign key references) the PK integers could be assigned programmatically during the ETL, or an auto-increment constraint could be added to the CDM schema. The auto-increment option sounds most attractive in a flush and fill (complete replacement) batch process, but it does require adding a constraint to the standard schema.
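To make the mapping-table idea concrete, here is a minimal sketch of what I have in mind, using Python's built-in sqlite3 module purely for illustration. The table names (person_map, etc.), the column names (cdm_id, natural_key), the helper functions, and the sample MRN-style keys are all hypothetical and not part of the CDM; our real staging area would use whatever auto-key facility the target database offers.

```python
import sqlite3

# Hypothetical staging tables, one per CDM table whose keys are referenced elsewhere.
MAP_TABLES = ("person_map", "provider_map", "visit_occurrence_map")


def create_map_tables(conn):
    """Create the natural-key -> integer mapping tables if they do not exist."""
    for name in MAP_TABLES:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} ("
            "cdm_id INTEGER PRIMARY KEY, "        # integer assigned by the database
            "natural_key TEXT NOT NULL UNIQUE)"   # source system key, never shared
        )


def lookup_or_assign(conn, map_table, natural_key):
    """Return the integer for a natural key, assigning a new one on first sight."""
    if map_table not in MAP_TABLES:
        raise ValueError(f"unknown mapping table: {map_table}")
    row = conn.execute(
        f"SELECT cdm_id FROM {map_table} WHERE natural_key = ?", (natural_key,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        f"INSERT INTO {map_table} (natural_key) VALUES (?)", (natural_key,)
    )
    return cur.lastrowid


# In-memory database: mappings vanish afterwards, as in a flush and fill cycle.
conn = sqlite3.connect(":memory:")
create_map_tables(conn)

person_id = lookup_or_assign(conn, "person_map", "MRN-000123")
same_person_id = lookup_or_assign(conn, "person_map", "MRN-000123")
other_person_id = lookup_or_assign(conn, "person_map", "MRN-000456")
```

Within one cycle, repeated lookups of the same natural key return the same integer, so every foreign key reference resolves consistently. If the mapping tables were kept in a persistent database instead of being dropped after the cycle, the same lookup would return the previously assigned integer in later cycles, which is what an incremental ETL would need.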
Thoughts and advice welcome.