I’m at the early stages of implementing the CDM, and something I’ve been thinking a lot about is how best to generate a unique identifier to use for the person_id. In our institution we’d like these IDs to be consistent over time, so that we can add new rows to the data with new unique IDs as patients are added.
One thing I’ve been conscious of is the size of the integer, since through some of the work group discussions I’ve learned others have encountered downstream limits on the size of the INT that would be usable (e.g., 28 or 32 bit max). While the DDL and specification document define person_id as an INTEGER, I’m wondering whether there’s general guidance on INT size limits, and on the possibility of using a BIGINT instead?
My implementation is within Databricks (i.e., Spark) and might potentially use dbt (though that could be limited to steps downstream of person_id generation if it would add complexity to this process). One option in Databricks is BIGINT GENERATED ALWAYS AS IDENTITY. Though this creates a BIGINT, as long as we don’t exceed roughly 2 billion patients (which we wouldn’t expect), it seems like it could be cast to an INT for uses outside of Databricks if needed. row_number also seems like a reasonable alternative.
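For reference, this is roughly what the identity-column option would look like (a minimal sketch with illustrative table/column names, assuming a Delta table):

-- One-time table definition; Databricks assigns person_id automatically on insert
CREATE TABLE internal_patient_tracking (
  person_id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
  ehr_patient_id STRING,
  pat_name STRING,
  birth_date DATE
) USING DELTA;

-- Incremental loads omit person_id and let the identity column fill it in
INSERT INTO internal_patient_tracking (ehr_patient_id, pat_name, birth_date)
SELECT ehr_patient_id, pat_name, birth_date
FROM new_patients;

My understanding is that identity column values are unique but not guaranteed to be consecutive, so gaps can appear over time.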
I’m wondering if others have insights to share about methods you’ve used for the person_id implementation in a way that is consistent over time with incremental data additions and that considers integer size limitations?
I’ve included below a summary of the main options I’m aware of within the Databricks context (with a rough usage sketch after the list):
monotonically_increasing_id - returns a BIGINT that could potentially be cast to an INT; I don’t think it will stay unique across incremental additions
uuid - returns a universally unique identifier (UUID) as a 36-character string; if converted to a number it would be far too large for an INT
row_number - incremental additions seem possible if the last maximum ID is tracked between loads
xxhash64 - hash function that creates an identifier from a combination of several columns. Returns a BIGINT (so the values are generally large), though there may be other hash functions that return an INT. Could experience non-uniqueness (collisions) in very large datasets
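For context, here’s a rough sketch of what each would look like in Spark SQL (table and column names are just placeholders, not our actual schema):

-- uuid(): unique, but a 36-character string rather than an INT
SELECT uuid() AS person_uuid FROM ehr_patient_table;

-- xxhash64(): deterministic BIGINT from one or more source columns; collisions possible at scale
SELECT xxhash64(ehr_system, ehr_patient_id) AS person_hash FROM ehr_patient_table;

-- monotonically_increasing_id(): unique within a single job, but not stable across incremental loads
SELECT monotonically_increasing_id() AS person_id, ehr_patient_id FROM ehr_patient_table;

-- row_number(): dense 1..N numbering; needs the last maximum ID tracked for incremental additions
SELECT ROW_NUMBER() OVER (ORDER BY ehr_patient_id) AS person_id, ehr_patient_id FROM ehr_patient_table;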
Bigints across platforms are really tricky. There’s no real ISO standard for handling them the way there is for the floating point numbers we have in R, Python, JavaScript, etc.
What I’d do is maintain an int-to-{insert your unique key format here} map where, when you identify someone new, you increment your int and add the new pair to the map. This way you get 2 billion unique lives, and your insertion algorithm keeps the IDs tightly packed. This mapping is available during your ETL; maybe use person_source_value as your mapping key, since I think that’s a varchar.
In this way, you stay within the bounds of the CDM specs, but have this added flexibility for dealing with complex person numbering rules.
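If it helps, a minimal sketch of that mapping idea (table and column names are just illustrative, assuming your unique key ends up in person_source_value):

-- Mapping table: one compact INT per unique source key, assigned once and never changed
CREATE TABLE person_id_map (
  person_id INT,               -- stays within the CDM INTEGER spec
  person_source_value STRING   -- your unique key from the source data
);

-- During ETL, join the source data to the map to recover the same person_id on every run
SELECT m.person_id, s.*
FROM source_person s
JOIN person_id_map m
  ON m.person_source_value = s.person_source_value;

The important property is that a given source key gets its INT assigned exactly once and is never re-assigned.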
Hi Chris, thanks for the helpful suggestion. Mapping a consistent INT to a unique source value seems like a solid way to stay within CDM specs while handling unique IDs reliably. We’re also considering this, but still figuring out how to manage the mapping table during incremental loads without risking duplicates. If anyone has a simple method for tracking the last used ID or keeping the mapping consistent, that would be great to hear. Also curious if this approach has caused any performance or compatibility issues in larger projects.
@Chris_Knoll thanks for your suggestion! I’m generating the IDs as part of a process before the CDM generation begins in earnest, within an internal table that resolves unique individuals across our EHR systems and contains personal identifiers that won’t be in the CDM. I was originally thinking these generated IDs would map directly (as a foreign key) to the person_id column in the CDM person table, but it seems like it would work fine, as you suggested, to map those IDs to person_source_value (converted from a BIGINT to a STRING) and then to person_id (as an INT) - all the same number, just saved as three different data types.
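Concretely, I’m picturing the same generated number carried through the three types something like this (illustrative Spark SQL, not our actual tables):

SELECT
    t.generated_id                  AS internal_id           -- BIGINT in the internal tracking table
  , CAST(t.generated_id AS STRING)  AS person_source_value   -- stored as a string in the CDM
  , CAST(t.generated_id AS INT)     AS person_id             -- fits in an INT while we stay under ~2.1 billion
FROM internal_patient_tracking t;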
@Laudy In considering using row_number for this ID generation, I was thinking that with every insert statement to add new patients I could just calculate the existing maximum ID value and start the increments from there (like the code below), though I’m afraid there are probably some edge cases with concurrent transactions that might throw a wrench in this.
INSERT INTO INTERNAL_PATIENT_TRACKING
WITH max_person_id AS (
  -- Highest ID assigned so far (0 if the table is still empty)
  SELECT COALESCE(MAX(PERSON_ID), 0) AS MAX_PERSON_ID
  FROM INTERNAL_PATIENT_TRACKING
)
SELECT
    CAST(ROW_NUMBER() OVER (ORDER BY P.EHR_PATIENT_ID) + MPD.MAX_PERSON_ID AS INT) AS PERSON_ID
  , P.PAT_NAME
  , P.BIRTH_DATE
FROM EHR_PATIENT_TABLE P
CROSS JOIN max_person_id MPD   -- single-row CTE, so a cross join attaches MAX_PERSON_ID to every row
Yes, the use of a max() and a row_number() will work; the trick is to only insert new person_ids for patients that you’ve determined are new. It looked like you went directly to your EHR patient table, but (assuming that’s your raw patient table) you want to use your INTERNAL_PATIENT_TRACKING table to know ‘who you know’, join it with your EHR_PATIENT_TABLE to find out who is new, and only for those new people apply row_number() + max(internal_patient_tracking.person_id) to generate new person_ids.
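Roughly along these lines (just a sketch reusing your table names, and assuming INTERNAL_PATIENT_TRACKING keeps the source EHR_PATIENT_ID so you can tell who is already known):

INSERT INTO INTERNAL_PATIENT_TRACKING
WITH max_person_id AS (
  SELECT COALESCE(MAX(PERSON_ID), 0) AS MAX_PERSON_ID
  FROM INTERNAL_PATIENT_TRACKING
),
new_patients AS (
  -- Only patients that do not yet have a person_id assigned
  SELECT P.*
  FROM EHR_PATIENT_TABLE P
  LEFT JOIN INTERNAL_PATIENT_TRACKING T
    ON T.EHR_PATIENT_ID = P.EHR_PATIENT_ID
  WHERE T.PERSON_ID IS NULL
)
SELECT
    CAST(ROW_NUMBER() OVER (ORDER BY NP.EHR_PATIENT_ID) + MPD.MAX_PERSON_ID AS INT) AS PERSON_ID
  , NP.PAT_NAME
  , NP.BIRTH_DATE
FROM new_patients NP
CROSS JOIN max_person_id MPD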
As an aside, you could have a dedicated person_tracking database in something like Postgres, where you have access to a thing called a sequence. Other transactional DBs (like MS SQL Server) support this too. In your row_number() + max approach, the ‘corner case’ you may run into is that selecting the max() from your tracking table gives you the max as of that session, so if you run your person tracking loading across multiple servers/processes (such as each external EHR site that you are merging running concurrently with the others), you will get person_id collisions because each of those might have read the same max(internal_patient_tracking.person_id). Sequences, however, are designed to yield unique next-values across all sessions.
To pseudocode how you’d use sequences:
-- Assuming you have a NEW_PATIENTS table/temp table that contains
-- the new patients that are not tracked in the INTERNAL_PATIENT_TRACKING yet
INSERT INTO INTERNAL_PATIENT_TRACKING (person_id, source_person_id)
SELECT nextval('your_person_sequence_name'), p.source_person_id
FROM NEW_PATIENTS p;
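(The sequence itself is a one-time setup; in Postgres it would be something like the below, where the name is whatever you choose, and you can seed it past any person_ids you’ve already assigned:)

-- One-time setup in the dedicated tracking database (Postgres syntax)
CREATE SEQUENCE your_person_sequence_name START WITH 1;

-- If person_ids already exist, position the sequence just past the current max
SELECT setval('your_person_sequence_name',
              (SELECT COALESCE(MAX(person_id), 0) + 1 FROM INTERNAL_PATIENT_TRACKING),
              false);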
Now, if you have multiple processes going, they can each build their own internal ‘NEW_PATIENTS’ table and run concurrently with each other, because nextval('sequence_name') is guaranteed to be unique across all DB sessions.
And having all this in a dedicated DB means you can manage backups of it independently of all the other ETL processes being performed, etc. Might be a good approach.