Generating unique, stable identifiers without collisions

MPhilofsky · January 5, 2023, 10:54pm

How do you generate unique, stable identifiers without collisions at your institution? Currently, we are using Google’s Farm Fingerprint to create our PKs in the CDM. However, as we add more data, we are having more collisions in the generated identifiers even though the inputs are different.

Eduard_Korchmar · January 5, 2023, 11:00pm

In Jackalope we use hashing algorithms. Git uses SHA-1 by default (SHA-256 is available as an option), and I used BLAKE. There are different flavors, but the end result is the same: with a wide enough digest, an output collision is practically impossible, though never mathematically impossible.

mgkahn · January 5, 2023, 11:39pm

@Eduard_Korchmar Google also claims that their Farm Fingerprint hashing algorithm is highly unlikely to produce collisions. But now that our tables have exceeded several billion rows, we ARE seeing collisions: different inputs produce the same hash output despite the long odds. So we are seeking advice about other hashing algorithms that have extremely high entropy characteristics and project into a very (very, very, very) large hash space. Have other institutions with very large measurement or observation tables (e.g., all of the flowsheet rows in their Epic instance across 11 hospitals) or huge billing data sets run into collisions when generating primary keys? If so, what did you do to break these collisions? Thanks, everybody.
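
For a sense of scale, the collisions described above are roughly what the birthday bound predicts for a 64-bit hash. The sketch below is only an estimate: it assumes the fingerprint behaves like a uniformly random 64-bit value, and the function names are made up for illustration.

```python
import math


def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among n_rows hashes.

    Assumes hash outputs are uniformly distributed over 2**hash_bits values
    (an idealization; real hash functions only approximate this).
    """
    space = 2.0 ** hash_bits
    # Birthday approximation: p = 1 - exp(-n(n-1) / (2N))
    return -math.expm1(-n_rows * (n_rows - 1) / (2.0 * space))


def expected_colliding_pairs(n_rows: int, hash_bits: int) -> float:
    """Expected number of row pairs that share a hash value."""
    return n_rows * (n_rows - 1) / (2.0 * 2.0 ** hash_bits)


if __name__ == "__main__":
    for rows in (2_000_000_000, 10_000_000_000):
        for bits in (64, 128):
            p = collision_probability(rows, bits)
            pairs = expected_colliding_pairs(rows, bits)
            print(f"{rows:>14,} rows, {bits:>3}-bit hash: "
                  f"p(collision) ~ {p:.3g}, expected pairs ~ {pairs:.3g}")
```

Under these assumptions, 2 billion rows in a 64-bit space already carry roughly a 10% chance of at least one collision, and at 10 billion rows a few colliding pairs are expected. A 128-bit space makes the same row counts negligible, but such a value no longer fits in a 64-bit integer column.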

MPhilofsky · January 12, 2023, 8:29pm

One other requirement: the output must be an integer, since it will be used as the PK for the OMOP CDM tables.
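
To make the integer requirement concrete, here is a minimal sketch (not anyone's production method) that derives a signed 64-bit integer from a BLAKE2b digest using Python's standard `hashlib`, and re-salts when two different keys land on the same value. The function names, the string form of the natural key, and the in-memory dictionaries are all assumptions for illustration. In practice the key-to-ID mapping would have to be persisted as a crosswalk table, because a re-salted ID depends on which key arrived first.

```python
import hashlib
from typing import Dict, Iterable


def stable_int_id(natural_key: str, salt: int = 0) -> int:
    """Hash a natural key to a signed 64-bit integer (fits a BIGINT column).

    digest_size=8 yields 8 bytes; the salt (at most 16 bytes for BLAKE2b)
    lets us derive an alternative value if a collision must be broken.
    """
    digest = hashlib.blake2b(
        natural_key.encode("utf-8"),
        digest_size=8,
        salt=salt.to_bytes(16, "little"),
    ).digest()
    return int.from_bytes(digest, "big", signed=True)


def assign_ids(natural_keys: Iterable[str]) -> Dict[str, int]:
    """Assign one integer per distinct key, re-salting on collisions.

    Hypothetical helper: the returned mapping must be stored and reused,
    otherwise a re-salted ID is not reproducible across loads.
    """
    key_to_id: Dict[str, int] = {}
    id_to_key: Dict[int, str] = {}
    for key in natural_keys:
        if key in key_to_id:
            continue  # already assigned; keep the existing ID stable
        salt = 0
        candidate = stable_int_id(key, salt)
        while candidate in id_to_key:  # taken by a different key
            salt += 1
            candidate = stable_int_id(key, salt)
        key_to_id[key] = candidate
        id_to_key[candidate] = key
    return key_to_id


if __name__ == "__main__":
    ids = assign_ids(["site01|enc123|row1", "site01|enc123|row2"])
    for key, value in ids.items():
        print(key, "->", value)
```

Truncating a stronger hash to 64 bits does not by itself lower the collision odds compared with any other well-distributed 64-bit hash; the column width is the limit. What removes collisions in this sketch is the explicit check against already-issued IDs.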