Use string instead of long for primary keys

parisni · November 23, 2019, 11:15am

Primary keys have for decades been sequential numbers. OMOP moved recently from 4 byte integers to 8 byte integers (bigint).

Modern distributed system do not deal easily with sequential generation of such keys.
Old analysis system does not handle bigint well.

Moving from bigint to string primary keys would have several advantages:

allow use of GUID (which is distributed friendly)
allow to pack multiple OMOP dataset together when needed (no collisions on GUID/String)
ease the ETL (no need to maintain PK and FK as numeric)
allow R to handle them natively
simplify loading distributed datawarehouse within OMOP

This has several drawbacks:

probabrly impact on several existing code and software
slow processing (?)
increase dataset size

All together, I would say that a breaking change would benefit for the future of this project.

Thought ?

Chris_Knoll · November 23, 2019, 4:08pm

I would not support this change (to use String).

Storing a string as a key means that you might be tempted to put some sort of information embedded in the ‘key’ of the data. But these identifiers should not contain information about what the record is in the identifier…that’s called an ‘antipattern’

In addition, the resource size of the identifier would be extremely large compared to the equivalent 4 byte bigint:

549755813888 = 4 bytes
“549755813888” = 6*2 bytes = 12 bytes plus any varchar storage overhead in the encoding. It would be 24 bytes if your varchars are stored as UTF-16.

Finding the record by the id would also be a more expensive operation:
549755813888
549755813828

These strings are the same length, but you’d have to calculate a hash on it and compare those values…and if you’re storing the hash of the value anyway, then you’re not gaining anything except a ‘readable’ key which means you’re storing information in the key.

Indexing on these columns would be impacted from the larger storage size: less rows would fit within a page, hence making your B-tree much larger.

I am not sure I understand:

What does the data type of the PK/FK have to do with the ease of maintaining it?

Edit:
<oops>
I’m pretty sure R handles numbers natively very well.
</oops>

I understand your point now on R and 64bits…if you want to do certain operations on vectors of values (like avg, sum, etc), but…these are keys, you won’t be doing that. the as.numeric() will handle that datatype.

You mention a GUID, which is not a string, but a 16 byte value. But it’s just another numeric value that is formatted as xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx…in order to use it in a query, you’d need to convert the string form into the underlying GUID value, which causes more overhead. But, a GUID is more ‘distributed network’ friendly in that any peer that generates a record of data can generate their own GUIDs and not have a real risk of collision with another datasource.

I’d be more supportive of a GUID as a key value than a ‘String’, so maybe if you change the title of the post to be specific about GUID, you might get more support

But I am having trouble finding information about ‘native’ support of GUID/UUID in R. Are you sure it’s natively supported?

-Chris

Mark · November 25, 2019, 2:05pm

I agree with what you said about strings as keys but GUID’s are slower to match than BIGINT as well. Requesting a BIGINT key should work fine, even in a distributed system. Now that SQL Server has Sequences like everyone else, I see no reason to ever go back to the old days of using GUIDs on a ACID compliant service.

It is as easy as

SELECT (NEXT VALUE FOR distributed_sever.DBO.some_seq);

The syntax for Postgres is very similiar.

I am assuming that everyone is running their instance of OHDSI on an ACID system.

parisni · November 25, 2019, 3:31pm

Forget about string. Let´s talk about GUID. And yes @Chris_Knoll I was thinking about this R limitation

That´s would be true if OMOP was an OLTP database. AFAIK it´s more OLAP and transaction are not that useful with OMOP.
This is the point, we use distributed datawarehouse where long sequence are not trivial to handle. I am thinking GUID is an interesting feature for such dataset, in particular to UNION two omop dataset from two institutions and no collisions.
In case of OLAP distributed databases, the question of performances and size does not apply anymore.

Chris_Knoll · November 25, 2019, 7:11pm

If I were to merge two datasources together (and I’m not suggesting that I would, there are specific reasons that you would not merge datasets together), I would simply create a patientID map as:

CREATE TABLE patient_lookup
AS
select row_number() over (order by person_id) as person_id, source_pid, source_id FROM (
select DISTINCT person_id, 1 as source_id from source_1.person
UNION ALL
select DISTINCT person_id, 2 as source_id from source_2.person
) D

And when you merge the patient-level data, you just join to your patient_lookup table to resolve the source-specific IDs to the person_id that you use in your domain tables. You can store the source’s person_id in the Person.source_value column.

person_ids, are the only real FK that you need to deal with in the CDM. The other FK’s are conceptIDs, which will join to a consistent Vocabulary (you do have consistent vocabulary between your different CDM sources, right?) Other FKs you many need are provider_ID, but you’d solve that issue the same way you manged unifying your patient_ids.

I’m affraid I have to disagree with this: performance and size always applies, even if you have massive parallel capacity.

parisni · November 25, 2019, 8:35pm

Sure there is workaround. But everey PK / FK will have collision (exept concept_id) condition_occurrence_id, visit_occurrence_id, visit_detail_id and so on.

this query cannot be distributed and as a result runs very poorly on (all?) MPP solutions.

In the general case, GUID would simplify the future. dealing with such large string is not as user friendly as a numeric sequence I admit.

well MIMIC OMOP database is 500GB once indexed on postgres. The same dataset is 9GB as parquet files on a hadoop cluster. Not sure turning bigints to GUID or string affects significantly the user experience in this case.
BTW using GUID for attribute keys simplifies a lot the ingeneering aspects.

Mark · November 25, 2019, 8:54pm

Performance may be inconsequential to your specific needs, but those of us that will have OMOP on a dedicated database that is fire-walled away from all production data, performance becomes much more important.

GUIDs are harder to work with in SQL as well( readability, casting, PK and FK). Every modern database has a version of BIGINT and pretty much all the syntax is the same. In general, GUIDs are an all around bad idea for generic SQL unless there are no other options( and there almost always is).

The call to get a key from one database server to another is very cheap.