Concepts for CDM tables and fields - how should we design them?

Christian_Reich · October 21, 2018, 3:42am

Friends:

We need your help, especially folks involved in building tools and methods. I am tagging @Chris_Knoll, @anthonysena, @Frank, @gregk, @pavgra, @schuemie, but there are a whole lot more folks who should have an opinion:

With CDM V6 we have a proposal to fix one of the shortcomings and turn it into a full information model. This requires that it can do self-reference to any CDM structure dynamically in the data. In particular, we want to replace the rather unspecific Domains with concepts for all table and fields. As a consequence, we can link records between any tables together, like in FACT_RELATIONSHIP or in the upcoming EPISODE table. Episodes will be like Cohorts and Eras, but have Fact Relationship-type links to the events that belong to it. These links will work through the combination of a table_concept_id and and event_id (the foreign key in that table). Alternatively, it could be a field_concept_id and event_id.

But there is another use case we can cover: Tracking CDM versions for OHDSI tools. An application like ATLAS, or any of the methods, is built with a certain version in mind. When we have the METADATA table properly implemented it could check for the version number and refuse to start up in the wrong one. However, that may not be smart, since older and newer versions might be compatible enough for what the tool is trying to do. But how does the tool know when it can safely use the data?

Of course, the developers could now go through all the DDLs and figure out and test and test, but that is not realistic. Here is a potential solution:

Each table and each concept is represented by a concept as proposed. When the table or field definition changes a new concept gets made.
There is another concept for each CDM version. All table and field concepts that are active for that version have a record in CONCEPT_RELATIONSHIP linking them to the CDM concept (relationship_id = ‘belongs to’ or something like that). If a field concept doesn’t change between versions it will have a link to both.

So, when a tool gets released, it should capture all concepts for the tables and fields it touches. Let’s say person.person_id, person.year_of_birth, condition_occurrence.condition_concept_id and drug_exposure.drug_concept_id. To check whether it’s safe to run on a given CDM, all it has to do is to pull the concepts linked to the CDM version concept mentioned in METADATA and check whether those 4 are all in that list. If yes, the CDM is compatible and the tool can launch. If not, it’s not safe.

Does that make sense?

If yes, we need answers to the following questions:

What constitutes “change of a table” and “change of a field”? Is going from integer to big_int a change? Nullable to not nullable? Change of field name?
Do we even need a table concept? Aren’t we always referring to a certain field?
Which database flavor should we use? Or should we have table and field concepts for each flavor, and their life cycle could be different?

Chris_Knoll · October 21, 2018, 6:14am

Hi, @Christian_Reich,

I’m sorry to be a wet blanket, but what you describe sounds like being able to interrogate some metadata and determine if the CDM was constructed in a manner that would be suitable for a given tool.

There is already an invention to support this, it’s called ‘semantic versioning’: https://semver.org/

With semantic versioning, each version of the CDM has a set of known capabilities. Tools can be written to be ‘compatible’ with a range of versions (such as > 5.1 or 5.0 - 6.0). You can play with a semver range calculator here: https://semver.npmjs.com/

I do not think interrogating a table of metadata to derive if a CDM has the necessary support of fields and concepts (?!?) to support a tool/analysis is feasible. semantic versioning is exactly what solves this problem.

But in order for semantic versioning to work, you need to have very strict discipline about how you version each release. Ie, 5.2 should support all capabilities in 5.1 and 5.0, but 4.x range is not required. The massive change to the cost table between 5.0.0 and 5.0.1 would have violated semantic versioning.

You also might consider associating a release of the vocabulary with a specific version of the CDM, such that you can say that 5.3.1 is tied to jan1,2018 vocabulary, while 5.3.2 is tied to jul1,2018 (for example). Then the tools can not only check for the necessary cdm capabilities but also the necessary vocabulary support just by checking the version of the cdm against the version the tool supports.

I tried looking through the proposal link you provided, and I couldn’t quite divine the precise requirements that the proposal is attempting to satisfy, but if we’re talking about havign a single table with a domain_concept_id, field_concept_id, and a value, then I think we’re aligning to I2B2’s E-A-V (entity-attribute-value) schema: https://en.wikipedia.org/wiki/Entity–attribute–value_model

How…interesting!

gregk · October 21, 2018, 8:23pm

Hi Christian,

I read this post a few time and have to admit that I do not completely understand what shortcomings you are trying to address and what you are proposing. Based on what I was able to gather, there seems to be a couple of separate issues here:

CDM / Tools Versioning
OMOP Standardized Vocabularies
I completely agreed with Chris’ post on semantic versioning. And I thought that OHDSI does actually use it today already? We just need to better at following the rules - I still have high hopes that we, as a community, will better embrace them starting from CDM version 6 and ATLAS version 3.

Major releases version are made to be incompatible so that we can bring improvements and enhancements
Minor is to enhance but be backward compatible
Then patches, in case things need to be fixed. This one is a bit tricky when talking about a model but one example comes to my mind if INT vs BIGINT change

we do not need to create complicated tools to try to interrogate the (database specific !) implementation of the model to figure out a version - we just need to be good at capturing it in a consistent format in METADATA table as a set of properties. Something like

MAJOR_VERSION
MINOR_VERSION
PATCH_VERSION

In ATLAS WG, we already having a discussion that ATLAS needs to support backward compatible version and have a message if the featured chosen is not supported by the version of the CDM

OMOP Standardized Vocabularies (a.k.a. Ontology). Here are some thoughts on this:

This is called “reference data”
It should be versioned at the whole bundle level. A simple date stamp is a good enough versioning strategy - the model is not changing, no need for semantic versioning - even though we could easily introduce it here too
I actually think that having a DATABASE MODEL AGNOSTIC ONTOLOGY is a good thing - and not a bad thing - why do we need to have it “fixed”? Why do we need to bind our reference data to tables, never mind that tables are just database specific names for a higher concept that exist in conceptual model. Please remember, the types of information model are:
Conceptual (we should really do that)
Logical (OMOP CDM - somewhat)
Physical (DDL, Java classes)

If Chris is correct in his assessment that you might be leaning towards the EAV model - please do not go there, this is a such a common trap that have existed ever since database have been invented. The initial reaction that people have when they think of this design - “hey, great solution for versioning and it is so flexible.” And with this model, you can shove anything into a simple structure that does not ever change. But then you cannot really easily query it without inventing some sort of a “query framework” and deep knowledge of all implicit “tables” and “fields”. The data quality becomes an issue as constraints are not easily enforceable too.

How…interesting!

to rephrase, "How…hhmmm…“interesting”

But before I/we jump to conclusions - could you please provide some actual examples of how things / data is going to look like? Similar to Chris, I could not easily follow the proposals you referred to.

Christian_Reich · October 21, 2018, 10:10pm

Friends:

Argh. I want to remove the whole thing and start from scratch. I don’t think I made the idea clear. Let me start with Chris “the wet blanket” @Chris_Knoll and put him in the dryer:

Semantic versioning is solving a different problem THAT WE ALSO HAVE: To have a clear system of which CDM version is backwards compatible with the previous ones. The problem is that most of our versions are not backwards compatible. They break things. So, by that logic, we should be on version 42, as the website jokes. Of course we could be much better organized and pool incompatible versions together. And we promise we will get better.

Now that we know we have that mess, different parts of the CDM move quicker than others. For example, the CONDITION_OCCURRENCE table hasn’t changed much in a while. And the condition_id + condition_concept_id + condition_type_concept_id fields have been stable since V4. So, if all your tool touches are these three fields you are a lot more backward compatible than the Version number suggests.

And that’s the nature of the proposal. The vocabulary team wants to help with this. Reason is we will have concepts for each component of the CDM anyway, and we have a mechanism for linking them together and we have a mechanism for the life cycle. So, if a field in a table stays the same it keeps the concept_id referring to it. If it changes the concept gets an valid_end_date and a new concept gets made for the new changed field.

So: Instead of having one CDM version, condensing all changes into one number, and upping it even for the tiniest spell fix, we follow the various components separately. Well, we do both, really, the overall version and the detailed follow-up.

Does that make sense?

Christian_Reich · October 21, 2018, 10:36pm

@gregk

Good stuff. Let me go one by one:

It is the former. As I said before. Most proposals are not backwards compatible. In fact, I am the one trying to hold things together somehow, as for example in this debate, where, incidentally, you sided with the “incompatible” camp.

So, we have three choices:

We pool better
We create one major version a month
We violate the rules

so far we have done mostly 3). If we want to do 1) that means that some proposals will have to wait. Folks don’t like that. 2) is probably a bad idea, makes us look out of control, and we will lose trust of the community.

100% agree. And having concepts for all artifacts will help making this a non-RDBS-style ontology. @rimma has been pushing this for the longest time. But again, different subject.

Actually, here is where I don’t get what you guys are saying. Where is here an EAV model?

Chris_Knoll · October 22, 2018, 12:09am

The proposal you pointed to in your origional post described the need for each table and each column in the CDM to have a concept ID. It sounds like that’s what you’d need in an EAV model. You’d have 1 table:

table_id : bigint
column_id :bigint
value: varchar

And to store a condition occurrence record you’d have something like:
“condition_occurrence_concept_id”, “condition_concept_id_concept”, “Acute MI ConceptID”
“condition_occurrence_concept_id”, “conditionstart_date_concept_id”, “1/1/2012”

Etc etc, one row per column. It would take an EAV model 12 rows of data or so to represent one record in the condition_occurrence table. Don’t blame me, man, that’s exactly what it sounded like you were describing.

While this is true, the tool development can’t produce tools as fast as ideas get voted on.

What I recommend is that the CDM have three parts that have a SemVer attached to it: core, extension and vocabulary. New features that want to get introduced at a fast pace can be introduced under the ‘extensions’ part of the CDM specification and versioned independently from core. Things that are stable and not changing can be put in ‘core’. You can move things from extensions to core via a major version update to core, and then remove the feature from extensions via a major version update to extension. Tools can decide for themselves what version of ‘core’ and ‘extensions’ they are compatible with.

We could also make certain functions dependent on a specific version of the vocabulary, but in that case vocabulary will need to start paying attention to semver rules with each release.

gregk · October 22, 2018, 1:21am

The problem is that most of our versions are not backwards compatible

I do not think it is necessarily completely accurate. Let’s agree on a definition of backward compatibility first. It basically means that someone does not change or delete any of the existing interfaces but new interfaces can be added. Thus new model is basically an old one plus more. We did a “bad” job between 5.0 and 5.1 but we did a pretty decent job between 5.1 and 5.3. We should be very strict about it going forward from version 6.

I really like these, and propose we could merge them together thought a very simple concept - a maturity level. We used to use this quite successfully in one of my previous organizations (actually, very relevant - it was focused on data and integration) - we defined the concept of the “maturity” that we tagged every interface/object with:

Mature - this is a very mature object that is rarely changing - if at all. MUST apply semantic versioning, MUST be backward compatible. Do not touch it.
Emerging - this is an object that proved to be useful, but it is still evolving - expect some changes. Subject to a discussion. This phase lasts for up to one Major release after which it must become Mature.
POC - this is a great idea worth exploring, definitely not a production grade but the requirements will come out of it. It will either become “Emerging” or “Mature” or goes into a trash bin. This is decided at each Major release milestone.

Merging these with Chris’ idea of Core and Extension - I think it would work great. Basically, anything that is “Mature” - must go into Core, “Emerging” and “POC” - into Extension. I am not sure about the “Vocabulary” thought - I do not treat it the same as OMOP CDM, it is a reference data set. But the actually concept definitions - or even whole domains - could be applied the same maturity categories.

And we still need to be better at applying and enforcing versioning and not to be afraid to break the compatibility at each major release to optimize and improve things - this is what it was created for.

I am not sure how I sided here with the "incompatible camp. We SHOULD BE using a major release to break a compatibility to optimize things - not shy away from that. If that would be a final, reviewed, released, adopted and actually used version - I would never advocate for a change like this. I would propose that going forward, we treat every release with more formality - I actually do not recall at which point it became an official release - but please forgive me if I missed it.

gregk · October 22, 2018, 1:37am

I also would like to refer to this again. Both OMOP CDM and Vocabs are derived from the Conceptual model - maybe just instinctively at this point - and they are already linked through that. There are certain definitions in that conceptual model (aka “concepts” - not to confuse it to the way we used this word in vocabs) that are mature, and some that are evolving. We should stop thinking in terms of tables and start thinking in terms of higher level objects - person, condition, procedure, drug exposure, visits, death, provider (HCP), care_site (HCA), cost etc. and label them as either “mature”, “emerging” or “poc”

Christian_Reich · October 22, 2018, 8:18am

OMG. Hell no!!! The proposal is to have in the vocabulary concepts representing the tables and fields. That is all! The actual CDM stays exactly the same.

The only place these concepts would be used is to do the dynamic referencing in places like FACT_RELATIONSHIP. Instead of using domain_id=‘Drug’, and then somehow “knowing” that this means DRUG_EXPOSURE (or maybe DRUG_ERA) we use the concepts to refer unambiguously.

And then the idea came up to once we have those concepts we could use their life cycle for compatibility management. As a fix for the otherwise sub-optimal versioning.

Chris. I agree this is a good idea in principle. But before we toss out new ideas can we comment on the idea actually proposed? Because it is a natural extension to the creation of concepts for each table or field we are doing for the referencing, not the version problem.

Very hard. 90% of the vocabulary we don’t control. And it doesn’t affect functionality, because they constitute a reference table to the normalized content, and not administrative tasks (like domains etc.). I would not count on something like that anytime soon.

I wish I could agree. But we did a bad job all along. Non-compatible changes between 5.1 and 5.3 are:

#64 This removes the datetime fields from OBSERVATION_PERIOD
#92 Fixes modifier typo in PROCEDURE_OCCURRENCE
#73 Removes EFFECTIVE_DRUG_DOSE and DOSE_UNIT_CONCEPT_ID from DRUG_EXPOSURE
All bug fixes

Look guys: We proposed the concept-based model for compatibility management. It’s fine if you hate it or found a problem with it. But instead of critiquing you are tossing out new ideas that are totally different and make me comment on these new ideas of yours.

Here it goes: They look good in theory. In practice: I have no idea what constitutes a “core” or “mature” level. For example, we used to have a DEATH table. Looked pretty robust to me. Until one day @pavgra complained about the potential model conflicts if there is more than one record. A debate broke loose, and as a consequence we abolished (!) the table altogether.

The concept-based approach does not try to make this lose army of volunteers a tightly organized software development machine. Instead, it creates a pragmatic solution to a problem that results from the somewhat chaotic ways a community develops a standard.

schuemie · October 22, 2018, 8:26am

Just adding my view as one of the application developers:

From V4 to V5 was a big step. Changing COHORT_CONCEPT_ID to COHORT_DEFINITION_ID, and changing VOCABULARY_ID from an INT to a VARCHAR for example required adaptions throughout all the R packages.

From V5 to V6 as far as I can see is mostly that I’ll need to use the DATE_TIME fields everywhere I was using the DATE field. (But I still need to review the changes in full).

All other changes in between V4 and V6 have had no impact on any of my applications. So I agree with @Christian_Reich that even though a lot of the changes broke backwards compatibility, from the Methods Library point of view this did not matter.

Having some machine-readible representation of the changes would be nice, especially as the CDM evolution seems to pick up an incredible pace.I don’t think this machine-readable representation needs to be terribly complex. Some sort of versioning on individual fields would already do the trick.

Chris_Knoll · October 22, 2018, 4:42pm

Ok, fair point. Let’s dissect:

To answer this question, I said that the invention of ‘semantic versioning’ is designed to solve this problem. But with respect to the idea:

I call that ‘the job’.

How would I discover this? Take cohort definitions for example. it touches almost every table, almost every column in the cdm. I’m expected to look up each table and each field for a specific concept and somehow persist in some application configuration about the set of tables/fields that I depend on? Compared to saying ‘This works for CDM versions 5.0 - 5.7’? No thanks!

I have 2 questions:

Is the idea to abandon versioning of the CDM all together in favor of just updating the CDM ‘information model’ at any time via the introduction of new concepts?
Would you be able to go through the process of identifying all tables/fields being used in 2 or 3 of our studies defined in the GitHub - OHDSI/StudyProtocols: Repository of OHDSI Collaborative Research Protocols repository as a ‘dry run’ to just get an idea of the level of effort it would require to extract the CDM dependencies from some R code? I’m assuming ‘validation of compatibility’ would involve ensuring that all the table/field concepts that were extracted from the study code matches all the table/field concepts that were inserted into the cdm ‘meta’ or ‘fact_relationship’ table.

-Chris

gregk · October 22, 2018, 6:04pm

Dived a bit more into Christian’s proposal. Here is a potential use case I can see.

Today, when we are designing cohorts in ATLAS - there is a need to show a warning message that a certain feature only works on a certain OMOP CDM version. I can see a solution where we could wrap up the “CDM Dictionary Ontology” with a WebAPI service that would return all tables and fields used for for a very specific version that we can use during design time to warn users. Like here.

Or even hide features if a specific version is selected. So, here it is - please consider this.

From a big picture though, I believe in the following - there is a problem prevention and then there is a problem detection. The problem detection is what this proposal seems to be focused on. I believe in prevention and a good governance process around OMOP CDM model going forward - including to prevent incompatible changes between minor versions - is really what is needed at this point. I will do a new post on this topic not to clutter this proposal conversation.

Chris_Knoll · October 22, 2018, 7:08pm

More support for semantic versioning, look at this issue report with a version of the library, how they describe the impact (by saying versions < 2.8.11) and the patched version:
https://github.com/ohdsi/circe-be/network/alert/pom.xml/com.fasterxml.jackson.core:jackson-databind/open

@gregk:
There is an open issue in circe-be to apply a cdmTarget semver expression allowing an author to indicate what target CDM the cohort definition is expected. The idea here was to abort cohort generation if the desired CDM target for the definition was in the range of the actual CDM the generation was going to apply to. However, all that goes out the window if we go to ‘versionless CDMs’.

gregk · October 22, 2018, 7:17pm

Great, Chris - this is the one I was looking for! I think Christian can and is proposing to link all tables/fields to a concept for specific CDM version. If that is true, it should work for this issue. @Christian_Reich - could you please comment on this?

gregk · October 22, 2018, 7:43pm

answering questions from the original post:

Any change to an existing interface (CDM=interface) is a change. The question is if it is backward compatible or not.

BIGINT can continue to support INT. It is a backward compatible patch
Nullable vs. Nullable. Depends which which way. If someone makes nullable field non-null - it is not backward compatible
Field name change - most definitely a incompatible change.

I am bit unclear on how you are planning to track all those changes if all we create the dictionary for is table and field names? Or, are you planning to to also define and store all metadata, including data types and constraints? Sounds like it is quickly becoming quite complicated since at this point it might have the database specific information now (see my other comment below)

If you are creating the CDM Dictionary, we should definitely do both tables and fields.

I am not sure I follow you here. I thought we are working at the logical rather than physical layer here? The names of fields in CDM would be the same for each database? It is an interface, after all.

Hope we are not discussing to change our CDM versioning here. The lifecycle should stay as it exist today (and improve) - I am with Chris here completely. We should continue to version our CDM model as a whole - not the individual fields. Any change in CDM should result in a version change for the whole model - it is a single component. Then for that version, for example 5.3.2, you could have a new metadata dictionary.

Hope this helps.

Christian_Reich · October 22, 2018, 9:48pm

Well, they are not doing the job, Chris.

Assume your application uses 2 fields, and they have the concept_ids 123 and 456

SELECT count(*) 
FROM concept_relationship 
JOIN cdm_source on concept_id_1=cdm_version_concept_id -- new field, or lookup using existing cdm_version
WHERE relationship_id='contained_in_version' 
AND concept_id_2 in (123, 456)

If this query does not return 2 then the current version does not support the same fields 123 and 456 anymore. Otherwise it does.

Well, that requires that the developers each time a new version is out will check whether or not a given release still doesn’t break anything. Yes, you could make them do it. Currently, none of them do, and neither do the developers of ATLAS. The tool just crashes unceremoniously. But if they want to do it - they still can. Nothing is broken!!

Not sure why the resistance against this very simple mechanism that makes life of the developers easier.

No such idea. Just an additional vocabulary structure that tells you which fields are in which version. You don’t have to do anything, we will do that job.

Easy. At the end of the protocol script development, you take all the SQL code you are using. You create a list of all strings that are following “FROM” and “JOIN” and say in the current database you are using to develop:

SELECT concept_id_2 
FROM concept_relationship 
JOIN concept on concept_id=concept_id_2
WHERE concept_id_1=<concept_id of the current version>
AND concept_name in (<the list of field names>)

That’s what I was looking for. Your comments make total sense to me.

For each release, the vocab team would do the diff and create new concepts if tables or fields changed. This can be automated.

No. That’s still in the DDL

Correct. My assumption as well.

Nope. Stays as is. And if you don’t want to do this field chasing, which I thought would make your life easier, you can still continue wihtout any change. So, this proposal is 100% backwards compatible!

Please do. We will discuss the feasibility of Semantic Versioning.

Thanks everybody so far.

Christian_Reich · October 22, 2018, 11:06pm

Look at what @Ajit_Londhe is doing here!!! The Lord sent a hint, guys! He is diffing manually. Would not be necessary. The concepts would tell him what changed.

Ajit_Londhe · October 22, 2018, 11:22pm

Pfft now you tell me, all that manual SQL review

FWIW: in the upcoming metadata proposal (now set for the December CDM WG call), @Vojtech_Huser and I have discussed adding standard concepts for CDM table and fields, where the fields are descendants of specific CDM version releases. Is that where you’re heading as well?

Christian_Reich · October 23, 2018, 12:54am

@Ajit_Londhe:

That’s exactly that. I forgot to tag you and @Vojtech_Huser. I was askinng what constitues “same field”, see all the way on top. But when I put it out I got my hair washed by Messrs. Knoll and Klebanov here.

Chris_Knoll · October 23, 2018, 1:17am

Excellent! I’m happy to see you have it all figured out.