Is CDM_SOURCE convention for vocabulary_version Incorrect?

CSC · March 29, 2019, 4:22pm

As per the CDM PDF, the convention for CDM_SOURCE.vocabulary_version is: “The version of the vocabulary can be obtained from the vocabulary_name field in the VOCABULARY table for the record where vocabulary_id=‘None’.”
But shouldn’t we be using the vocabulary_version field of the VOCABULARY table for the record where vocabulary_id=‘None’ instead?

Here’s the Git link:

Thanks in advance,
Chetan

Christian_Reich · March 30, 2019, 2:04pm

If that’s what’s written you are right. But where did you find that “PDF”?

CSC · April 1, 2019, 6:28am

Thanks @Christian_Reich.
You can find the PDF document for CDM v6.0 here. Please refer to pages 27-28 for this.

For CDM v5.3 (the version we are using in our project), you can find it here. Pages 33-34 cover CDM_SOURCE.

Regards,
Chetan

Christian_Reich · April 7, 2019, 2:38pm

@CSC: Yeah, it is somewhat ambiguous and needs clarification. Instruction 3) tells the ETLer where to get the vocabulary_version from: the VOCABULARY table. But that table might have changed after the CDM was built from the data. That shouldn’t be the case, but it is theoretically possible. We need to lay down these rules better.

So, unless folks do funny things, the content of CDM_SOURCE.vocabulary_version should be identical to VOCABULARY.vocabulary_version where vocabulary_id=‘None’.
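
As a rough illustration of that rule, here is a minimal sketch that compares the two values. It uses Python's built-in sqlite3 with an in-memory database so it runs anywhere; the table definitions are cut down to the columns discussed here, and the version strings and source name are made-up sample values, not real release identifiers. The function names are hypothetical.

```python
import sqlite3


def vocabulary_versions(conn):
    """Return (CDM_SOURCE.vocabulary_version, VOCABULARY.vocabulary_version for 'None')."""
    cdm_row = conn.execute(
        "SELECT vocabulary_version FROM cdm_source"
    ).fetchone()
    vocab_row = conn.execute(
        "SELECT vocabulary_version FROM vocabulary WHERE vocabulary_id = 'None'"
    ).fetchone()
    return (
        cdm_row[0] if cdm_row else None,
        vocab_row[0] if vocab_row else None,
    )


def versions_in_sync(conn):
    """True only when both values exist and are identical."""
    cdm_version, vocab_version = vocabulary_versions(conn)
    return cdm_version is not None and cdm_version == vocab_version


# Simplified stand-ins for the real CDM tables (assumption: only the
# columns needed for this check; sample values are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vocabulary (
        vocabulary_id      TEXT PRIMARY KEY,
        vocabulary_name    TEXT,
        vocabulary_version TEXT
    );
    CREATE TABLE cdm_source (
        cdm_source_name    TEXT,
        vocabulary_version TEXT
    );
    INSERT INTO vocabulary VALUES ('None', 'OMOP Standardized Vocabularies', 'v5.0 01-JAN-19');
    INSERT INTO cdm_source VALUES ('Demo source', 'v5.0 01-JAN-19');
    """
)

print(versions_in_sync(conn))
```

Against a real CDM the same two SELECT statements should work on most databases, possibly with a schema prefix on the table names.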

MPhilofsky · April 7, 2019, 3:29pm

Colorado does “funny things”. We update the Vocabulary tables after the CDM is built when we find data missing from the Vocabulary tables. Examples include missing concept_ids, missing relationships from source to standard concept_ids, etc.

Is there a use case for keeping vocabulary_version in the CDM_SOURCE table? It is a hard-coded value, and the CDM is very dynamic. The data in VOCABULARY.vocabulary_version should always be accurate.

Christian_Reich · April 7, 2019, 5:08pm

Missing? You mean concepts disappear?

MPhilofsky · April 8, 2019, 11:08pm

Yes, that happens. I could search the forums and GitHub for the complaints, but I’m too lazy. And these things get corrected quickly.

More importantly, concept_id creation lags behind the use of new concept_codes in our data. This is known and expected. So, we update our Vocabulary tables to enable research on standard concept_ids as the concept_codes are added to the CONCEPT table and the relationships from source concept_ids to standard concept_ids are added in the CONCEPT_RELATIONSHIP table.

The most accurate version of the vocabulary will always be located in the VOCABULARY table. The CDM_SOURCE.vocabulary_version might also be accurate, but I wonder how many people update it when they update the Vocabularies.
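
For sites that refresh the Vocabulary tables in place, one way to keep the stored copy from going stale might look like the sketch below. It is only an illustration using Python's built-in sqlite3 and simplified tables; the function name, source name, and version strings are made up, and it assumes CDM_SOURCE should simply mirror the ‘None’ record of VOCABULARY.

```python
import sqlite3


def refresh_cdm_source_version(conn):
    """Copy VOCABULARY.vocabulary_version ('None' record) into CDM_SOURCE.

    Returns the value stored in CDM_SOURCE after the update.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            UPDATE cdm_source
               SET vocabulary_version = (
                   SELECT vocabulary_version
                     FROM vocabulary
                    WHERE vocabulary_id = 'None'
               )
            """
        )
    row = conn.execute("SELECT vocabulary_version FROM cdm_source").fetchone()
    return row[0] if row else None


# Simplified stand-ins for the real tables; CDM_SOURCE starts out stale.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vocabulary (
        vocabulary_id      TEXT PRIMARY KEY,
        vocabulary_name    TEXT,
        vocabulary_version TEXT
    );
    CREATE TABLE cdm_source (
        cdm_source_name    TEXT,
        vocabulary_version TEXT
    );
    INSERT INTO vocabulary VALUES ('None', 'OMOP Standardized Vocabularies', 'v5.0 01-MAR-19');
    INSERT INTO cdm_source VALUES ('Demo source', 'v5.0 01-JAN-19');
    """
)

refreshed = refresh_cdm_source_version(conn)
print(refreshed)
```

Running something like this as the last step of every vocabulary refresh would keep the two fields identical, though it says nothing about whether the CDM data itself was rebuilt against the new vocabulary.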

Christian_Reich · April 9, 2019, 5:30am

Well, do you have an example from your data? Because the posts on the Forum etc. all came down to people including or not including vocabularies when they downloaded from Athena.

But you don’t re-run the ETL? That’s kind of dangerous. The Vocab and the CDM should be in sync.

That’s the problem if you store the same thing twice. It creates contradictions. Always does.

MPhilofsky · April 11, 2019, 10:41pm

@Christian_Reich,

You’ve completely hijacked this post

We do, and if you want to know more about our ETL, you’ll have to come to the upcoming EHR WG meeting where we discuss it!