As per the CDM pdf the convention for CDM_SOURCE.vocabulary_version is 'The version of the vocabulary can be obtained from the vocabulary_name field in the VOCABULARY table for the record where vocabulary_id=‘None’.
But shouldn’t we be considering vocabulary_version field of VOCABULARY table for the record where vocabulary_id=‘None’ instead.
@CSC: Yeah, it is somewhat ambiguous and needs clarification. The instruction 3) tells the ETLer where to get the vocabulary_version from: The VOCABULARY table. But that table might have changed after the CDM was built from the data. Shouldn’t be the case, but theoretically possible. We need to lay down these rules better.
So, unless folks do funny things the content of CDM_SOURCE.vocabulary_version should be identical to VOCABULARY.vocabulary_version where vocabulary_id=‘None’.
Colorado does “funny things”. We update the Vocabulary tables after the CDM is built when we find data missing from the Vocabulary tables. Examples include missing concept_ids, missing relationships from source to standard concept_ids, etc.
Is there a use case to keep the CDM_SOURCE table for vocabulary_version? This is a hard-coded value and the CDM is very dynamic. And the data should always be accurate in VOCABULARY.vocabulary_version
Yes, that happens. I could search the forums and GitHub for the complaints, but I’m too lazy And these things get corrected quickly.
More importantly, concept_id creation lags behind the use of new concept_codes in our data. This is known and expected. So, we update our Vocabulary tables to enable research on standard concept_ids as the concept_codes are added to the CONCEPT table and the relationships from source concept_ids to standard concept_ids are added in the CONCEPT RELATIONSHIP table.
The most accurate version of the vocabulary will always be located in the VOCABULARY table. The CDM_SOURCE.vocabulary_version might also be accurate, but I wonder how many people update it when they update the Vocabularies.
Well, do you have an example from your data? Because the Forum etc. posts all resolved to people including or not including vocabularies when they download from Athena.
But you don’t re-run the ETL? That’s kind of dangerous. The Vocab and the CDM should be in sync.
That’s the problem if you store the same thing twice. It creates contradictions. Always does.