CONCEPT_ID (to NULL or not to NULL) [THEMIS WG3]

clairblacketer · March 8, 2018, 8:06pm

@Chris_Knoll I really think for our CDM purposes at Janssen most of what we have are unmapped values; we very rarely get any NULL values from our sources. As you know, many of our data assets are US based claims or EHR systems which are centered around billing and it is extremely unlikely that we will see a NULL value anywhere in a field that relates to how a provider bills for a claim. The only place we ever really see NULLs would be somewhere related to an actual billable field, like measurement units or things like provider specialty, etc.

One of the benefits of our current approach to using 0s for NULLs and unmapped values is that we don’t have to tell the builder what to look for when mapping to zero, we can just say “map everything you can and anything you can’t, map to 0”. If we were decided to map all NULL values from the source to a NULL concept_id, we would then have to parse through the source each time we received an update so we could accurately tell builder what is a NULL and what isn’t. Considering the infrequency with which NULLs are actually present, this seems like it would take more time than it would save us in the long run.

Chris_Knoll · March 8, 2018, 8:24pm

So I just checked TruvenCCAE (v697) for counts of rows that have 0 as value_as_concept, and here are the top 5:

value_source_value	count 
NULL			2347857288
No			12968308
YES			2898728
0 drinks		2584288
Very good		1607292

So we see here, NULLS outpace the next unmapped source value of ‘No’ by a factor of 200. What query were you using to determine that there were rarely null values from the source? Maybe I’m looking at the wrong DB.

But that’s the job of the ETL logic, no?

All I want out of this Themis discussion is that when I look at a row, I can confidently say ‘there was no value in this column from the source’. I do not want to have to dig into ETL rules about how many different ways you can represent the absence of values (ie: ‘’, ‘NULL’, ‘NA’, ‘N/A’, ‘Not Provided’, ‘NP’, etc. the list goes on, and if there is a case where I want to analyze a data source on their capture of a measurement value that should have been stored as a categorical value (mapped to a concept) I’d just like to look for those that are not null vs null so I can understand the capture vs having to figure out if the value was provided or not through interpreting the value_as_source. And my interpretation on one source will not follow through to another source (unless it’s standardized via a Themis statement of standardization).

clairblacketer · March 8, 2018, 8:25pm

@Chris_Knoll I was referring to billable fields. These are ICD9/ICD10 codes, HCPCS, CPT4, and NDC codes to name the most popular. I would expect there to be NULLs in value fields because many measurements that we receive do not have values associated.

Chris_Knoll · March 8, 2018, 8:30pm

Right! so billable fields makes sense to have not null because you expect them to have some required value.

Are you in agreement that columns such some of those found in the measurement table (value_as_concept_id) should allow nulls?

ericaVoss · March 8, 2018, 8:41pm

While I still believe VALUE_AS_CONCEPT_ID, UNIT_CONCEPT_ID, and OPERATOR_CONCEPT_ID should be NOT NULL - they are still open for debate on tomorrow’s THESMIS F2F so let’s table this for now.

I think Clair/I are mostly concerned with items like CONDITION_CONCEPT_ID, DRUG_CONCEPT_ID.

This entire conversation is being covered at THEMIS this week.

Chris_Knoll · March 8, 2018, 8:44pm

I apologize if I ever said that fields like CONDITION_CONCEPT_ID and DRUG_CONCEPT_ID should allow nulls. That was never my intent. My concern here is related to this aversion to using null columns, and how it could impact future decisions about the CDM structure. Looking forward to the debate tomorrow.

ericaVoss · April 24, 2018, 8:57pm

In an attempt to document what was stated at the THEMIS F2F here is where I think we ended up on this topic:

RULE
Foreign key into the Standardized Vocabularies (i.e. the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, condition_concept_id = 31967 contains reference value for SNOMED concept of 'Nausea'. Set the value to 0 'No matching concept' when a mapping to a standard concept cannot be found (except for MEASUREMENT/OBSERVATION.VALUE_AS_CONCEPT_ID, MEASUREMENT/OBSERVATION.UNIT_CONCEPT_ID, MEASUREMENT.OPERATOR_CONCEPT_ID which can be NULL if data does not contain a value).

ACTION
1) Submit recommendation to the CDM WG to make all CONCEPT_IDs NOT NULL columns (except for MEASUREMENT/OBSERVATION.VALUE_AS_CONCEPT_ID, MEASUREMENT/OBSERVATION.UNIT_CONCEPT_ID, MEASUREMENT.OPERATOR_CONCEPT_ID which can be NULL if data does not contain a value) . 2) Update language in the WIKI under CONCEPT_ID https://github.com/OHDSI/CommonDataModel/wiki/Data-Model-Conventions

ericaVoss · June 25, 2018, 4:25pm

This has quieted down, any last thoughts?!

MaximMoinat · October 17, 2018, 9:52am

Sorry for re-joining this late, but I was just looking at the new CDM v6 and noticed that the _source_concept_id fields are all made NOT NULL. I don’t want to argue about this rule, overall it is fine for us. However, I am wondering what the rationale behind this change was. I assumed this Themis issue was only about the ‘regular’ concept ids, like condition_concept_id. I can’t find any reference to the source concept ids in any CDM or Themis issues.

Was it this Themis topic that ‘made’ the source concept id required? If yes, what is the idea behind it?

Dymshyts · October 18, 2018, 11:45am

Good question.
If we have some custom concepts and make store their mappings in source_to_concept_map, we keep _source_concept_id NULL.
So _source_concept_id should be nullable.

ericaVoss · October 24, 2018, 6:29pm

Yes, this was one of the THEMIS rules discussed during the F2f

github.com/OHDSI/Themis

CONCEPT_ID (to NULL or not to NULL)

opened 09:00PM - 24 Apr 18 UTC

closed 06:24PM - 24 Oct 18 UTC

ericaVoss

Incorporated! Ratified! WG#3

TYPE | NOTES -- | -- ITEM | When should CONCEPT_ID be NULL and when should it …be 0? FORUM POST | http://forums.ohdsi.org/t/concept-id-to-null-or-not-to-null-themis-wg3/3965/46 SOLUTION | Foreign key into the Standardized Vocabularies (i.e. the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, condition_concept_id = 31967 contains reference value for SNOMED concept of 'Nausea'. Set the value to 0 'No matching concept' when a mapping to a standard concept cannot be found (except for MEASUREMENT/OBSERVATION.VALUE_AS_CONCEPT_ID, MEASUREMENT/OBSERVATION.UNIT_CONCEPT_ID, MEASUREMENT.OPERATOR_CONCEPT_ID which can be NULL if data does not contain a value). NEXT STEPS | 1) Submit recommendation to the CDM WG to make all CONCEPT_IDs NOT NULL columns (except for MEASUREMENT/OBSERVATION.VALUE_AS_CONCEPT_ID, MEASUREMENT/OBSERVATION.UNIT_CONCEPT_ID, MEASUREMENT.OPERATOR_CONCEPT_ID which can be NULL if data does not contain a value). <br/><br/> 2) Update language in the WIKI under CONCEPT_ID <br />https://github.com/OHDSI/CommonDataModel/wiki/Data-Model-Conventions

While folks definitely still disagree on this topic (as you can see from the above thread) the finally agreed upon consensus was CONCEPT_ID columns should be a CONCEPT_ID or 0 for everything except MEASUREMENT/OBSERVATION.VALUE_AS_CONCEPT_ID, MEASUREMENT/OBSERVATION.UNIT_CONCEPT_ID, MEASUREMENT.OPERATOR_CONCEPT_ID which can be NULL if data does not contain a value.