CONCEPT_ID (to NULL or not to NULL) [THEMIS WG3]

ericaVoss · February 5, 2018, 6:24pm

One of the items that came out of the OHDSI Symposium Themis Working Group meeting was how to handle flavors of NULL. While not an earth shattering topic, we would like to be more clear as a community on how we handle NULLs.

Our group has prepared a statement specifically about NULLs in CONCEPT_ID columns and are putting it out for review on the forum. We would like to propose adding this statement below to the CDM wiki here in the _CONCEPT_ID cell.

Chris_Knoll · February 5, 2018, 6:30pm

Hi, @ericaVoss,
From a strictly data perspective, NULL represents the ‘absence of value’. For example, in the Measurement table, we have ‘VALUE_AS_CONCEPT’, which I assume based on the absence of mapping on those rows where the VALUE_AS_NUMBER is recorded, the VALUE_AS_CONCEPT would be NULL (ie: a mapping wasn’t attempted). Is this perspective in line with the statement made about the Data Model Conventions with respect to CONCEPT_ID 0?

ericaVoss · February 5, 2018, 6:44pm

0 is the Vocabularies representation of absence of a mapping as it stands for “No matching concept”. If you get garbage or a blank it should map to 0 in a CONCEPT_ID column. Also, NULL can be a devil and including it in a column heavily used for analysis is probably not a good idea.

THEMIS is trying to assert certain CDM data model conventions and this is part of that work. @Chris_Knoll, could you point me specifically do Data Model Convention you think this flies against? I’m not sure where you are looking.

Chris_Knoll · February 5, 2018, 7:03pm

Hi, @ericaVoss,
I was just seeking a clarification about using CONCEPT_ID = 0 in the column 'VALUE_AS_CONCEPT" in the MEASUREMENT table. If the MEASUREMENT row only has a value as a number (Ie: they recorded a heart rate), would you would leave VALUE_AS_CONCEPT as NULL in this case? It would allow us to find those measurements that are recorded as numeric values as 'select count(*) from MEASUREMENT where VALUE_AS_CONCEPT IS NULL.

-Chris

ericaVoss · February 5, 2018, 7:14pm

There is a MEASUREMENT.VALUE_AS_CONCEPT_ID column, if the raw source gives us categories (e.g. HIG, POS, NEG) we will map those values to CONCEPT_IDs. For example ‘HIG’ or ‘H’ will map to 4328749. Else it is mapped to 0. In Janssen CDMs no _CONCEPT_ID columns should have a value of NULL if they do that is a bug in BUILDER.

DTorok · February 5, 2018, 7:29pm

Very formal language. Why not say that fields with the suffix
‘_concept_id’ are mandatory. Set the value to 0 ‘No matching concept’ when
a mapping to a Standard concept cannot be found.

However, the flavor of NULLs is a different topic. Taking from PCORnet
definitions.

Missing or Unknown Data Values
The PCORnet CDM will use the HL7 conventions of “Null Flavors” (
http://hl7.org/implement/standards/fhir/v3/NullFlavor/) as a basis for
representing missing or unknown values. Specifically,
for fields where an enumeration is present (i.e., a categorical set of
values), we will populate null values as follows:

A data field is not present in the source system. (populate with null)
A data field for an enumeration is present in the source system, but the
source value is null or blank. (populate with NI=No Information)
A data field for an enumeration is present in the source system, but the
source value explicitly denotes an unknown value. (populate with UN=Unknown)
A data field for an enumeration is present in the source system, but the
source value cannot be mapped to the CDM. (populate with OT=Other)

Which would require 3 or 4 new concepts in addition to using 0 to define
‘Unknown values’.

Not advocating for a change just providing information.

Chris_Knoll · February 5, 2018, 9:30pm

Hey, @ericaVoss, Thanks for talking this over with me offline.

My misunderstanding was how we see measurement values come across in the real world vs. a theoretical situation where a measurement of a particular type always being recorded as a number (not categorical) and thus there would be no value to map over to a categorical concept (stored in ‘value_as_concept’) so I was arguing the use of NULL in that case. But, from an analytical perspective, storing value_as_concept_id = 0 is no different from storing a null because there’s not a analytical case where you’d want to process unknown/missing values.

I agree that we don’t want to get hung up in the various flavors of ‘null’. When we see a 0 in the value_as_concept_id, we’ll be able to refer to the ‘value_source_value’ to understand what the original value was from the source. And to make this the convention for all *_CONCEPT fields, feels awkward, but I can’t argue that there is a critical problem with that approach that will cause system failure.

I’m drawing the line, tho when we make 000-00-00 the ‘unknown or missing date’ value!

-Chris

Gowtham_Rao · February 6, 2018, 3:03am

Agree, *concept_id should be not null, with 0 as the default value. Null is the

Another reason - I think in MPP systems, null fields degrades performance. And, many (including me), have fallen to the null related problems in SQL.

Chris_Knoll · February 6, 2018, 3:59am

Ok, now you are pushing my buttons

Aside from the discussion on the CONCEPT_ID fields, I just want to address some of the statements around NULLs:

##Nulls are wierd
If you consider that null is ‘the absence of value’, then all the weird results make complete sense.

If I ask you True or False: A and B, A is TRUE.
You say: “What’s the value of B?”
Me: “B has no value. It is neither true nor false”
You: “Then how can that expression be TRUE or FALSE if I don’t have the value of B?”
Me: "Now you’re thinking like a database! There is no value of A and B when B is null! The answer is NULL, and therefore, not TRUE, which is why those kind of records are dropped from the results.

Same thing as NULL == NULL is not true: you can’t compare a value of something that has no value. That’s why we have ‘IS NULL’.

All databases, except for Oracle(as usual), would logically resolve ‘Big ’ + null + "Bird’ as null. Same example as above: Concatenating 3 terms: ‘Big ‘, a non-value, and ‘Bird’ can not be done because you can’t combine a value with a non-value to get something. Oracle (sigh) does an implicit coalesce(x,’’) on those terms so your null becomes a ‘’, but that does not mean that null is an empty string!

So, On point 1: NULLs are not weird. they have a specific meaning that, in my opinion, should be fully understood and embraced if you are designing a database model (such as the CDM).

Joins on NULL

When you perform a join on a column with NULL values in it, they drop out when you join to another column. A non-NULL value will keep all the records.

So, here is my counter example why this is desirable: if you have 2 relationships REL_A and REL_B which have an optional relationship to a single concept_id, and you want to know which from REL_A share the concept_id from REL_B, you’d do this:

FROM REL_A a
JOIN REL_B b on a.class_concept_id = b.class_concept_id

Those things in A and B which do not have a class_concept_id won’t be brought back in the join, because, from point 1, NULL, a non-value, never matches anything (including another null)

However, if i put ‘0’ for the optional relationship to mean ‘there’s no relationship’ then I have to change my query:

FROM REL_A a
JOIN REL_B b on a.class_concept_id = b.class_concept_id
WHERE a.class_concept_id != 0

So you’ll find yourself doing ‘where concept_id != 0’ all over the place where you didn’t want the non-value’d rows to be returned.

I have not heard this, and before we dissuade the use of nulls, we should have some empirical data to back that claim up. Left Joins are potential performance issues on MPP systems (due to that all nodes needs to be scanned for the complete set of records), but the presence of a null field…I haven’t heard any issues with that.

Ok, so that’s a lot of spam, and in the end what @ericaVoss described with concept_ids makes sense, I just want to make sure that ‘null-aversion’ doesn’t cloud anyone’s judgement especially with respect to ‘optional fields’. If it’s optional, let it be null!

ericaVoss · February 6, 2018, 4:18am

I like to think that SOURCE_VALUES columns store the verbatim source information; if the raw data had NULL this is where you would see the NULL. The THEMIS group is suggesting CONCEPT_IDs should be NOT NULL as 0 is our representation of “No matching concept”.

@Chris_Knoll I think you are in agreement now. However would you like to word the statement slightly differently to be more clear?

Chris_Knoll · February 6, 2018, 5:01am

I’d drop this. The ONLY difference is how oracle implicitly coalesce’s NULLS to ‘’ for string concatenation. The tone of the statement ‘NULLs shoudl not be used…’ I think would lend people to make all ‘no value’ or ‘optional’ fields into special numeric values, which is bad. NULLs, for purposes of boolean operations and joins, behaves consistently across platforms. You should be using them where appropriate.

So yes, we are in agreement specifically for concept_ids. I just want to make it clear it is not my position that ‘optional’ fields should be coded with a special ‘empty’ value in other cases.

aostropolets · February 6, 2018, 11:35am

I’d like to jump in with a slightly unrelated topic:
we do have situations where some codes get deprecated over time. Then you’ll get _concept_id that is no longer standard. What will you do in the next ETL iteration? Will your ETL-scripts automatically convert them into 0 or will you go and ask for the new mapping? Erica is great in doing the latter , and it’s obviously the right way. Still wondering how other people do this though.
BTW,
‘If you get garbage or a blank it should map to 0 in a CONCEPT_ID column.’
Aren’t you, @ericaVoss removing nonsense records? I came across a dataset where the overall amount of such was pretty high so we had to partially remove them. But since we’re adding surveys and other things that may have never existed in the vocabularies before I’m thinking of creating a logic or a heuristic to throw out the garbage and save records that may be useful in the future.

ericaVoss · February 6, 2018, 1:26pm

@aostropolets, if there is consistent garbage we do trash it. For example, one of our data vendors if a procedure column is blank it is filled with ‘000’ or ‘0000’ - we used to bring it across but it was a ton of non-value records. We also see consistent incorrect NDCs like ‘00000000000’ that we trash as well. Otherwise it comes into the CDM. Most data comes across unless we are sure it is nonsense.

MPhilofsky · February 6, 2018, 6:27pm

I agree with @DTorok

Except “mapping to a Standard concept cannot be found” should be changed to “mapping to an OMOP supported concept cannot be found”.

Sometimes old data needs to be mapped to deprecated concepts. Example: a person took Drug X in 2016. Drug X was taken off the market in 2017. RxNorm deprecated the code for Drug X when it was taken off the market. Then OMOP updated their vocabularies and changed the designation of Drug X from a standard concept to a deprecated concept. Person still took Drug X and we want to keep the Drug_Exposure data.

Christian_Reich · February 6, 2018, 6:43pm

That’s not how it works. It stays in there. RxNorm deprecates the concept only if it has a problem. The market has nothing to do with it. So, no non-Standard Concepts in the tables please. NDCs etc. will be remapped to the new correct Standard Concept.

But we do have that problem where source Concepts are also Standard, e.g. CPT4s. Here, they do deprecate it, and then suddenly you have nowhere to go. @Dymshyts is presenting the solution at the next CDM WG. Please come.

ericaVoss · February 6, 2018, 6:46pm

Maybe this THEMIS recommendation has two parts:

Submit recommendation to the CDM WG to make all CONCEPT_IDs NOT NULL columns.
update language in the WIKI under CONCEPT_ID

Foreign key into the Standardized Vocabularies (i.e. the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, condition_concept_id = 31967 contains reference value for SNOMED concept of 'Nausea'. Set the value to 0 'No matching concept' when a mapping to a standard concept cannot be found.

mvanzandt · February 7, 2018, 12:10am

@aostropolets We bring in plenty of garbage source values because there is just too much to filter through. Like @ericaVoss says, if they are obvious, then we may remove them, but for many things like measurement values where I honestly can’t map them one by one because it would take a lifetime, we bring them into the source value.

I agree with this THEMIS recommendation. We try to follow this for most of our ETL conversions.

MPhilofsky · February 7, 2018, 7:04pm

@Christian_Reich,

Thanks for keeping me honest My fact checker was sleeping on the job and my explanation lacked very important details.

When RxNorm retires a concept, OMOP/RxNorm/powers that be change the standard_concept designation to null and the invalid_reason = U. I still have the retired RxNorm code for Imipramine 10mg enteric coated tablet with RXCUI and concept_code = 310979 in my data. I map to this concept in the drug_source_concept_id field. The drug_concept_id field has drug_concept_id = 778358, Imipramine Hydrochloride 10mg delayed release oral tablet because this is the standard concept for my source RxNorm code.

Christian_Reich · February 7, 2018, 8:04pm

@MPhilofsky:

Perfect.

ericaVoss · February 9, 2018, 8:36pm

I’m poking the sleeping bear . . .

It sounds like people are comfortable with a 2 part recommendation here:

Submit recommendation to the CDM WG to make all CONCEPT_IDs NOT NULL columns.
update language in the WIKI under CONCEPT_ID

Foreign key into the Standardized Vocabularies (i.e. the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, condition_concept_id = 31967 contains reference value for SNOMED concept of 'Nausea'. Set the value to 0 'No matching concept' when a mapping to a standard concept cannot be found.

Anyone have more feedback . . . .