One of the items that came out of the OHDSI Symposium Themis Working Group meeting was how to handle flavors of NULL. While not an earth shattering topic, we would like to be more clear as a community on how we handle NULLs.
Our group has prepared a statement specifically about NULLs in CONCEPT_ID columns and are putting it out for review on the forum. We would like to propose adding this statement below to the CDM wiki here in the _CONCEPT_ID cell.
Hi, @ericaVoss,
From a strictly data perspective, NULL represents the âabsence of valueâ. For example, in the Measurement table, we have âVALUE_AS_CONCEPTâ, which I assume based on the absence of mapping on those rows where the VALUE_AS_NUMBER is recorded, the VALUE_AS_CONCEPT would be NULL (ie: a mapping wasnât attempted). Is this perspective in line with the statement made about the Data Model Conventions with respect to CONCEPT_ID 0?
0 is the Vocabularies representation of absence of a mapping as it stands for âNo matching conceptâ. If you get garbage or a blank it should map to 0 in a CONCEPT_ID column. Also, NULL can be a devil and including it in a column heavily used for analysis is probably not a good idea.
THEMIS is trying to assert certain CDM data model conventions and this is part of that work. @Chris_Knoll, could you point me specifically do Data Model Convention you think this flies against? Iâm not sure where you are looking.
Hi, @ericaVoss,
I was just seeking a clarification about using CONCEPT_ID = 0 in the column 'VALUE_AS_CONCEPT" in the MEASUREMENT table. If the MEASUREMENT row only has a value as a number (Ie: they recorded a heart rate), would you would leave VALUE_AS_CONCEPT as NULL in this case? It would allow us to find those measurements that are recorded as numeric values as 'select count(*) from MEASUREMENT where VALUE_AS_CONCEPT IS NULL.
There is a MEASUREMENT.VALUE_AS_CONCEPT_ID column, if the raw source gives us categories (e.g. HIG, POS, NEG) we will map those values to CONCEPT_IDs. For example âHIGâ or âHâ will map to 4328749. Else it is mapped to 0. In Janssen CDMs no _CONCEPT_ID columns should have a value of NULL if they do that is a bug in BUILDER.
Very formal language. Why not say that fields with the suffix
â_concept_idâ are mandatory. Set the value to 0 âNo matching conceptâ when
a mapping to a Standard concept cannot be found.
However, the flavor of NULLs is a different topic. Taking from PCORnet
definitions.
Missing or Unknown Data Values
The PCORnet CDM will use the HL7 conventions of âNull Flavorsâ ( http://hl7.org/implement/standards/fhir/v3/NullFlavor/) as a basis for
representing missing or unknown values. Specifically,
for fields where an enumeration is present (i.e., a categorical set of
values), we will populate null values as follows:
A data field is not present in the source system. (populate with null)
A data field for an enumeration is present in the source system, but the
source value is null or blank. (populate with NI=No Information)
A data field for an enumeration is present in the source system, but the
source value explicitly denotes an unknown value. (populate with UN=Unknown)
A data field for an enumeration is present in the source system, but the
source value cannot be mapped to the CDM. (populate with OT=Other)
Which would require 3 or 4 new concepts in addition to using 0 to define
âUnknown valuesâ.
Not advocating for a change just providing information.
Hey, @ericaVoss, Thanks for talking this over with me offline.
My misunderstanding was how we see measurement values come across in the real world vs. a theoretical situation where a measurement of a particular type always being recorded as a number (not categorical) and thus there would be no value to map over to a categorical concept (stored in âvalue_as_conceptâ) so I was arguing the use of NULL in that case. But, from an analytical perspective, storing value_as_concept_id = 0 is no different from storing a null because thereâs not a analytical case where youâd want to process unknown/missing values.
I agree that we donât want to get hung up in the various flavors of ânullâ. When we see a 0 in the value_as_concept_id, weâll be able to refer to the âvalue_source_valueâ to understand what the original value was from the source. And to make this the convention for all *_CONCEPT fields, feels awkward, but I canât argue that there is a critical problem with that approach that will cause system failure.
Iâm drawing the line, tho when we make 000-00-00 the âunknown or missing dateâ value!
Aside from the discussion on the CONCEPT_ID fields, I just want to address some of the statements around NULLs:
##Nulls are wierd
If you consider that null is âthe absence of valueâ, then all the weird results make complete sense.
If I ask you True or False: A and B, A is TRUE.
You say: âWhatâs the value of B?â
Me: âB has no value. It is neither true nor falseâ
You: âThen how can that expression be TRUE or FALSE if I donât have the value of B?â
Me: "Now youâre thinking like a database! There is no value of A and B when B is null! The answer is NULL, and therefore, not TRUE, which is why those kind of records are dropped from the results.
Same thing as NULL == NULL is not true: you canât compare a value of something that has no value. Thatâs why we have âIS NULLâ.
All databases, except for Oracle(as usual), would logically resolve âBig â + null + "Birdâ as null. Same example as above: Concatenating 3 terms: âBig â, a non-value, and âBirdâ can not be done because you canât combine a value with a non-value to get something. Oracle (sigh) does an implicit coalesce(x,ââ) on those terms so your null becomes a ââ, but that does not mean that null is an empty string!
So, On point 1: NULLs are not weird. they have a specific meaning that, in my opinion, should be fully understood and embraced if you are designing a database model (such as the CDM).
Joins on NULL
When you perform a join on a column with NULL values in it, they drop out when you join to another column. A non-NULL value will keep all the records.
So, here is my counter example why this is desirable: if you have 2 relationships REL_A and REL_B which have an optional relationship to a single concept_id, and you want to know which from REL_A share the concept_id from REL_B, youâd do this:
FROM REL_A a
JOIN REL_B b on a.class_concept_id = b.class_concept_id
Those things in A and B which do not have a class_concept_id wonât be brought back in the join, because, from point 1, NULL, a non-value, never matches anything (including another null)
However, if i put â0â for the optional relationship to mean âthereâs no relationshipâ then I have to change my query:
FROM REL_A a
JOIN REL_B b on a.class_concept_id = b.class_concept_id
WHERE a.class_concept_id != 0
So youâll find yourself doing âwhere concept_id != 0â all over the place where you didnât want the non-valueâd rows to be returned.
I have not heard this, and before we dissuade the use of nulls, we should have some empirical data to back that claim up. Left Joins are potential performance issues on MPP systems (due to that all nodes needs to be scanned for the complete set of records), but the presence of a null fieldâŚI havenât heard any issues with that.
Ok, so thatâs a lot of spam, and in the end what @ericaVoss described with concept_ids makes sense, I just want to make sure that ânull-aversionâ doesnât cloud anyoneâs judgement especially with respect to âoptional fieldsâ. If itâs optional, let it be null!
I like to think that SOURCE_VALUES columns store the verbatim source information; if the raw data had NULL this is where you would see the NULL. The THEMIS group is suggesting CONCEPT_IDs should be NOT NULL as 0 is our representation of âNo matching conceptâ.
@Chris_Knoll I think you are in agreement now. However would you like to word the statement slightly differently to be more clear?
Iâd drop this. The ONLY difference is how oracle implicitly coalesceâs NULLS to ââ for string concatenation. The tone of the statement âNULLs shoudl not be usedâŚâ I think would lend people to make all âno valueâ or âoptionalâ fields into special numeric values, which is bad. NULLs, for purposes of boolean operations and joins, behaves consistently across platforms. You should be using them where appropriate.
So yes, we are in agreement specifically for concept_ids. I just want to make it clear it is not my position that âoptionalâ fields should be coded with a special âemptyâ value in other cases.
Iâd like to jump in with a slightly unrelated topic:
we do have situations where some codes get deprecated over time. Then youâll get _concept_id that is no longer standard. What will you do in the next ETL iteration? Will your ETL-scripts automatically convert them into 0 or will you go and ask for the new mapping? Erica is great in doing the latter , and itâs obviously the right way. Still wondering how other people do this though.
BTW,
âIf you get garbage or a blank it should map to 0 in a CONCEPT_ID column.â
Arenât you, @ericaVoss removing nonsense records? I came across a dataset where the overall amount of such was pretty high so we had to partially remove them. But since weâre adding surveys and other things that may have never existed in the vocabularies before Iâm thinking of creating a logic or a heuristic to throw out the garbage and save records that may be useful in the future.
@aostropolets, if there is consistent garbage we do trash it. For example, one of our data vendors if a procedure column is blank it is filled with â000â or â0000â - we used to bring it across but it was a ton of non-value records. We also see consistent incorrect NDCs like â00000000000â that we trash as well. Otherwise it comes into the CDM. Most data comes across unless we are sure it is nonsense.
Except âmapping to a Standard concept cannot be foundâ should be changed to âmapping to an OMOP supported concept cannot be foundâ.
Sometimes old data needs to be mapped to deprecated concepts. Example: a person took Drug X in 2016. Drug X was taken off the market in 2017. RxNorm deprecated the code for Drug X when it was taken off the market. Then OMOP updated their vocabularies and changed the designation of Drug X from a standard concept to a deprecated concept. Person still took Drug X and we want to keep the Drug_Exposure data.
Thatâs not how it works. It stays in there. RxNorm deprecates the concept only if it has a problem. The market has nothing to do with it. So, no non-Standard Concepts in the tables please. NDCs etc. will be remapped to the new correct Standard Concept.
But we do have that problem where source Concepts are also Standard, e.g. CPT4s. Here, they do deprecate it, and then suddenly you have nowhere to go. @Dymshyts is presenting the solution at the next CDM WG. Please come.
Foreign key into the Standardized Vocabularies (i.e. the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, condition_concept_id = 31967 contains reference value for SNOMED concept of 'Nausea'. Set the value to 0 'No matching concept' when a mapping to a standard concept cannot be found.
@aostropolets We bring in plenty of garbage source values because there is just too much to filter through. Like @ericaVoss says, if they are obvious, then we may remove them, but for many things like measurement values where I honestly canât map them one by one because it would take a lifetime, we bring them into the source value.
I agree with this THEMIS recommendation. We try to follow this for most of our ETL conversions.
Thanks for keeping me honest My fact checker was sleeping on the job and my explanation lacked very important details.
When RxNorm retires a concept, OMOP/RxNorm/powers that be change the standard_concept designation to null and the invalid_reason = U. I still have the retired RxNorm code for Imipramine 10mg enteric coated tablet with RXCUI and concept_code = 310979 in my data. I map to this concept in the drug_source_concept_id field. The drug_concept_id field has drug_concept_id = 778358, Imipramine Hydrochloride 10mg delayed release oral tablet because this is the standard concept for my source RxNorm code.
It sounds like people are comfortable with a 2 part recommendation here:
Submit recommendation to the CDM WG to make all CONCEPT_IDs NOT NULL columns.
update language in the WIKI under CONCEPT_ID
Foreign key into the Standardized Vocabularies (i.e. the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, condition_concept_id = 31967 contains reference value for SNOMED concept of 'Nausea'. Set the value to 0 'No matching concept' when a mapping to a standard concept cannot be found.