OHDSI Home | Forums | Wiki | Github

CONCEPT_ID (to NULL or not to NULL) [THEMIS WG3]

I feel like this complicates thing that you need to interpret the 0 in the unit_concept_id as there wasn’t an actual unit by looking at the unit_source_value for null (and what if we have one of those sources that are affraid of NULLs and instead they store empty strings (’’) in their data? then we have to check for null and empty string to understand if there was even a value that was mapped from?

Why not just put null in the columns where there wasn’t a source value to map from? That’s what null represents: the absence of value.

I personally prefer the consistency of CONCEPT_ID is a concept (0 representing the absence of a map) and then store the true raw value in the SOURCE_VALUE column. We aren’t going to perform standard analytics on NULL. At Janssen we have been doing this for years and never had a use case for storing a NULL in CONCEPT_ID. The only time I utilize the NULLs from the raw data is for ETL debugging and the SOURCE_VALUE does just find for this.

Will we be performing standard analytics on 0?

Not anything more than using it for reviewing what got mapped to 0, which @Ajit_Londhe, @clairblacketer, and I do do. If something didn’t map we don’t care if it the raw data had NULL, ‘’, or garbage - we review to figure out if we can fix it. 0 tells us in a standard way that we didn’t map to standard terminology and having the NULL is in the SOURCE_VALUE you can always use this if review what the raw data was.

This is the essence that @MaximMoinat, @Christian_Reich and myself have been saying: 0 implies you got something that failed to map. Null means there wasn’t anything in the first place. This isn’t about analytical use cases, this is about what the data represents. I’m surprised that you don’t see the benefit of 0 vs. null…if you just wanted to know what didn’t have a map, and you allow nulls in those cases, then searching for 0 means you’ll find everything that was attempted to map, but didn’t. I’m guessing that the ‘valid source but no concept in the vocabulary’ is much less prevalent vs no value to map. I’m guessing that you are reviewing a lot of needless rows that have concept 0 when you could have just ignored them outright because there was nothing mapped for that row anyways. Of course, the source_value column could help you refine your search, but strictly speaking about the concept column…If I see a 0 of an unknown concept, i’m assuming that there was some value there that failed to map over to a CDM concept…

Yes I agree and I still argue why would you differentiate between NULL, ‘’, ’ ', or an unmapped source code. What happens when the raw value actually sends text ‘NULL’ or ‘NA’, in that case I would say 0 but a NULL value is NULL? Just seems inconsistent.

In your example of searching for CONCEPT_ID IS NULL this wouldn’t even reliably bring back values you wanted unless the ETL has additional logic to discerning between “true” NULLS and values not mapped. We’d need logic to handle “flavors of NULL” which I don’t see value in.

I think it is much clearer and consistent to say CONCEPT_ID is 0 when not mapped and store what the raw data gave you in the SOURCE_VALUE column.

One example where I think NULL doesn’t work is if I’m linking DRUG_EXPOSURE to the Vocabulary I would automatically lose records when CONCEPT_ID IS NULL on a JOIN. You would be blind to the fact that there are rows that didn’t map with that JOIN.

It really depends on what you mapping.

Well after 5+ years of doing 13+ CDMs that don’t have NULLs in CONCEPT_ID I haven’t run into a problem or a need for it to be otherwise. :slight_smile:

I think that’s the best example of NULL vs. 0. If they are telling you it’s not a value or NA (which I’m assuming means ‘not applicable’ then saying that it is not a value is being consistent with the original data.

I’m sorry, I don’t understand this point. But you could substitute ‘flavors of unmapped concepts’ and your statement would make my point.

This is what a left join is for. And it’s explicitly to say ‘Bring me back rows even if their is no value in the right side table.’ The counter to your example is: you do a join on a table and get records back that you didn’t want, and you’ll have to exclude concept IDs being 0, but you don’t know if that 0 meant that there wasn’t a value vs. it was something that was unable to map…

I know you’re just being silly so I won’t comment on that! :wink:

If I may throw a technical point to this argument of NULL or not to NULL.

DBMS with MPP architecture apparently dont like NULLs. Apparently NOT NULL columns perform significantly better compared to NULLable columns. JOINS and WHERE work significantly better with NOT NULL columns. You ask me for reference – i heard this someplace and it was stored in my memory some place. Maybe @Chris_Knoll knows! Maybe it is how the ‘distribute on’ works - maybe MPP cant distribute NULLs. Also - having not gone to formal SQL school, i learnt the hard way that NULL is not equal to NULL

I am in favor of some flavor of integer – maybe negative integers to represent flavors of NULL, like -1, -2, -3 etc because all our concept_id’s are positive integers anyways… So we can clearly define each flavor of special information without any ambiguity and develop conventions on when to use what. This could be then in the OMOP vocabulary as an OMOP-generated concept_id with a vocabulary-id called ‘Special Values’, and domain ‘Special Values’, and concept_class_id ‘NULL flavors’. Then all negative value concept_id’s are some form of special concept’s and we could support a lot of such special situations using OMOP vocabulary.

This may make writing parameterized SQL and R packages easier - because we dont have to do stuff like

where concept_id in (33423, 32432342) or concept_id is null

we can just do where

where concept_id in (33423, 32432342, -1)
1 Like

I have not heard this, and have done lots of work in optimizing queries on mpp systems. The only place where MPP performance suffers is in the use of left joins (which doesn’t have to do with nullable columns). In the use of a Left join. All records from the left table (which may be distributed across multiple nodes) have to be joined to all records found in the right side. This causes data movement across nodes which is what you want to avoid in MPP environments. If it was not a left join, then you can leverage partitioning information from the ON clause to limit the rows compared to records living in the same node.

Based on azure guidance, you use hashing on columns that have no or very few null values. But this is for partitioning guidance (NULLs will go to the same partition, but this is same thing as all 0’s going to the same partition). But nothing in there says ‘null columns slow down performance’. You just need to be smart about what you partition on. Ref: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute.

The first one reads 'give me rows with these specified concepts or nothing at all.

The second one reads ‘give me rows with these specified concepts…and I don’t know what the last one refers to except it’s an inverse sign’.

I think the topic of ‘flavors of missing data’ is conflating the primary point here of NULL vs. 0. The null is very binary: Is there a value or is there not a value, that is the question. Null is very clear: there is no value. If you use 0, then you have to refer to other columns to understand what this 0 means, and you may even have to dig back into the ETL logic about how they decided to code 0 for those cases of ‘’, ‘NULL’ and ‘NA’ that @ericaVoss mentioned. I might be misunderstanding something, but isn’t the point of Themis is that you don’t have to go back to ETL logic on a study-by-study basis, and instead there are expectations that can be assumed from an ETL that follows the Themis standards?

@Chris_Knoll I really think for our CDM purposes at Janssen most of what we have are unmapped values; we very rarely get any NULL values from our sources. As you know, many of our data assets are US based claims or EHR systems which are centered around billing and it is extremely unlikely that we will see a NULL value anywhere in a field that relates to how a provider bills for a claim. The only place we ever really see NULLs would be somewhere related to an actual billable field, like measurement units or things like provider specialty, etc.

One of the benefits of our current approach to using 0s for NULLs and unmapped values is that we don’t have to tell the builder what to look for when mapping to zero, we can just say “map everything you can and anything you can’t, map to 0”. If we were decided to map all NULL values from the source to a NULL concept_id, we would then have to parse through the source each time we received an update so we could accurately tell builder what is a NULL and what isn’t. Considering the infrequency with which NULLs are actually present, this seems like it would take more time than it would save us in the long run.

So I just checked TruvenCCAE (v697) for counts of rows that have 0 as value_as_concept, and here are the top 5:

value_source_value	count 
NULL			2347857288
No			12968308
YES			2898728
0 drinks		2584288
Very good		1607292

So we see here, NULLS outpace the next unmapped source value of ‘No’ by a factor of 200. What query were you using to determine that there were rarely null values from the source? Maybe I’m looking at the wrong DB.

But that’s the job of the ETL logic, no?

All I want out of this Themis discussion is that when I look at a row, I can confidently say ‘there was no value in this column from the source’. I do not want to have to dig into ETL rules about how many different ways you can represent the absence of values (ie: ‘’, ‘NULL’, ‘NA’, ‘N/A’, ‘Not Provided’, ‘NP’, etc. the list goes on, and if there is a case where I want to analyze a data source on their capture of a measurement value that should have been stored as a categorical value (mapped to a concept) I’d just like to look for those that are not null vs null so I can understand the capture vs having to figure out if the value was provided or not through interpreting the value_as_source. And my interpretation on one source will not follow through to another source (unless it’s standardized via a Themis statement of standardization).

@Chris_Knoll I was referring to billable fields. These are ICD9/ICD10 codes, HCPCS, CPT4, and NDC codes to name the most popular. I would expect there to be NULLs in value fields because many measurements that we receive do not have values associated.

Right! so billable fields makes sense to have not null because you expect them to have some required value.

Are you in agreement that columns such some of those found in the measurement table (value_as_concept_id) should allow nulls?

While I still believe VALUE_AS_CONCEPT_ID, UNIT_CONCEPT_ID, and OPERATOR_CONCEPT_ID should be NOT NULL - they are still open for debate on tomorrow’s THESMIS F2F so let’s table this for now.

I think Clair/I are mostly concerned with items like CONDITION_CONCEPT_ID, DRUG_CONCEPT_ID.

This entire conversation is being covered at THEMIS this week.

I apologize if I ever said that fields like CONDITION_CONCEPT_ID and DRUG_CONCEPT_ID should allow nulls. That was never my intent. My concern here is related to this aversion to using null columns, and how it could impact future decisions about the CDM structure. Looking forward to the debate tomorrow.

In an attempt to document what was stated at the THEMIS F2F here is where I think we ended up on this topic:

RULE
Foreign key into the Standardized Vocabularies (i.e. the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, condition_concept_id = 31967 contains reference value for SNOMED concept of 'Nausea'. Set the value to 0 'No matching concept' when a mapping to a standard concept cannot be found (except for MEASUREMENT/OBSERVATION.VALUE_AS_CONCEPT_ID, MEASUREMENT/OBSERVATION.UNIT_CONCEPT_ID, MEASUREMENT.OPERATOR_CONCEPT_ID which can be NULL if data does not contain a value).

ACTION
1) Submit recommendation to the CDM WG to make all CONCEPT_IDs NOT NULL columns (except for MEASUREMENT/OBSERVATION.VALUE_AS_CONCEPT_ID, MEASUREMENT/OBSERVATION.UNIT_CONCEPT_ID, MEASUREMENT.OPERATOR_CONCEPT_ID which can be NULL if data does not contain a value) . 2) Update language in the WIKI under CONCEPT_ID https://github.com/OHDSI/CommonDataModel/wiki/Data-Model-Conventions

2 Likes

This has quieted down, any last thoughts?!

Sorry for re-joining this late, but I was just looking at the new CDM v6 and noticed that the _source_concept_id fields are all made NOT NULL. I don’t want to argue about this rule, overall it is fine for us. However, I am wondering what the rationale behind this change was. I assumed this Themis issue was only about the ‘regular’ concept ids, like condition_concept_id. I can’t find any reference to the source concept ids in any CDM or Themis issues.

Was it this Themis topic that ‘made’ the source concept id required? If yes, what is the idea behind it?

1 Like

Good question.
If we have some custom concepts and make store their mappings in source_to_concept_map, we keep _source_concept_id NULL.
So _source_concept_id should be nullable.

Yes, this was one of the THEMIS rules discussed during the F2f

While folks definitely still disagree on this topic (as you can see from the above thread) the finally agreed upon consensus was CONCEPT_ID columns should be a CONCEPT_ID or 0 for everything except MEASUREMENT/OBSERVATION.VALUE_AS_CONCEPT_ID, MEASUREMENT/OBSERVATION.UNIT_CONCEPT_ID, MEASUREMENT.OPERATOR_CONCEPT_ID which can be NULL if data does not contain a value.

t