Are race and ethnicity required in the CDM v6?

Adam_Black · April 13, 2019, 9:38pm

According to the wiki (https://github.com/OHDSI/CommonDataModel/wiki/PERSON) the following race and ethnicity fields are required for v6 of the CDM.

race_concept_id
race_source_concept_id
ethnicity_concept_id
ethnicity_source_concept_id

Why are these required and what is the implication of putting NULLs in these fields for all persons? Also what should race_source_concept_id, ethnicity_source_concept_id, and gender_source_concept_id contain if the codes in the source data do not exist in the concept table?

Christian_Reich · April 13, 2019, 10:17pm

@Adam_Black:

The concept_ids are always required. In every table. Instead of putting NULL put 0. We’ve had this convention forever, mostly because there is the notion that databases are faster with only non-NULL values. May not be true, but doesn’t really matter. 0 it is.

The source_concept_ids or source_values can be nullable no problem.

Chris_Knoll · April 14, 2019, 9:43pm

This is not true, you can read about some research on it here: https://stackoverflow.com/questions/33916088/what-is-faster-sum-over-null-or-over-0.

The reason why it does matter is because this line of thinking is impacting our ability to make clean designs in our database structures. so, let’s all agree to not make this point in the future as to not give people ‘evidence’ to point at that nulls are somehow slower than 0 values.

Adam, to answer your point specifically: the rationale behind 0 in these columns is that everyone must have some feature of ethnicity/race/gender, so it’s not that someone can be ‘absent’ of those values (ie: null means absence of value), but rather you just don’t know what the value actually is. In that case, it’s completely appropriate to specify ‘Unknown Concept (concept_id = 0)’ for those columns, and it makes logical sense to do so.

roger.carlson · April 16, 2019, 1:40pm

That’s not entirely true either. Aggregate functions aside, NULL can negatively affect performance in other ways.

First, use of the IS NULL clause in your WHERE clause means that an index cannot be used for the query, and a table scan will be performed. This can greatly reduce performance.

Second, in some cases, JOIN performance can be dramatically affected by NULLS. (see: https://sqlperformance.com/2015/11/sql-plan/hash-joins-on-nullable-columns).

Third, all things being equal, INNER JOINS are faster than OUTER JOINS. If you know all of your concept_id fields have a value, you can use inner joins. If concept_id fields can have NULLs, you must use outer joins. Considering the number of concept_id fields which must be joined back to the Concept table, that can be a huge performance issue.

However, I agree that performance is not the major issue. The main reason to not have nulls is because it is more accurate and less prone to error.

Given the third point above, if your concept_id field does not have nulls, it is immaterial whether you use an inner join or an outer join. The result will be the same. However, if you have nulls in a join field, inner and outer joins will return different results.

Similarly, a nullable field in a WHERE clause can return different result whether or not you use ISNULL or COALESCE functions.

Because NULL means “Unknown”, its use adds a level of uncertainty which the explicit use of zero eliminates.

Chris_Knoll · April 16, 2019, 10:21pm

There’s an index for that:

The specific case was the table in question has all nulls in the left table joined to a table with all nulls in the other column…not really a case we’d experence here, and it was specifically engineered to demonstrate a limitation of the optimizer. But, ok, yes, in that specific case, you get a performance problem.

If you only want records that actually have concept IDs you will need to join on the concept column and then add a WHERE CONCEPT_ID != 0 everyplace. The LEFT join in this case at least is explicit by indicating 'even if you don’t find a matching conceptID, return the row.

Same as above, you have to resort to ‘where concept_id != 0’ where the prior inner join on the nullable column would have removed those.

Null doesn’t mean ‘unknown’, it means ‘no value’. When you do a left join and get a null value on the right-side column, the null doesn’t mean ‘I do not know what the value of the right-side column is’ it means 'the right side has no value because the join did not match.

Because of this, it is reasonable that we have a value of ‘unknown value’ (concept_id = 0) vs. a ‘null’ value which represents ‘there is no value’. There’s a lot of discussion on this here: CONCEPT_ID (to NULL or not to NULL) [THEMIS WG3]

Not sure what you mean here? Can you provide an example?

All this aside, I feel this has taken the thread off-topic. You can follow up with this on this thread: CONCEPT_ID (to NULL or not to NULL) [THEMIS WG3] because it is talking specifically about NULL use.

Adam_Black · April 19, 2019, 3:53pm

Thanks for the discussion everyone. @Christian_Reich definitely answered my initial question but this discussion spurred another question about how important various choices in the CDM implementation are for database performance that I posted here: What are the most important affect database performance issues to consider when setting up and OMOP CDM?