Why are these required and what is the implication of putting NULLs in these fields for all persons? Also what should race_source_concept_id, ethnicity_source_concept_id, and gender_source_concept_id contain if the codes in the source data do not exist in the concept table?
The concept_ids are always required. In every table. Instead of putting NULL put 0. Weâve had this convention forever, mostly because there is the notion that databases are faster with only non-NULL values. May not be true, but doesnât really matter. 0 it is.
The source_concept_ids or source_values can be nullable no problem.
The reason why it does matter is because this line of thinking is impacting our ability to make clean designs in our database structures. so, letâs all agree to not make this point in the future as to not give people âevidenceâ to point at that nulls are somehow slower than 0 values.
Adam, to answer your point specifically: the rationale behind 0 in these columns is that everyone must have some feature of ethnicity/race/gender, so itâs not that someone can be âabsentâ of those values (ie: null means absence of value), but rather you just donât know what the value actually is. In that case, itâs completely appropriate to specify âUnknown Concept (concept_id = 0)â for those columns, and it makes logical sense to do so.
Thatâs not entirely true either. Aggregate functions aside, NULL can negatively affect performance in other ways.
First, use of the IS NULL clause in your WHERE clause means that an index cannot be used for the query, and a table scan will be performed. This can greatly reduce performance.
Third, all things being equal, INNER JOINS are faster than OUTER JOINS. If you know all of your concept_id fields have a value, you can use inner joins. If concept_id fields can have NULLs, you must use outer joins. Considering the number of concept_id fields which must be joined back to the Concept table, that can be a huge performance issue.
However, I agree that performance is not the major issue. The main reason to not have nulls is because it is more accurate and less prone to error.
Given the third point above, if your concept_id field does not have nulls, it is immaterial whether you use an inner join or an outer join. The result will be the same. However, if you have nulls in a join field, inner and outer joins will return different results.
Similarly, a nullable field in a WHERE clause can return different result whether or not you use ISNULL or COALESCE functions.
Because NULL means âUnknownâ, its use adds a level of uncertainty which the explicit use of zero eliminates.
The specific case was the table in question has all nulls in the left table joined to a table with all nulls in the other columnâŚnot really a case weâd experence here, and it was specifically engineered to demonstrate a limitation of the optimizer. But, ok, yes, in that specific case, you get a performance problem.
If you only want records that actually have concept IDs you will need to join on the concept column and then add a WHERE CONCEPT_ID != 0 everyplace. The LEFT join in this case at least is explicit by indicating 'even if you donât find a matching conceptID, return the row.
Same as above, you have to resort to âwhere concept_id != 0â where the prior inner join on the nullable column would have removed those.
Null doesnât mean âunknownâ, it means âno valueâ. When you do a left join and get a null value on the right-side column, the null doesnât mean âI do not know what the value of the right-side column isâ it means 'the right side has no value because the join did not match.
Because of this, it is reasonable that we have a value of âunknown valueâ (concept_id = 0) vs. a ânullâ value which represents âthere is no valueâ. Thereâs a lot of discussion on this here: CONCEPT_ID (to NULL or not to NULL) [THEMIS WG3] - #7 by Chris_Knoll
Not sure what you mean here? Can you provide an example?