From a mapping quality point of view the null/0 distinction is important. For non-required concept_ids like value_as_concept_id, a null represents that no attempt was made to map a source concept to a standard target concept. A 0 means that no suitable mapping was found. When generating mapping statistics, we ignore the nulls and only look at the concept_ids >= 0.
So in that perspective I would argue to allow null in those fields.
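The mapping-statistics convention above can be sketched with a toy MEASUREMENT table (sqlite3 stands in for the CDM database; the table contents and concept ids are assumed for illustration): NULL means no mapping was attempted, 0 means a mapping was attempted but failed, and a positive id means a successful map.

```python
import sqlite3

# Toy MEASUREMENT table following the convention described above:
# NULL = no mapping attempted, 0 = attempted but no match, >0 = mapped.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE measurement (measurement_id INTEGER, value_as_concept_id INTEGER);
INSERT INTO measurement VALUES (1, 4069590), (2, 0), (3, NULL);
""")

# COUNT(column) skips NULLs, so it counts only rows where a mapping was
# attempted; the CASE expression counts the rows that actually mapped.
attempted, mapped = con.execute("""
SELECT COUNT(value_as_concept_id),
       SUM(CASE WHEN value_as_concept_id > 0 THEN 1 ELSE 0 END)
FROM measurement
""").fetchone()
print(attempted, mapped)  # 2 1
```

With this convention, `mapped / attempted` gives a mapping rate that is not diluted by fields that were never populated in the source.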
@MaximMoinat, what are situations where you do not attempt to map and why do you need to keep track of this? How are you using this in an analysis? Or are you only using this for mapping statistics? You still should be able to get the answer you need by using the SOURCE_VALUE columns.
Submit recommendation to the CDM WG to make all CONCEPT_IDs NOT NULL columns.
Update language in the WIKI under CONCEPT_ID
Foreign key into the Standardized Vocabularies (i.e. the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, condition_concept_id = 31967 refers to the SNOMED concept 'Nausea'. Set the value to 0 ('No matching concept') when a mapping to a standard concept cannot be found.
THEMIS WG3 is bringing this up at the THEMIS F2F Day 1 this week.
RULE Foreign key into the Standardized Vocabularies (i.e. the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, condition_concept_id = 31967 refers to the SNOMED concept 'Nausea'. Set the value to 0 ('No matching concept') when a mapping to a standard concept cannot be found.
ACTION
Submit recommendation to the CDM WG to make all CONCEPT_IDs NOT NULL columns.
I would like to add to this. Rather than the vague requirement that the concept id is a "Foreign key into the Standardized Vocabularies", the CDM should define the Domain and, if there are additional constraints on the concept set, the Concept Class. This should be the definition for all "_concept_id" columns.
@DTorok I like this idea. I think the "foreign key to a standard concept" phrase that is used across the CDM documentation is too vague for comfort, and specifying the domain would give a little more structure. This may be a separate THEMIS issue, however (or just something I could tackle myself), as I think what @ericaVoss is most interested in here is setting all *CONCEPT_ID fields to NOT NULL and setting the value to 0 when there is no map to a standard concept.
Sorry for my late reply, @Chris_Knoll actually described this very well already:
So here we consider three concept_id fields that should be allowed to be NULL in the measurement table, as they can be missing:
value_as_concept_id (if the measurement only contains a numeric value)
unit_concept_id (if the measurement has only a concept value)
operator_concept_id
These fields might be left empty because there is no data to fill them with, not because there is "No matching concept". There is no source concept to match to the OMOP vocabulary.
On the other hand, it does make sense to set the cause_concept_id in the death table to NOT NULL.
Agree wholeheartedly. We should limit @ericaVoss's proposal to apply only to the proper *_concept_id fields of each Domain table. Values, units and operators should absolutely be nullable.
You can still set all of these to 0 because that is what the SOURCE_VALUE column is for. Store 0 in your CONCEPT_ID and store the NULL in the SOURCE_VALUE column (linking MEASUREMENT for reference).
value_as_concept_id → value_source_value
unit_concept_id → unit_source_value
operator_concept_id → (none)
Looks like OPERATOR_CONCEPT_ID doesn't have a SOURCE_VALUE column, but I would argue that it should, and that should be a recommendation to the CDM WG.
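A minimal Python sketch of this 0-in-CONCEPT_ID / raw-in-SOURCE_VALUE convention, using a hypothetical `map_unit` helper and a toy lookup table (the 8840 concept id is assumed for illustration):

```python
def map_unit(raw_value, lookup):
    """Return (unit_concept_id, unit_source_value) per the 0-convention:
    always emit a concept id (0 when unmapped, never NULL) and keep the
    raw source text in the *_source_value column."""
    if raw_value is None:
        return 0, None                   # nothing in the source; still 0
    return lookup.get(raw_value, 0), raw_value

# Toy vocabulary lookup (assumed mapping for illustration only).
lookup = {"mg/dL": 8840}

print(map_unit("mg/dL", lookup))   # (8840, 'mg/dL')  -- mapped
print(map_unit("bogus", lookup))   # (0, 'bogus')     -- failed to map
print(map_unit(None, lookup))      # (0, None)        -- nothing to map
```

Note that under this convention the last two cases both land on 0 in the concept column, and only the source_value column distinguishes "failed to map" from "nothing there", which is exactly the point of contention in the replies below.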
I feel like this complicates things: you need to interpret the 0 in unit_concept_id as "there wasn't an actual unit" by checking unit_source_value for NULL. And what if we have one of those sources that are afraid of NULLs and instead store empty strings ('') in their data? Then we have to check for both NULL and empty string to understand whether there was even a value that was mapped from.
Why not just put null in the columns where there wasn't a source value to map from? That's what null represents: the absence of a value.
I personally prefer the consistency of CONCEPT_ID always being a concept (0 representing the absence of a map) and then storing the true raw value in the SOURCE_VALUE column. We aren't going to perform standard analytics on NULL. At Janssen we have been doing this for years and never had a use case for storing a NULL in CONCEPT_ID. The only time I utilize the NULLs from the raw data is for ETL debugging, and the SOURCE_VALUE does just fine for this.
Not anything more than using it for reviewing what got mapped to 0, which @Ajit_Londhe, @clairblacketer, and I do. If something didn't map we don't care whether the raw data had NULL, '', or garbage - we review to figure out if we can fix it. 0 tells us in a standard way that we didn't map to standard terminology, and with the NULL in the SOURCE_VALUE column you can always review what the raw data was.
This is the essence of what @MaximMoinat, @Christian_Reich and myself have been saying: 0 implies you got something that failed to map. Null means there wasn't anything in the first place. This isn't about analytical use cases, this is about what the data represents. I'm surprised that you don't see the benefit of 0 vs. null... if you just wanted to know what didn't have a map, and you allow nulls in those cases, then searching for 0 means you'll find everything that was attempted to map, but didn't. I'm guessing that "valid source but no concept in the vocabulary" is much less prevalent than no value to map at all. I'm guessing that you are reviewing a lot of needless rows that have concept 0 when you could have just ignored them outright because there was nothing mapped for that row anyway. Of course, the source_value column could help you refine your search, but strictly speaking about the concept column... if I see a 0 of an unknown concept, I'm assuming that there was some value there that failed to map over to a CDM concept...
Yes I agree, and I still ask why you would differentiate between NULL, '', ' ', or an unmapped source code. What happens when the raw value is actually the text 'NULL' or 'NA'? In that case would you say it's a 0, but a truly missing value is NULL? Just seems inconsistent.
In your example of searching for CONCEPT_ID IS NULL, this wouldn't even reliably bring back the values you wanted unless the ETL had additional logic to discern between "true" NULLs and values that were not mapped. We'd need logic to handle "flavors of NULL", which I don't see value in.
I think it is much clearer and more consistent to say CONCEPT_ID is 0 when not mapped and store what the raw data gave you in the SOURCE_VALUE column.
One example where I think NULL doesn't work: if I'm linking DRUG_EXPOSURE to the Vocabulary, I would automatically lose records where CONCEPT_ID IS NULL on a JOIN. You would be blind to the fact that there are rows that didn't map with that JOIN.
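A toy sqlite3 sketch of this point (minimal CONCEPT and DRUG_EXPOSURE tables with assumed contents): an inner join silently drops the row whose concept id is NULL, while the 0 row survives because concept_id 0 ('No matching concept') exists in the vocabulary.

```python
import sqlite3

# Minimal vocabulary and event tables; contents are for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE concept (concept_id INTEGER, concept_name TEXT);
INSERT INTO concept VALUES (0, 'No matching concept'), (31967, 'Nausea');

CREATE TABLE drug_exposure (drug_exposure_id INTEGER, drug_concept_id INTEGER);
INSERT INTO drug_exposure VALUES (1, 31967), (2, 0), (3, NULL);
""")

# Inner join: the NULL row (id 3) vanishes; the unmapped 0 row (id 2) stays.
rows = con.execute("""
SELECT de.drug_exposure_id, c.concept_name
FROM drug_exposure de
JOIN concept c ON de.drug_concept_id = c.concept_id
ORDER BY de.drug_exposure_id
""").fetchall()
print(rows)  # [(1, 'Nausea'), (2, 'No matching concept')]
```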
It really depends on what you are mapping.
Well, after 5+ years of doing 13+ CDMs that don't have NULLs in CONCEPT_ID, I haven't run into a problem or a need for it to be otherwise.
I think that's the best example of NULL vs. 0. If they are telling you it's not a value, or NA (which I'm assuming means "not applicable"), then saying that it is not a value is being consistent with the original data.
I'm sorry, I don't understand this point. But you could substitute "flavors of unmapped concepts" and your statement would make my point.
This is what a left join is for. It says, explicitly, "bring me back rows even if there is no value in the right side table." The counter to your example is: you do a join on a table and get records back that you didn't want, and you'll have to exclude concept IDs being 0, but you don't know if that 0 meant that there wasn't a value vs. something that was unable to map...
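A sqlite3 sketch of the left-join behavior described here (same toy tables as the join example, contents assumed for illustration): all event rows come back, and the row with a NULL concept id simply joins to nothing.

```python
import sqlite3

# Minimal vocabulary and event tables; contents are for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE concept (concept_id INTEGER, concept_name TEXT);
INSERT INTO concept VALUES (0, 'No matching concept'), (31967, 'Nausea');

CREATE TABLE drug_exposure (drug_exposure_id INTEGER, drug_concept_id INTEGER);
INSERT INTO drug_exposure VALUES (1, 31967), (2, 0), (3, NULL);
""")

# Left join: no rows are lost; the NULL row comes back with no concept_name.
rows = con.execute("""
SELECT de.drug_exposure_id, c.concept_name
FROM drug_exposure de
LEFT JOIN concept c ON de.drug_concept_id = c.concept_id
ORDER BY de.drug_exposure_id
""").fetchall()
print(rows)  # [(1, 'Nausea'), (2, 'No matching concept'), (3, None)]
```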
I know you're just being silly so I won't comment on that!
If I may throw a technical point into this argument of to NULL or not to NULL.
DBMSs with MPP architecture apparently don't like NULLs. Apparently NOT NULL columns perform significantly better compared to NULLable columns, and JOINs and WHEREs work significantly better with NOT NULL columns. You ask me for a reference - I heard this someplace and it was stored in my memory someplace. Maybe @Chris_Knoll knows! Maybe it is how "distribute on" works - maybe MPP can't distribute NULLs. Also, not having gone to formal SQL school, I learnt the hard way that NULL is not equal to NULL.
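The "NULL is not equal to NULL" point can be checked directly. This is standard SQL three-valued logic (sqlite3 used here just as a convenient SQL engine): `NULL = NULL` evaluates to NULL (unknown), not TRUE, so comparison predicates never match NULLs against each other.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL (unknown), not TRUE:
print(con.execute("SELECT NULL = NULL").fetchone())  # (None,)
print(con.execute("SELECT 0 = 0").fetchone())        # (1,)

# A WHERE clause keeps only rows where the predicate is TRUE, so a
# NULL-to-NULL comparison filters the row out entirely:
hits = con.execute("SELECT 'matched' WHERE NULL = NULL").fetchall()
print(hits)  # []
```

This is also why the inner-join example earlier in the thread drops rows: the ON predicate can never be TRUE for a NULL concept id.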
I am in favor of some flavor of integer - maybe negative integers to represent flavors of NULL, like -1, -2, -3, etc., because all our concept_ids are positive integers anyway... So we can clearly define each flavor of special information without any ambiguity and develop conventions on when to use what. This could then be in the OMOP vocabulary as OMOP-generated concept_ids with a vocabulary_id of 'Special Values', domain 'Special Values', and concept_class_id 'NULL flavors'. Then all negative-value concept_ids are some form of special concept, and we could support a lot of such special situations using the OMOP vocabulary.
This may make writing parameterized SQL and R packages easier, because we don't have to do stuff like
where concept_id in (33423, 32432342) or concept_id is null
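A small sqlite3 sketch (toy table, arbitrary concept ids) of why that `OR concept_id IS NULL` clause is needed: an `IN` list never matches NULL, so a plain concept-set filter is blind to NULL rows, whereas a negative sentinel value would match like any other integer.

```python
import sqlite3

# Toy table; concept ids and contents are assumed for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE measurement (measurement_id INTEGER, unit_concept_id INTEGER);
INSERT INTO measurement VALUES (1, 8576), (2, NULL), (3, 0);
""")

# IN (...) never matches NULL, so the NULL row (id 2) is invisible here:
without_null = con.execute("""
SELECT measurement_id FROM measurement
WHERE unit_concept_id IN (8576, 0)
ORDER BY measurement_id
""").fetchall()
print(without_null)  # [(1,), (3,)]

# The extra IS NULL clause is required to pick up the NULL row as well:
with_null = con.execute("""
SELECT measurement_id FROM measurement
WHERE unit_concept_id IN (8576, 0) OR unit_concept_id IS NULL
ORDER BY measurement_id
""").fetchall()
print(with_null)  # [(1,), (2,), (3,)]

# With a negative sentinel (e.g. -1) stored instead of NULL, a plain
# IN (8576, 0, -1) would cover all three cases without the OR clause.
```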
I have not heard this, and I have done lots of work optimizing queries on MPP systems. The only place where MPP performance suffers is in the use of left joins (which doesn't have to do with nullable columns). In a left join, all records from the left table (which may be distributed across multiple nodes) have to be joined to all records found on the right side. This causes data movement across nodes, which is what you want to avoid in MPP environments. If it is not a left join, then you can leverage partitioning information from the ON clause to limit the comparison to records living on the same node.
Based on Azure guidance, you use hashing on columns that have no or very few NULL values. But this is partitioning guidance (NULLs will all go to the same partition, but that is the same thing as all 0's going to the same partition). Nothing in there says "nullable columns slow down performance" - you just need to be smart about what you partition on. Ref: Distributed tables design guidance - Azure Synapse Analytics | Microsoft Learn.
The first one reads "give me rows with these specified concepts, or nothing at all."
The second one reads "give me rows with these specified concepts... and I don't know what the last one refers to, except that it's an inverse sign."
I think the topic of "flavors of missing data" is conflating the primary point here of NULL vs. 0. The null is very binary: is there a value or is there not a value, that is the question. Null is very clear: there is no value. If you use 0, then you have to refer to other columns to understand what this 0 means, and you may even have to dig back into the ETL logic to see how they decided to code 0 for those cases of '', 'NULL' and 'NA' that @ericaVoss mentioned. I might be misunderstanding something, but isn't the point of Themis that you don't have to go back to ETL logic on a study-by-study basis, and instead there are expectations that can be assumed from an ETL that follows the Themis standards?