Suspected Diagnosis and its place in the OMOP CDM

MPhilofsky · July 10, 2020, 4:07pm

The convention for Observation.value_as_string is: “The observation result stored as a string. This is applicable to observations where the result is expressed as verbatim text.” In my example, “Myocardial infarction” is the verbatim text from the source, and Observation.value_as_concept_id = 312327 is the standard concept for that verbatim text. If the data had come across as a source_code that mapped to a standard concept_id, I wouldn’t insert the code in the value_as_string field. However, my EHR data stores it as free text.

And I completely agree with

I also agree with this

Those of us working with EHR data try to map all source values to standard concept_ids. But the reality is that there is a very long tail of singletons, and mapping every string to a concept_id would take a very long time. It is a waste of time and resources to map every string. However, keeping the string data in the CDM allows data holders behind the firewall to view the unmapped source values and assess their worth. The data holders can review the unmapped string results to see whether they are mappable, update their mappings, rerun the ETL and participate more fully in community research. This information is also available for many (all?) other concept_id fields.

The above is the use case for @Alexdavv’s proposal to add an Observation.value_source_value field to the CDM.
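As a rough illustration of the behind-the-firewall review described above, here is a minimal Python sketch using an in-memory SQLite table. The table is a cut-down stand-in for Observation with only a few columns, the sample strings other than “Myocardial infarction” are made up, and treating both 0 and NULL in value_as_concept_id as "unmapped" is an assumption for the example, not a statement of the CDM conventions.

```python
import sqlite3


def unmapped_strings(conn):
    """Return (value_as_string, row_count) pairs for rows without a mapped value concept.

    Assumption: a value counts as unmapped when value_as_concept_id is 0 or NULL.
    """
    query = """
        SELECT value_as_string, COUNT(*) AS n
        FROM observation
        WHERE value_as_string IS NOT NULL
          AND COALESCE(value_as_concept_id, 0) = 0
        GROUP BY value_as_string
        ORDER BY n DESC, value_as_string
    """
    return conn.execute(query).fetchall()


# Cut-down, hypothetical stand-in for the Observation table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observation (
        observation_id      INTEGER PRIMARY KEY,
        person_id           INTEGER NOT NULL,
        value_as_concept_id INTEGER,
        value_as_string     TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [
        (1, 10, 312327, "Myocardial infarction"),   # mapped, from the example in this thread
        (2, 11, 0, "poss. MI, r/o"),                # made-up unmapped string
        (3, 12, 0, "poss. MI, r/o"),                # same string again
        (4, 13, None, "chest tightness ?cardiac"),  # made-up unmapped string
    ],
)
conn.commit()

result = unmapped_strings(conn)
for text, n in result:
    print(f"{n:>3}  {text}")
```

Sorting by frequency puts the strings that would pay off most when mapped at the top, and leaves the singleton tail at the bottom where it can be ignored or sampled.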

MPhilofsky · July 10, 2020, 4:07pm

Thanks, @Alexdavv!

Alexdavv · July 10, 2020, 5:18pm

I assume @Christian_Reich and @Chris_Knoll advocate that the value_as_string field is for storing the verbatim result, but only when it is already a processed result, that is, when it has been established that this result:

is not numeric (numeric results should be placed in the value_as_number field only);
and cannot be mapped to any concept (mappable results should be placed in value_as_concept_id only).

Examples are raw instrument data, DNA sequences, proper names (not patients’, for sure), etc.

And the main reason is that standardized analytics (basically, string matching) can be applied to the value_as_string field. That is why we need to keep this field fairly clean.
Normally nobody applies standardized analytics to the _source_value fields, which is where this particular ‘Myocardial infarction’ from the source should be placed.
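The routing rule described in this post could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `ObservationValue`, `route_value` and `HYPOTHETICAL_MAP` are names invented for the example, the lookup table stands in for whatever mapping an ETL actually uses, and value_source_value is the field proposed in this thread.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ObservationValue:
    """Hypothetical holder for the value_* fields of one Observation row."""
    value_as_number: Optional[float] = None
    value_as_concept_id: Optional[int] = None
    value_as_string: Optional[str] = None
    value_source_value: Optional[str] = None  # field proposed in this thread


# Stand-in for a real source-to-standard mapping; contents are illustrative.
HYPOTHETICAL_MAP: Dict[str, int] = {"myocardial infarction": 312327}


def route_value(raw: str, mapping: Dict[str, int]) -> ObservationValue:
    """Place a raw source result in exactly one of the value_as_* fields."""
    text = raw.strip()
    # The verbatim source text is always kept in the source-value field.
    out = ObservationValue(value_source_value=text)

    # 1. Numeric results go to value_as_number only.
    #    (float() also accepts strings such as "nan"; a real ETL may want to reject those.)
    try:
        out.value_as_number = float(text)
        return out
    except ValueError:
        pass

    # 2. Results that map to a concept go to value_as_concept_id only.
    concept_id = mapping.get(text.lower())
    if concept_id is not None:
        out.value_as_concept_id = concept_id
        return out

    # 3. Only what is left (e.g. sequences, proper names) goes to value_as_string.
    out.value_as_string = text
    return out


numeric = route_value("98.6", HYPOTHETICAL_MAP)
mapped = route_value("Myocardial infarction", HYPOTHETICAL_MAP)
leftover = route_value("ACGTTGCA", HYPOTHETICAL_MAP)
```

With this ordering, ‘Myocardial infarction’ ends up in value_as_concept_id and the source-value field, and value_as_string stays empty for it, which keeps that field clean for string matching.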

Christian_Reich · July 10, 2020, 6:44pm

I think we are all on the same page here. Let’s add the source_value proposal, and let’s also put in a proposal to get rid of the string thing, which will not be executed until we have much better capacity to map things.

Andy_Kanter · July 13, 2020, 1:32am

Christian, it was a little hard to follow all of the discussion here, and the switching between diagnoses, procedures and attributes of diagnoses was confusing. I think saving the original text from the source in addition to the coded value is best practice, not only to ensure that fidelity is not lost when people go back to extract more information or check the standard concept maps, but also from a provenance perspective. I think CDA and FHIR both ensure that the original text is not lost.