Source_concept_id mapping controversy

mgkahn · May 27, 2020, 3:43pm

Long post (as most of mine are…)

Reaching out to the community for advice on an internal controversy regarding ETL approach for mapping source_concept_ids. Because both approaches have strengths/weaknesses, interested in hearing from folks who picked one way versus the other and how it has worked out in actual practice.

Approach 1: For every lexicographical variation of a source_value where there does not yet exist a nonstandard or standard concept, create a custom, nonstandard concept_id and use as source_concept_id. Then create the appropriate mapping to the semantically correct standard concept_id. Thus, for every variation of “positive”, “pos”, “postv”, posssssitve", etc, there is a separate nonstandard concept_id stored in source_concept_id but all of these source_concept_ids map to the one semantically identical standard concept_id. Same with different lexicographical variations for conditions, observations, measurements.

Approach 2: For every lexicographical variation of a source_value, either use an existing concept (standard or nonstandard) or create a new custom concept that captures the common semantic meaning of the variations and map all of these variations to the semantically identical concept as the source_concept_id. Then map that source_concept_id to the correct standard concept_id.

In Approach 1, there is no semantic harmonization occurring at the source_concept_id step – there is 1:1 fidelity with the source_value. In Approach 2, there is some degree of semantic harmonization that may be more “liberal” (more allowed values) than what is allowed in the final standard concept_id vocabulary. But there no longer is a 1:1 fidelity with the source_value.

There is much more that could be said here, including implications for implementation and the location mapping efforts in the ETL tool chain but I’ll stop here to focus only on the question of using the mapping from source_value into source_concept_id for some degree of semantic harmonization or keeping fidelity with lexicographical variations in source_values. In both settings, you end up with a standard concept in concept_id but how best to leverage the source_concept_id mapping is where our internal controversy resides.

DTorok · May 27, 2020, 7:13pm

For Approach 2 where is the mapping of variations of the source_values, “pos”, “postv”, posssssitve", to either the non-standard source concept or the standard concept?

mgkahn · May 27, 2020, 7:36pm

@DTorok: Your question is an implementation question. I’d like to keep the discussion at the “philosophy” level. Do folks use the source_concept_id mapping to do any degree of harmonization (Approach 2) or do folks maintain the same fidelity between source_value and source_concept_id and defer harmonization until mapping to the standard concepts (Approach 1)?

Christian_Reich · May 28, 2020, 5:16am

@mgkahn: Concepts are not lexical units. If you need synonyms put them into the SYNONYM_CONCEPT_ID. From a conceptual perspective do Approach 1.

MPhilofsky · May 28, 2020, 4:29pm

Please clarify for me @Christian_Reich
Are you saying we should add the following source_values:

as rows to the Concept_Synonym table and map them all to concept_id = 45884084?

Example: 59%20AM

Then our ETL will do a lookup in the Concept table first and if the source_value != concept.concept_code (with appropriate JOIN conditions), we lookup in the Concept_Synonym table where source_value = concept_synonym.concept_synonym_name?

OR should we do approach #1 and create custom concept_ids > 2 billion for each source_value and then map via the Concept_Relationship table to concept_id = 45884084?