THEMIS Question: What do people put in the source_value fields

This assumes that your proprietary source code has a concept (standard or non-standard) in the CDM vocabulary. There’s a gazillion concepts in the CONCEPT table so you will probably find something that is close enough to what the proprietary source code is representing. Remember: the point of the standardized vocabulary is that you only have one concept to represent a single ‘idea’. If you inject duplicate standard ‘ideas’ into the CONCEPT table, then the researcher has to know all the different ways of identifying the same ‘thing’ when locating data. As @ericaVoss mentioned, Usagi is a great tool that uses term/string matching to find a concept matching your source code value (even better if your source code value has a lengthy name that describes what the source code represents).

We even add codes in when we don’t get them mapped to something, so it is clear we couldn’t find a good standard. We just set TARGET_CONCEPT_ID to 0 (see example). Also, if we start with 100 source codes from the data it is comforting to end with 100 in the SOURCE_TO_CONCEPT_MAP even if some of them just map to 0. But again, even this is not required.
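A minimal sketch of that bookkeeping, assuming the mappings are held in a plain Python dict. The helper `build_stcm_rows`, the vocabulary name `MY_LOCAL_CODES`, and the source codes are hypothetical, and only a few of the SOURCE_TO_CONCEPT_MAP columns are shown. Codes without a mapping get TARGET_CONCEPT_ID 0, so every source code that goes in also comes out.

```python
from typing import Dict, List


def build_stcm_rows(
    source_codes: List[str],
    mapped: Dict[str, List[int]],
    vocabulary_id: str = "MY_LOCAL_CODES",  # hypothetical vocabulary name
) -> List[dict]:
    """Return one row per (source code, target); unmapped codes map to 0."""
    rows = []
    for code in source_codes:
        # Fall back to concept_id 0 when no standard concept was found.
        targets = mapped.get(code) or [0]
        for target in targets:
            rows.append(
                {
                    "source_code": code,
                    "source_vocabulary_id": vocabulary_id,
                    "target_concept_id": target,
                }
            )
    return rows


# Hypothetical source codes; the two target ids are the ingredient
# concepts quoted elsewhere in this thread, used only as placeholders.
source_codes = ["DRUG_A", "DRUG_B", "UNKNOWN_X"]
mapped = {"DRUG_A": [1736971], "DRUG_B": [1704183]}

rows = build_stcm_rows(source_codes, mapped)
distinct_codes_out = {row["source_code"] for row in rows}
```

Comparing `distinct_codes_out` with the input list is a cheap check that no source code was silently dropped.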

It is also okay in the SOURCE_TO_CONCEPT_MAP table to map one source code to multiple standard concepts. For example, I created a SOURCE_TO_CONCEPT_MAP for use in the CommonEvidenceModel where we need to map terms from EU Product Labels like “ABACAVIR, LAMIVUDINE”. In this case, this “source code” will map to two TARGET_CONCEPT_IDs: 1736971-abacavir and 1704183-Lamivudine.
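As a rough illustration of the one-to-many case, the sketch below loads two rows for the same source code into an in-memory SQLite table and reads both targets back. The table holds only a subset of the SOURCE_TO_CONCEPT_MAP columns, and the vocabulary id `EU_PL` and the helper `target_concepts` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Subset of the SOURCE_TO_CONCEPT_MAP columns, enough for the lookup.
conn.execute(
    """
    CREATE TABLE source_to_concept_map (
        source_code          TEXT    NOT NULL,
        source_vocabulary_id TEXT    NOT NULL,
        target_concept_id    INTEGER NOT NULL
    )
    """
)
# One source code, two rows: one per ingredient.
conn.executemany(
    "INSERT INTO source_to_concept_map VALUES (?, ?, ?)",
    [
        ("ABACAVIR, LAMIVUDINE", "EU_PL", 1736971),  # abacavir
        ("ABACAVIR, LAMIVUDINE", "EU_PL", 1704183),  # lamivudine
    ],
)


def target_concepts(conn, source_code, vocabulary_id):
    """Return every target concept_id mapped from one source code."""
    cursor = conn.execute(
        """
        SELECT target_concept_id
        FROM source_to_concept_map
        WHERE source_code = ? AND source_vocabulary_id = ?
        ORDER BY target_concept_id
        """,
        (source_code, vocabulary_id),
    )
    return [row[0] for row in cursor]


targets = target_concepts(conn, "ABACAVIR, LAMIVUDINE", "EU_PL")
```

An ETL that joins on this table would therefore produce two records for the one source row, which is the behaviour questioned in the reply that follows.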

Why not something under the abacavir/lamivudine dose group? Perhaps this is a separate THEMIS question, but I don’t know whether a single exposure to a combination should be split into 2 separate drug_exposure records or recorded with the combination drug form, where that exists in the vocabulary. If you do it the way you suggest, you won’t find this exposure if you look for the specific combination as the Clinical Dose Group.

In this example, for CEM, I just need to associate evidence to ingredients so this works fine. Also, sometimes the representation that you need doesn’t exist in the CONCEPT table, so this could be the best representation that you can do.

Agreed! Lots of work. We do not (currently) use the “Concept Ancestor” in our mapping efforts because it is reserved for “standard” concepts.

One table may be easier, BUT the Source to Concept Map doesn’t allow us to create a >2 billion source_concept_id. This is our lifeline back to the source, and it allows us to query on the source_concept_id when a mapping to a standard concept is not available or the mapping loses information.

Add it as a source_concept_id and ask the vocabulary team (@Dymshyts & @Christian_Reich) to add the code.

These >2 billion concept_ids are source_concept_ids. Source_concept_ids are only for local folks and local studies.

The Concept Relationship table also has valid start and end dates for the relationship. We default to a start date of 1970 & end with 2099. When an appropriate concept_id becomes available in OMOP, we will change the relationship end date and create the new concept relationship record.
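A sketch of that life cycle, under a few assumptions that are not stated above: the default dates are taken to be 1970-01-01 and 2099-12-31, the old relationship is ended the day before the new one starts, and the two target concept_ids are placeholders rather than real OMOP concepts. The names `ConceptRelationship` and `remap` are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import List

DEFAULT_START = date(1970, 1, 1)   # assumed exact default start date
DEFAULT_END = date(2099, 12, 31)   # assumed exact default end date


@dataclass(frozen=True)
class ConceptRelationship:
    concept_id_1: int
    concept_id_2: int
    relationship_id: str = "Maps to"
    valid_start_date: date = DEFAULT_START
    valid_end_date: date = DEFAULT_END


def remap(
    relationships: List[ConceptRelationship],
    local_concept_id: int,
    new_target_id: int,
    effective: date,
) -> List[ConceptRelationship]:
    """End-date the active mapping and append a new one from `effective`."""
    updated = []
    for rel in relationships:
        is_active = (
            rel.concept_id_1 == local_concept_id
            and rel.relationship_id == "Maps to"
            and rel.valid_end_date >= effective
        )
        if is_active:
            # Keep the old record for history, but close it out.
            rel = replace(rel, valid_end_date=effective - timedelta(days=1))
        updated.append(rel)
    updated.append(
        ConceptRelationship(
            local_concept_id, new_target_id, valid_start_date=effective
        )
    )
    return updated


LOCAL_ID = 2_000_000_001      # local source concept
INTERIM_TARGET = 900001       # placeholder, not a real OMOP concept_id
BETTER_TARGET = 900002        # placeholder, not a real OMOP concept_id

history = remap(
    [ConceptRelationship(LOCAL_ID, INTERIM_TARGET)],
    LOCAL_ID,
    BETTER_TARGET,
    date(2018, 6, 1),
)
```

Keeping the closed record rather than overwriting it preserves which target was in force for any earlier CDM build.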

We use Usagi to map our string names when there is a high likelihood of getting a very good match, as occurs with drugs and procedures. When I mapped our tobacco use concepts I ran into many problems using Usagi. First, I was unable to limit the domain because tobacco concepts are in multiple domains. Second, Usagi returned many rows that were not the closest match. I found it easier to search in Atlas and then create and export concept sets for tobacco concepts. Using Atlas is a very manual process, but I find it easier to find matching concepts.

The stcm actually does have a column source_concept_id (“A foreign key to the Source Concept that is being translated into a Standard Concept”), which you could combine with a >2 billion concept. In practice, we do not do this as it has little additional value.


We discussed this in yesterday’s face-2-face meeting we had for THEMIS. Since this subject does not really belong there, let me describe the conclusions here. We are proposing to do the following:

  • The right way to handle local codes and their mappings is the >2B CONCEPT and CONCEPT_RELATIONSHIP table structure. But it isn’t easy to use.
  • We will expand PALLAS, allowing it to manage local codes
  • Life cycle: constant concept_id, separate valid dates for concepts and relationships, configuration of other fields being constant or overwritten
  • Different kinds of mappings (equivalent, up-hill)
  • Synonyms
  • Merging with Standardized Vocabularies (the public one)
  • Full compliance with rules: Map to only one Standard (unless split explicitly desired), chaining of deprecated Target Concepts, handling of maps for disappeared source Concepts, etc.
  • Versioning and tagging
  • PALLAS could be run on ohdsi.org, or deployed locally (for proprietary codes)

Till we have this thing and it is robust, we keep the ugly SOURCE_TO_CONCEPT_MAP going. We also need to talk to @schuemie about adjusting Usagi.

There are a couple of ideas in this thread, but here I’m specifically talking about the “do we standardize what goes into the various source fields” part of the conversation. Based on the comments in this thread and the conversation at the THEMIS F2F, here is what I think the recommendation and action are:

Work with CDM WG to update the “Data Model Conventions” part of the wiki in the _SOURCE_VALUE description cell and change it to the below statement.

Verbatim information from the source data, typically used in ETL to map to CONCEPT_ID, and not to be used by any standard analytics. There is no standardization for these fields and these columns can be used as the local CDM builders see fit. A typical example would be an ICD-9 code without the decimal from an administrative claim as condition_source_value = '78702' which is how it appeared in the source.

Note - this also incorporates the previous verbiage that was there.
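To illustrate the proposed wording, here is a small sketch that stores the code exactly as it appeared in the source and derives a separate dotted form for vocabulary lookup. It assumes ICD-9-CM diagnosis codes (decimal after the third character, or after the fourth for E codes); the helper names and the `lookup_code` key are hypothetical and `lookup_code` is not a CDM column.

```python
def icd9cm_with_decimal(raw_code: str) -> str:
    """Insert the decimal point into an undotted ICD-9-CM diagnosis code."""
    code = raw_code.strip().upper()
    # E codes carry four characters before the decimal, all others three.
    prefix_len = 4 if code.startswith("E") else 3
    if len(code) <= prefix_len:
        return code  # category-level code, no decimal part
    return f"{code[:prefix_len]}.{code[prefix_len:]}"


def condition_fields(raw_code: str) -> dict:
    """Keep the verbatim source value; use the dotted form only for mapping."""
    return {
        "condition_source_value": raw_code,            # verbatim from source
        "lookup_code": icd9cm_with_decimal(raw_code),  # hypothetical ETL helper value
    }


record = condition_fields("78702")
```

The point of the split is that `condition_source_value` is never rewritten, whatever normalization the ETL needs in order to find a concept.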

The other idea on this thread is how source codes should be managed. Should CONCEPT_IDs be created? At the THEMIS F2F it was agreed to support both approaches for now, until there are tools that make it easier to create CONCEPT_IDs for native source codes.

I’m more confident in my SOURCE_TO_CONCEPT_MAP description than in the 2B one, but I did my best to incorporate the above feedback. Improvements to the text below are welcome.

Improving the clarity of what the options are:

There are three approaches to handle source codes that are not in the OMOP Vocabulary (in order of complexity):

1. Leveraging the SOURCE_TO_CONCEPT_MAP: In the OMOP Vocabulary there is an empty table called the SOURCE_TO_CONCEPT_MAP. It is a simple table structure that allows you to establish mapping(s) for each source code with a standard concept in the OMOP Vocabulary (TARGET_CONCEPT_ID). This work can be facilitated by the OHDSI tool USAGI which does text similarity between your source code descriptions and the OMOP Vocabulary and exports mappings in a SOURCE_TO_CONCEPT_MAP table structure. Example SOURCE_TO_CONCEPT_MAP files can be found here. These generated SOURCE_TO_CONCEPT_MAP files are then loaded into the OMOP Vocabulary’s empty SOURCE_TO_CONCEPT_MAP prior to processing the native data into the CDM so that the CDM builder can use them in a build.

2. Adding CONCEPT.CONCEPT_IDs: When a source code is not supported by the OMOP Vocabulary, one can create new records in the CONCEPT table; however, the CONCEPT_IDs should start >2000000000 so that it is easy to tell the OMOP Vocabulary concepts apart from the site-specific concepts. Once those concepts exist, CONCEPT_RELATIONSHIP records can be generated to map them to standard terminologies. USAGI can facilitate this process as well.

3. Working with ODYSSEUS Data Services to add to the OMOP Vocabulary: The OMOP Vocabulary is evolving and new vocabularies can be added; working with the ODYSSEUS team is the best way to do this.

Add this description with the CDM WG to the FAQ section.
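For approach 2, a minimal sketch of assigning site-specific CONCEPT_IDs above the 2 billion boundary. The helper names and the source codes are hypothetical, and a real implementation would also need to populate the remaining CONCEPT columns.

```python
from itertools import count
from typing import Dict, Iterable

LOCAL_CONCEPT_ID_FLOOR = 2_000_000_000


def is_local_concept(concept_id: int) -> bool:
    """True for site-specific concepts, False for OMOP Vocabulary ones."""
    return concept_id > LOCAL_CONCEPT_ID_FLOOR


def assign_local_concept_ids(
    source_codes: Iterable[str],
    existing_ids: Iterable[int] = (),
) -> Dict[str, int]:
    """Give each distinct source code the next free id above the floor."""
    start = max([LOCAL_CONCEPT_ID_FLOOR, *existing_ids]) + 1
    next_id = count(start)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return {code: next(next_id) for code in dict.fromkeys(source_codes)}


assigned = assign_local_concept_ids(["LOCAL_A", "LOCAL_B", "LOCAL_A"])
more = assign_local_concept_ids(["LOCAL_C"], existing_ids=assigned.values())
```

Passing the ids already in use keeps previously assigned concept_ids constant across vocabulary refreshes, which matches the "constant concept_id" item in the life cycle list earlier in the thread.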

Looks like discussion has quieted down on this THEMIS recommendation. Are we comfortable with it?

As long as we reverse the order, fine.
