
THEMIS Question: What do people put in the source_value fields

I vote to NOT standardize the source_value fields for the same reason as @MPhilofsky.

The data owners in Korean hospitals and the national insurance service often complain that they lose some of their data because of vocabulary standardization. The source_value fields are an important place to store their own data in the CDM.

I do understand what @Christian_Reich said and I admire his passion for standardization. Ungoverned source_value fields might be contrary to the spirit of the CDM. But sometimes people want to conduct their own study, or a regional collaborative study, using the CDM.

I also vote no. Similarly, in the VA we sometimes use source_value to hold the source-reported LOINC code that we believe to be mismapped (strongly, in some cases, because we did chart review or NLP with gold-standard verification, i.e. clinical eyes on the review). Keeping it lets the user still review the value mapped in OMOP and preserves source fidelity. At other times, when no concept is available, the concept_id is 0 or null and the source value is not null. This is heavily used in labs and drugs, where, as you may guess, with over 2.6 billion rows each, human error occurs with some regularity.

I vote no too, but I would like a concept_id field that is something like source_vocabulary_concept_id

For example, in the condition table, the source value may be an ICD9, ICD10, or SNOMED code, or something else. How do we know which?

Adding source_vocabulary_concept_id will improve standardization of *_source_value

@gowtham_rao I’m not sure we need a column for source_vocabulary_concept_id. This information is already available in the concept table so it only requires one join.
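A minimal sketch of that one join, assuming an in-memory SQLite database with cut-down versions of the concept and condition_occurrence tables. The rows and concept_ids are invented for illustration and are not real OMOP concepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_code TEXT,
    vocabulary_id TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    condition_source_value TEXT,
    condition_source_concept_id INTEGER
);
-- Illustrative rows only; these concept_ids are made up for the sketch.
INSERT INTO concept VALUES
    (101, '250.00', 'ICD9CM'),
    (102, 'E11.9', 'ICD10CM');
INSERT INTO condition_occurrence VALUES
    (1, '250.00', 101),
    (2, 'E11.9', 102),
    (3, 'LOCAL-42', 0);
""")

# One join from the source concept to the concept table recovers the vocabulary.
rows = conn.execute("""
SELECT co.condition_source_value, c.vocabulary_id
FROM condition_occurrence AS co
LEFT JOIN concept AS c
  ON c.concept_id = co.condition_source_concept_id
ORDER BY co.condition_occurrence_id
""").fetchall()

for source_value, vocabulary_id in rows:
    print(source_value, vocabulary_id)
```

In this toy data the third row has a source concept of 0, so the join has nothing to attach to and the vocabulary comes back empty.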

I also vote to not standardize the source_values. To me, the standardization occurs in the mapping to a standard concept, and the source_value is there either for error checking or for a last-resort analysis where the standard concepts are too broad. I see the source_values as a way for the ETL-er to communicate with the end users about how the source was mapped.

True when *_source_concept_id > 0.

How about when *_source_concept_id = 0, i.e. no OMOP concept is available? How do you know what vocabulary the source value belongs to? As described here

I agree with the rest to not standardize the source_value fields.

However, we might want to standardize the way in which we store non-OMOP source concepts. As far as I can see, there are now two ways: storing the source concept info (code, vocabulary, description) either in the source_to_concept_map or in a new 2-Billionaire concept. I would suggest using the latter, removing the source_code, source_vocabulary_id and source_description from the source_to_concept_map table, and only referencing the 2-Billionaire source_concept_id. Or even deprecating the source_to_concept_map table and creating the mapping as a new 2-Billionaire record in the concept_relationship table. Any thoughts?

@Gowtham_Rao. Do you store the description of your non-omop source concepts somewhere? e.g. in the source_to_concept_map?

@MaximMoinat, @Gowtham_Rao:

Aaaah. @ericaVoss and I already had some good crossing of swords on this one. She wants to keep the SOURCE_TO_CONCEPT_MAP, I want to put it into its well-deserved retirement with V6. Not really a THEMIS job, let’s take it to the CDM Working Group and duke it out.

I’m strongly with @ericaVoss on this one. I think we do need conventions for how to use the SOURCE_TO_CONCEPT_MAP table, but it has proved to be a valuable standardized structure for handling the reality that not all source codes will be in the standard vocabulary. Providing a defined solution to this problem would be a help for the whole community.


Argh!!! She came tattling and whining to her “big brother” to get some backup! :smile:

Let’s discuss in a new Forum. I’ll open.


I agree with @patrick_ryan, but we need to know how to do it the right way.

I have been thinking about how to handle proprietary code sets, and don’t have a good solution. There are proprietary codes that are specific to one organization, and nobody outside it has any interest in or use for these codes. Describing these codes in some OMOP tables is one option, but that duplicates work already being done by the source system, creating a data duplication and maintenance nightmare. I don’t know the answer.

Maybe it is source_to_concept_map – just need to study it

@Gowtham_Rao as discussed above, if it’s a proprietary vocabulary we put the codes in the SOURCE_TO_CONCEPT_MAP complete with all available information from the source. That way, even if you can’t map a code to a standard concept, you retain the vocabulary_id, etc. for later use. In our data, if an ICD9 code, for example, doesn’t have a concept_id, this is usually due to an error in the code, which makes it useless to us anyway. In Truven CCAE, the top non-proprietary code with a source_concept_id of 0 occurs 893 times; it is not mapped because it is probably an ICD9 code, though the database told us it should have been an ICD10 code. It occurs correctly as an ICD9 code over 27 million times.

We have so very many custom source codes. They are throughout every domain of our EHR data. I’m throwing out our process in hopes of getting some feedback. The following refers to mapping our EHR data. I know the claims data is very different.

When we first started our ETL process, we were under the impression the Source to Concept Map table was being deprecated. So, we decided to use the Concept and Concept Relationship tables in place of the Source to Concept Map. We create a concept_id above 2 billion for each of our custom source codes. We add in all the attributes for every concept. Then we map the > 2 billion concepts to standard concepts in the Concept Relationship table. We use a combination of Usagi, Atlas, and hand mapping.

Two differences in the structure of the tables:

  1. Source to Concept Map has a source description field which is not present in the Concept table. And the Concept table has a concept name field which is not present in the Source to Concept Map table. I find the concept name field is adequate and granular enough that a source description of the code is not needed.

  2. The Concept Relationship table has relationship start and end dates. The Source to Concept Map table does not.

Our ETL takes the source value and matches it to the concept code field of the Concept table. Regardless of whether it is an OMOP-supported concept or a custom (> 2 billion) concept, the ETL will find the concept_id associated with the code. If the code is not a standard concept, the ETL looks to the Concept Relationship table for a “Maps to” relationship where concept_id_2 is a standard concept. We do not have any source_concept_id = 0 EXCEPT when the source updates their source values before we create a custom source value.
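The lookup described above can be sketched roughly as follows, assuming an in-memory SQLite database with simplified Concept and Concept Relationship tables. The `resolve` helper, the 'LocalLab' vocabulary, and every id and code here are hypothetical, made up for illustration; this is not our actual ETL code. A real ETL would likely also filter on vocabulary_id, since a concept code alone can be ambiguous across vocabularies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_code TEXT,
    vocabulary_id TEXT,
    standard_concept TEXT
);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER,
    concept_id_2 INTEGER,
    relationship_id TEXT
);
-- All ids and codes below are invented for illustration.
INSERT INTO concept VALUES
    (500, 'STD-1', 'SNOMED', 'S'),
    (2000000001, 'LAB_GLU', 'LocalLab', NULL);
INSERT INTO concept_relationship VALUES (2000000001, 500, 'Maps to');
""")


def resolve(source_value):
    """Return (source_concept_id, standard_concept_id); 0 means not found."""
    row = conn.execute(
        "SELECT concept_id, standard_concept FROM concept WHERE concept_code = ?",
        (source_value,),
    ).fetchone()
    if row is None:
        # No OMOP or custom concept exists yet for this source value.
        return (0, 0)
    source_concept_id, standard_flag = row
    if standard_flag == "S":
        # The code is itself a standard concept.
        return (source_concept_id, source_concept_id)
    # Otherwise follow a 'Maps to' relationship to a standard concept.
    mapped = conn.execute(
        """
        SELECT cr.concept_id_2
        FROM concept_relationship AS cr
        JOIN concept AS c ON c.concept_id = cr.concept_id_2
        WHERE cr.concept_id_1 = ?
          AND cr.relationship_id = 'Maps to'
          AND c.standard_concept = 'S'
        """,
        (source_concept_id,),
    ).fetchone()
    return (source_concept_id, mapped[0] if mapped else 0)


print(resolve("LAB_GLU"))
print(resolve("STD-1"))
print(resolve("NOT_LOADED_YET"))
```

The last call stands in for the one case mentioned above where a source_concept_id of 0 still shows up: a source value that has no custom concept yet.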

@ericaVoss: How do you use the Source to Concept Map table? I’m always open to ideas and other solutions to manage the custom source concepts.

Thanks for describing your approach. It demonstrates again that we need to standardize the approach for mapping new, local source codes to target concepts. I think your 2 billion concept approach should be the standard way of dealing with local mappings (although we have been using the source_to_concept_map).

Some thoughts:

  • In my opinion the source_description and source_name are different names for the same thing. We have been using source_description as the source_name of our local source codes. Using both seems redundant indeed.
  • The source_to_concept_map table (stcm) does have valid_start_date and valid_end_date fields. Or do they have a different meaning?
  • And I am curious how you have used Atlas for the concept mapping. Does Athena now offer the same features?

And the way we use the stcm is very similar to your approach, except that both the source code info and the ‘maps to’ relationship are contained in the stcm. For every source vocabulary we insert rows into the stcm, e.g. these procedure codes. No modification of the standard vocabulary tables is needed, except for adding the source vocabulary id to the vocabulary table.

Thank you. For the proprietary source codes, do you assign a 2-billion+ concept id? The vocabulary id may also be a 2-billion+ one if it is a proprietary vocabulary, correct?

Do you also append these 2-billion+ codes to the OHDSI-maintained vocabulary files?

This is a common problem shared by many. Claims data is not immune to it. To support payment innovation, a lot of codes are introduced.

Same here. The description seems to suggest that here https://github.com/OHDSI/CommonDataModel/blob/master/Documentation/CommonDataModel_Wiki_Files/StandardizedVocabularies/SOURCE_TO_CONCEPT_MAP.md

We need to clean this up if we want to use this table.

Yup, same here.

Very interested in knowing this. Adding local 2-billion concepts to the OMOP vocabulary tables (i.e. concept, concept ancestor, concept relationship) is not easy.
Source to Concept Map may be an easier alternative.

  1. What do we do when we can’t map to any OMOP standard code but can map to non-standard OMOP vocabulary codes?
  2. What do we do when we can’t map to any OMOP vocabulary codes?

@chris_knoll does webapi/circe-be support Source to Concept Map table?


This is more of a ‘concept set’ question, but if you have custom source concepts for which you can provide a ‘Maps to’ mapping to an existing concept in the concept_relationship table, then you can use the ‘mapped’ option in a concept set.

However, in the case you’re talking about, the source_to_concept_map table just maps a source code value to a target concept. If you’ve created custom concepts in your CDM so that they appear in CONCEPT, then you can just add the custom concept directly to your concept set expression, and it will be used in the query.

Note that custom concepts are not unique across different CDM nodes: there is no guarantee that one CDM’s concept 2bil+1 is the same as another CDM’s 2bil+1 concept. So use caution when creating studies that leverage custom concepts. The durable approach would be to get the concept standardized in the OMOP CDM and then map your custom source values to the standard concepts.

The structure of the table is very simple, for every source code you just list where you think it should map to the standard terminology. Here are our examples that we can share (i.e. the non-proprietary containing lists).

Additionally there are tools to help you map source codes to standard terminology and then ultimately produce a SOURCE_TO_CONCEPT_MAP table (see Usagi).

One of our databases’ coding systems is considered proprietary. Our mapping method combines information provided by the vendor, which links the proprietary codes to other source codes in the OMOP Vocabulary, with mapping by USAGI to get the proprietary codes mapped to standard terminology.

To use it in an ETL, if you open one of our ETL documents, in the “Source to Standard Terminology” section you’ll see we have a standard query. This query either pulls from the Vocabulary or the SOURCE_TO_CONCEPT_MAP table. If I need a map we generated, I filter my query using one of our defined VOCABULARY_IDs found in the SOURCE_TO_CONCEPT_MAP like this instead:


But if I need a standard map found in the OMOP Vocabulary I would use the same query referenced above and call it with filters like this:

Once I have built all our SOURCE_TO_CONCEPT_MAP files, I just load them into the table. What I like about this is if I do something wrong in the load, I can just truncate this table because the OMOP Vocabulary doesn’t use this table for anything. I’m not touching any of the core OMOP Vocabulary tables and can’t accidentally screw them up.

No, when you use the SOURCE_TO_CONCEPT_MAP you don’t need to worry about giving them CONCEPT_IDs and managing any of that.

This assumes that your proprietary source code has a concept (standard or non-standard) in the CDM vocabulary. There’s a gazillion concepts in the CONCEPT table, so you will probably find something that is close enough to what the proprietary source code represents. Remember: the point of the standardized vocabulary is that you only have one concept to represent a single ‘idea’. If you inject duplicate standard ‘ideas’ into the concept table, then the researcher has to know all the different ways of identifying the same ‘thing’ when locating data. As @ericaVoss mentioned, Usagi is a great tool to help you use term/string matching to find a matching concept for your source code value (even better if your source code value has a lengthy name that describes what the source code represents).

We even add codes in when we don’t get them mapped to something, so it is clear we couldn’t find a good standard. We just set TARGET_CONCEPT_ID to 0 (see example). Also, if we start with 100 source codes from the data, it is comforting to end with 100 in the SOURCE_TO_CONCEPT_MAP even if some of them just map to 0. But again, even this is not required.

It is also okay in the SOURCE_TO_CONCEPT_MAP table to map one source code to multiple standard concepts. For example, I created a SOURCE_TO_CONCEPT_MAP for use in the CommonEvidenceModel where we need to map terms from EU Product Labels like “ABACAVIR, LAMIVUDINE”. In this case, this “source code” will map to two TARGET_CONCEPT_IDs: 1736971-abacavir and 1704183-Lamivudine.
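As a rough illustration of both points (unmapped codes kept with a target of 0, and one source code mapping to several targets), here is a sketch using an in-memory SQLite table with only a subset of the SOURCE_TO_CONCEPT_MAP columns. The 'EU_PL' vocabulary id, the descriptions, and the 'UNKNOWN TERM' row are assumptions made up for the example; the two target concept ids are the ones quoted above.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Subset of SOURCE_TO_CONCEPT_MAP columns, enough for the sketch.
CREATE TABLE source_to_concept_map (
    source_code TEXT,
    source_vocabulary_id TEXT,
    source_code_description TEXT,
    target_concept_id INTEGER
);
-- 'EU_PL' is a hypothetical vocabulary id chosen for this example.
INSERT INTO source_to_concept_map VALUES
    ('ABACAVIR, LAMIVUDINE', 'EU_PL', 'combination label term', 1736971),
    ('ABACAVIR, LAMIVUDINE', 'EU_PL', 'combination label term', 1704183),
    ('UNKNOWN TERM', 'EU_PL', 'no good standard found', 0);
""")

# Collect every target per source code; one code may have several rows.
targets = defaultdict(list)
query = (
    "SELECT source_code, target_concept_id FROM source_to_concept_map "
    "WHERE source_vocabulary_id = ? "
    "ORDER BY source_code, target_concept_id"
)
for code, target in conn.execute(query, ("EU_PL",)):
    targets[code].append(target)

for code, ids in targets.items():
    print(code, ids)
```

Every source code that went in still comes out, so codes with no good standard show up explicitly as a single target of 0 rather than silently disappearing.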

Why not something under the abacavir/lamivudine dose group? Perhaps this is a separate THEMIS question, but I’m not sure about splitting a single exposure to a combination into 2 separate drug_exposure records rather than using the combination drug form, if it exists in the vocabulary. If you do it the way you suggest, you won’t find this exposure if you look for the specific combination as the Clinical Dose Group.