I ACTUALLY DIDN’T TATTLE THIS TIME!
agree with @patrick_ryan. But need to know how to do it the right way
i have been thinking about how to handle proprietary codesets, and don’t have a good solution. There are proprietary codes that are specific to one organization, and nobody outside has any interest in or use for these codes. Describing these codes in some OMOP tables is one option, but then it duplicates work that is already being done by the source system, creating a data duplication and maintenance nightmare. I don’t know the answer.
Maybe it is source_to_concept_map – just need to study it
@Gowtham_Rao as is discussed above, if it’s a proprietary vocabulary we put these in the SOURCE_TO_CONCEPT_MAP complete with all available information from the source. That way, even if you can’t map it to a standard concept you retain the vocabulary_id, etc. for later use. In our data, if there is an ICD9 code, for example, that doesn’t have a concept_id, this is often due to an error in the code and is useless to us anyway. In Truven CCAE, the top code with a source_concept_id of 0 that is not proprietary occurs 893 times and is not mapped because it is probably an ICD9 code even though the database told us it should have been an ICD10. It occurs correctly as an ICD9 code over 27 million times.
We have so very many custom source codes. They are throughout every domain of our EHR data. I’m throwing out our process in hopes of getting some feedback. The following refers to mapping our EHR data. I know the claims data is very different.
When we first started our ETL process, we were under the impression the Source to Concept Map table was being deprecated. So, we decided to use the Concept and Concept Relationship tables in place of the Source to Concept Map. We create a >2 billion concept_id for each of our custom source codes. We add in all the attributes for every concept. Then we map the >2 billion concepts to standard concepts in the Concept Relationship table. We use a combination of Usagi, Atlas, and hand mapping.
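To make that pattern concrete, here is a minimal, self-contained sketch of the >2 billion concept approach (Python with sqlite3; the table columns are trimmed to the essentials, and the specific codes, names, and target concept_id are illustrative, not taken from any site’s actual mappings):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT, vocabulary_id TEXT,
    concept_code TEXT, standard_concept TEXT);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER, concept_id_2 INTEGER,
    relationship_id TEXT);
""")

# A standard OMOP concept (id/name shown for illustration).
conn.execute("INSERT INTO concept VALUES "
             "(201826, 'Type 2 diabetes mellitus', 'SNOMED', '44054006', 'S')")

# A local custom source code gets a concept_id above 2 billion
# and is non-standard (standard_concept is NULL).
conn.execute("INSERT INTO concept VALUES "
             "(2000000001, 'DM2 local flag', 'OUR_EHR', 'LOCAL_DM2', NULL)")

# A 'Maps to' relationship points the custom concept at the standard one.
conn.execute("INSERT INTO concept_relationship VALUES "
             "(2000000001, 201826, 'Maps to')")

# Resolve the local code to its standard concept via the relationship.
row = conn.execute("""
    SELECT cr.concept_id_2
    FROM concept c
    JOIN concept_relationship cr ON cr.concept_id_1 = c.concept_id
    WHERE c.concept_code = 'LOCAL_DM2' AND cr.relationship_id = 'Maps to'
""").fetchone()
print(row[0])  # 201826
```

The point of the sketch is only the shape of the data: custom concepts live in CONCEPT with >2B ids, and the mapping lives in CONCEPT_RELATIONSHIP rather than in a separate mapping table.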
Two differences in the structure of the tables:

- Source to Concept Map has a source description field which is not present in the Concept table. And the Concept table has a concept name field which is not present in the Source to Concept Map table. I find the concept name field is adequate and granular enough that a source description of the code is not needed.
- The Concept Relationship table has relationship start and end dates. The Source to Concept Map table does not.
Our ETL takes the source value and matches it to the concept code field of the Concept table. Regardless of whether it is an OMOP-supported concept or a custom (>2 billion) concept, the ETL will find the concept_id associated with the code. If the code is not a standard concept, the ETL looks to the Concept Relationship table for a “Maps to” relationship where concept_id_2 is a standard concept. We do not have any source_concept_id = 0 EXCEPT when the source updates their source values before we create a custom source value.
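The lookup described above can be sketched in a few lines of Python (dictionaries stand in for the CONCEPT and CONCEPT_RELATIONSHIP tables; every code and id here is invented for illustration):

```python
# concept_code -> (concept_id, standard_concept flag);
# mixes an OMOP concept with a >2 billion custom concept
concept_by_code = {
    "44054006": (201826, "S"),        # standard concept
    "LOCAL_DM2": (2000000001, None),  # custom source code, non-standard
}

# 'Maps to' relationships: non-standard concept_id -> standard concept_id
maps_to = {2000000001: 201826}

def resolve(source_value):
    """Return (source_concept_id, standard_concept_id) for a source code."""
    concept_id, standard = concept_by_code.get(source_value, (0, None))
    if standard == "S":
        # Already standard: source and standard concept are the same.
        return concept_id, concept_id
    # Non-standard: follow the 'Maps to' relationship; 0 if unmapped.
    return concept_id, maps_to.get(concept_id, 0)

print(resolve("LOCAL_DM2"))  # (2000000001, 201826)
print(resolve("UNKNOWN"))    # (0, 0)
```

Note how, under this approach, an unmapped code only produces a 0 when the code itself is absent from CONCEPT, matching the “no source_concept_id = 0 except…” behavior described above.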
@ericaVoss: How do you use the Source to Concept Map table? I’m always open to ideas and other solutions to manage the custom source concepts.
@MPhilofsky
Thanks for describing your approach. It demonstrates again that we need to standardize the approach for mapping new, local source codes to target concepts. I think your 2 billion concept approach should be the standard way of dealing with local mappings (although we have been using the source_to_concept_map).
Some thoughts:
- In my opinion the source_description and source_name are different names for the same thing. We have been using source_description as the source_name of our local source codes. Using both seems redundant indeed.
- The source_to_concept_map table (stcm) does have valid_start_date and valid_end_date fields. Or do they have a different meaning?
- And I am curious how you have used Atlas for the concept mapping. Does Athena now offer the same features?
And the way we use the stcm is very similar to your approach, except that both the source code info and the ‘maps to’ relationship are contained in the stcm. For every source vocabulary we insert rows into the stcm, e.g. these procedure codes. No modification of the standard vocabulary tables is needed, except for adding the source vocabulary id to the vocabulary table.
Thank you. For the proprietary source codes, do you assign 2-billion+ concept ids? And the vocabulary id would also be a 2-billion+ one if it is a proprietary vocabulary, correct?
Do you also append these 2-billion+ codes to ohdsi maintained vocabulary files?
This is a common problem shared by many. Claims data is not immune to it. To support payment innovation, a lot of codes are introduced.
Same here. The description seems to suggest that here https://github.com/OHDSI/CommonDataModel/blob/master/Documentation/CommonDataModel_Wiki_Files/StandardizedVocabularies/SOURCE_TO_CONCEPT_MAP.md
We need to clean this up if we want to use this table.
Yup, same here.
Very interested in knowing this. Adding local 2-billion concepts to the OMOP vocabulary tables (i.e. concept, concept_ancestor, concept_relationship) is not easy.
Source to Concept Map maybe an easier alternative.
- what do we do when we can’t map to any omop standard code but can map to non standard omop vocabulary codes
- What do we do when we can’t map to any omop vocabulary codes
@chris_knoll does webapi/circe-be support Source to Concept Map table?
This is more of a ‘concept set’ question, but if you have custom source concepts that you can provide a mapping in concept_relationship table as a ‘Maps to’ to an existing concept, then you can use the ‘mapped’ option in a concept set.
However, the case you’re talking about: source_to_concept_map table just maps a source code value to a target concept. if you’ve created custom concepts in your CDM so that they appear in CONCEPT, then you can just add the custom concept directly to your concept set expression, and it will be used in the query.
Note that custom concepts are not unique across different CDM nodes: there is no protection that one site’s 2bil+1 concept is the same as another CDM’s 2bil+1 concept. So use caution when creating studies that leverage custom concepts. The durable approach would be to get the concept standardized in the OMOP CDM and then map your custom source values to the standard concepts.
The structure of the table is very simple, for every source code you just list where you think it should map to the standard terminology. Here are our examples that we can share (i.e. the non-proprietary containing lists).
Additionally there are tools to help you map source codes to standard terminology and then ultimately produce a SOURCE_TO_CONCEPT_MAP table (see Usagi).
One of our databases’ coding systems is considered proprietary. Our mapping method is a combination of using information provided by the vendor to link up to other source codes in the OMOP Vocabulary and mapping by Usagi to get the proprietary codes mapped to standard terminology.
To use it in an ETL, if you open one of our ETL documents, in the “Source to Standard Terminology” section you’ll see we have a standard query. This query either pulls from the Vocabulary or the SOURCE_TO_CONCEPT_MAP table. If I need a map we generated, I filter my query using one of our defined VOCABULARY_IDs found in the SOURCE_TO_CONCEPT_MAP like this instead:
WHERE SOURCE_VOCABULARY_ID IN ('JNJ_TRU_P_SPCLTY') AND TARGET_STANDARD_CONCEPT IS NOT NULL AND TARGET_INVALID_REASON IS NULL
But if I need a standard map found in the OMOP Vocabulary I would use the same query referenced above and call it with filters like this:
WHERE SOURCE_VOCABULARY_ID IN ('LOINC') AND TARGET_STANDARD_CONCEPT IS NOT NULL
Once I have built all our SOURCE_TO_CONCEPT_MAP files, I just load them into the table. What I like about this is if I do something wrong in the load, I can just truncate this table because the OMOP Vocabulary doesn’t use this table for anything. I’m not touching any of the core OMOP Vocabulary tables and can’t accidentally screw them up.
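Putting the pieces above together, here is a small end-to-end sketch of the STCM lookup with those filters (Python/sqlite3; the `JNJ_TRU_P_SPCLTY` vocabulary id comes from the post above, but the source codes, the 38004446 target id, and the trimmed column set are all invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE source_to_concept_map (
    source_code TEXT, source_vocabulary_id TEXT,
    target_concept_id INTEGER, target_standard_concept TEXT,
    target_invalid_reason TEXT)
""")
rows = [
    ("SPC01", "JNJ_TRU_P_SPCLTY", 38004446, "S", None),  # mapped
    ("SPC02", "JNJ_TRU_P_SPCLTY", 0, None, None),        # unmapped, kept for completeness
]
conn.executemany("INSERT INTO source_to_concept_map VALUES (?,?,?,?,?)", rows)

# Only rows that resolve to a valid standard concept are used in the ETL,
# mirroring the WHERE clause shown above.
mapped = conn.execute("""
    SELECT source_code, target_concept_id
    FROM source_to_concept_map
    WHERE source_vocabulary_id IN ('JNJ_TRU_P_SPCLTY')
      AND target_standard_concept IS NOT NULL
      AND target_invalid_reason IS NULL
""").fetchall()
print(mapped)  # [('SPC01', 38004446)]
```

Because everything lives in this one standalone table, a botched load really can be fixed by a simple truncate-and-reload, exactly as described.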
No, when you use the SOURCE_TO_CONCEPT_MAP you don’t need to worry about giving them CONCEPT_IDs and managing any of that.
This assumes that your proprietary source code has a concept (standard or non-standard) in the CDM vocabulary. There are a gazillion concepts in the CONCEPT table, so you will probably find something that is close enough to what the proprietary source code is representing. Remember: the point of the standardized vocabulary is that you have only one concept to represent a single ‘idea’. If you inject duplicate standard ‘ideas’ into the concept table, then the researcher has to know all the different ways of identifying the same ‘thing’ when locating data. As @ericaVoss mentioned: Usagi is a great tool to help you use term/string matching to find a matching concept for your source code value (even better if your source code value has a lengthy name that describes what the source code represents).
We even add codes in when we don’t get them mapped to something, so it is clear we couldn’t find a good standard. We just set TARGET_CONCEPT_ID to 0 (see example). Also, if we start with a 100 source codes from the data it is comforting to end with 100 in the SOURCE_TO_CONCEPT_MAP even if some of them just map to 0. But again, even this is not required.
It is also okay in the SOURCE_TO_CONCEPT_MAP table to map one source code to multiple standard concepts. For example, I created a SOURCE_TO_CONCEPT_MAP for use in the CommonEvidenceModel where we need to map terms from EU Product Labels like “ABACAVIR, LAMIVUDINE”. In this case, this “source code” will map to two TARGET_CONCEPT_IDs: 1736971 (abacavir) and 1704183 (lamivudine).
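A tiny sketch of that fan-out during the ETL (the two ingredient concept_ids come from the post above; the record layout and helper function are hypothetical):

```python
# One source term maps to two TARGET_CONCEPT_IDs in the STCM.
stcm = {
    "ABACAVIR, LAMIVUDINE": [1736971, 1704183],  # abacavir, lamivudine
}

def explode(source_value):
    """Emit one record per target concept for a multi-mapped source code."""
    return [
        {"drug_source_value": source_value, "drug_concept_id": cid}
        for cid in stcm.get(source_value, [0])  # fall back to 0 if unmapped
    ]

records = explode("ABACAVIR, LAMIVUDINE")
print(len(records))  # 2
```

Each emitted record keeps the same source value, so the original combination term is still recoverable from either row.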
Why not something under the abacavir/lamivudine dose group? Perhaps this is a separate THEMIS question, but I don’t know if guidance exists on splitting a single exposure to a combination drug into 2 separate drug_exposure records vs. using the combination drug form from the vocabulary. If you do it the way you suggest, you won’t find this exposure if you look for the specific combination as the Clinical Dose Group.
In this example, for CEM, I just need to associate evidence to ingredients, so this works fine. Also, sometimes the representation you need doesn’t exist in the CONCEPT table, so this could be the best representation that you can do.
Agreed! Lots of work. We do not (currently) use the “Concept Ancestor” table in our mapping efforts because it is reserved for “standard” concepts.
One table may be easier, BUT the Source to Concept Map doesn’t allow us to create a >2 billion source_concept_id. This is our lifeline back to the source, and it allows us to query on the source_concept_id when a mapping to a standard concept is not available or the mapping loses information.
Add it as a source_concept_id and ask the vocabulary team (@Dymshyts & @Christian_Reich) to add the code.
These > 2 billion concept_ids are source_concept_ids. Source_concept_ids are only for local folks/ local studies.
The Concept Relationship table also has valid start and end dates for the relationship. We default to a start date of 1970 and an end date of 2099. When an appropriate concept_id becomes available in OMOP, we will change the relationship end date and create the new concept relationship record.
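That lifecycle step can be sketched as follows (the record layout, the `supersede` helper, the 443238 replacement id, and the switch date are all invented; only the 1970/2099 defaults come from the post):

```python
from datetime import date

# A local 'Maps to' relationship with the default open-ended validity window.
rel = {
    "concept_id_1": 2000000001,
    "concept_id_2": 201826,
    "relationship_id": "Maps to",
    "valid_start_date": date(1970, 1, 1),
    "valid_end_date": date(2099, 12, 31),
}

def supersede(old_rel, new_target_id, switch_date):
    """End-date the old relationship and open a new one to the better concept."""
    ended = {**old_rel, "valid_end_date": switch_date}
    opened = {**old_rel, "concept_id_2": new_target_id,
              "valid_start_date": switch_date,
              "valid_end_date": date(2099, 12, 31)}
    return ended, opened

old, new = supersede(rel, 443238, date(2023, 6, 1))
```

The key design point is that the old row is kept (end-dated) rather than deleted, so historical ETL runs remain reproducible.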
We use Usagi to map our string names when there is a high likelihood of getting a very good match, as occurs with drugs and procedures. When I mapped our tobacco use concepts I ran into many problems using Usagi. First, I was unable to limit the domain because tobacco concepts are in multiple domains. Second, Usagi returned many rows that were not the closest match. I found it easier to search in Atlas and then create and export concept sets for tobacco concepts. Using Atlas is a very manual process, but I find it easier to find matching concepts.
The stcm actually does have a column source_concept_id (“A foreign key to the Source Concept that is being translated into a Standard Concept”), which you could combine with a >2 billion concept. In practice, we do not do this as it has little additional value.
Friends:
We discussed this in yesterday’s face-2-face meeting we had for THEMIS. Since this subject does not really belong there, let me describe the conclusions here. We are proposing to do the following:
- The right way to handle local codes and their mappings is the >2B CONCEPT and CONCEPT_RELATIONSHIP table structure. But it isn’t easy to use.
- We will expand PALLAS, allowing it to manage local codes
- Life cycle: constant concept_id, separate valid dates for concepts and relationships, configuration of other fields being constant or overwritten
- Different kinds of mappings (equivalent, up-hill)
- Synonyms
- Merging with Standardized Vocabularies (the public one)
- Full compliance with rules: Map to only one Standard (unless split explicitly desired), chaining of deprecated Target Concepts, handling of maps for disappeared source Concepts, etc.
- Versioning and tagging
- PALLAS could be run on ohdsi.org, or deployed locally (for proprietary codes)
Till we have this thing and it is robust, we keep the ugly SOURCE_TO_CONCEPT_MAP going. We also need to talk to @schuemie about adjusting Usagi.
There are a couple of ideas in this thread, but here I’m specifically talking about the “do we standardize what goes into the various source fields” part of the conversation. Based on the comments in this thread and conversation at the THEMIS F2F, here is what I think the recommendation and action are:
ACTION
Work with CDM WG to update the “Data Model Conventions” part of the wiki in the _SOURCE_VALUE description cell and change it to the below statement.
RECOMMENDATION
Verbatim information from the source data, typically used in ETL to map to CONCEPT_ID, and not to be used by any standard analytics. There is no standardization for these fields and these columns can be used as the local CDM builders see fit. A typical example would be an ICD-9 code without the decimal from an administrative claim as condition_source_value = '78702' which is how it appeared in the source.
Note - this also incorporates the previous verbiage that was there.
The other idea on this thread is how source codes should be managed. Should CONCEPT_IDs be created? Out of the THEMIS F2F it was agreed to support both approaches until there are tools that improve the ease of creating CONCEPT_IDs for native source codes.
I’m more confident in my SOURCE_TO_CONCEPT_MAP description than the 2B one, but I did my best to incorporate the above feedback. Improvements to the text below are welcome.
RECOMMENDATION
Improving the clarity of what the options are:
There are three approaches to handle source codes that are not in the OMOP Vocabulary (in order of complexity):
1. Leveraging the SOURCE_TO_CONCEPT_MAP: In the OMOP Vocabulary there is an empty table called the SOURCE_TO_CONCEPT_MAP. It is a simple table structure that allows you to establish mapping(s) from each source code to a standard concept in the OMOP Vocabulary (TARGET_CONCEPT_ID). This work can be facilitated by the OHDSI tool Usagi, which does text similarity matching between your source code descriptions and the OMOP Vocabulary and exports mappings in the SOURCE_TO_CONCEPT_MAP table structure. Example SOURCE_TO_CONCEPT_MAP files can be found here. These generated SOURCE_TO_CONCEPT_MAP files are then loaded into the OMOP Vocabulary’s empty SOURCE_TO_CONCEPT_MAP prior to processing the native data into the CDM, so that the CDM builder can use them in a build.
2. Adding CONCEPT.CONCEPT_IDs: When a source code is not supported by the OMOP Vocabulary, one can create new records in the CONCEPT table; however, the CONCEPT_IDs should start above 2000000000 so that it is easy to distinguish between the OMOP Vocabulary concepts and the site-specific concepts. Once those concepts exist, CONCEPT_RELATIONSHIP records can be generated to map them to standard terminologies; Usagi can facilitate this process as well.
3. Work with ODYSSEUS Data Services to add to the OMOP Vocabulary: The OMOP Vocabulary is an evolving thing and new vocabularies can be added; however, working with the ODYSSEUS team is the best way to manage this task.
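For approach 2, a quick sanity check before loading custom concepts might look like this (a sketch; the 2000000000 floor is the convention described above, the record layout is invented):

```python
RESERVED_FLOOR = 2_000_000_000  # site-specific concept_ids must be above this

custom_concepts = [
    {"concept_id": 2_000_000_001, "concept_code": "LOCAL_A"},
    {"concept_id": 2_000_000_002, "concept_code": "LOCAL_B"},
]

# Any custom concept at or below the floor would collide with ids the
# OMOP Vocabulary itself may assign, so reject the load if one is found.
bad = [c for c in custom_concepts if c["concept_id"] <= RESERVED_FLOOR]
assert not bad, f"concept_ids at or below {RESERVED_FLOOR}: {bad}"
print("ok")  # prints "ok" when all custom ids are in the >2B range
```

A check like this is cheap to run on every build and guards against the id-collision concern raised earlier in the thread.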
ACTION
Add this description with the CDM WG to the FAQ section.
Looks like discussion has quieted down on this THEMIS recommendation. Are we comfortable with it?
As long as we reverse the order, fine.