
ETL from Unmapped Sources

From a practical point of view - financially and opportunistically - I have also been taking @mgkahn’s approach. Typically there are a number of locally funded projects that can be served effectively by a _source_concept_id first, _concept_id mapping later strategy. I think moving everyone’s thinking toward the OMOP model to support and enable multi-site studies is a long-term objective, but until there is significant financial pull, I for one will be driven primarily by local needs. Another pragmatic issue for us is that we want to be able to conduct cohort discovery using detailed medical device observations and measurements, which for some studies is much more accurate than using the summarized (data-reduced) OMOP standard vocabularies, particularly when existing predictive analytics are involved. This means creating local medical device vocabularies first (populating measurement_source_concept_id), getting the data into the OMOP database, and using it. Then we can work on proposing a new vocabulary to the OHDSI community - which I still need to find the time to do. If I had the financial backing and time, I’d go with Christian’s model.
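For illustration only, here is a minimal sketch of that “source concept first, standard concept later” pattern for a device measurement. The column names come from the OMOP CDM MEASUREMENT table and the 2-billion offset is the conventional range for local concept_ids; the signal code, values, and helper function are made up.

```python
# Illustrative sketch: load a MEASUREMENT row against a local device concept first;
# the standard measurement_concept_id stays 0 until a mapping has been curated.
LOCAL_ID_OFFSET = 2_000_000_000  # concept_ids >= 2 billion are reserved for local concepts

measurement_row = {
    "measurement_id": 1,
    "person_id": 12345,
    "measurement_concept_id": 0,                            # no standard mapping yet
    "measurement_source_value": "VENT_PEEP_CMH2O",          # hypothetical device signal code
    "measurement_source_concept_id": LOCAL_ID_OFFSET + 17,  # concept in the local device vocabulary
    "value_as_number": 8.0,
    "unit_source_value": "cmH2O",
}

def apply_standard_mapping(row, source_to_standard):
    """Fill in measurement_concept_id once a local-to-standard mapping exists."""
    row["measurement_concept_id"] = source_to_standard.get(
        row["measurement_source_concept_id"], 0
    )
    return row
```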

Friends:

This issue is now boiling over. Folks have trouble with their source data: there is no way to organize non-standard vocabularies, incorporate them as concepts, and map them to standard ones.

We said we’d come up with a proposal to fix this problem. Here is one:

We want to build a website for anybody to manage their data, and essentially do what the vocabulary team does on a large scale for the Standardized Vocabularies. Something like:

  •   Upload of spreadsheets, with or without concept_codes, including deprecations
    
  •   Manage the vocabulary_id
    
  •   If necessary, auto-create the concept_code or recognize you already have one
    
  •   Auto-create the concept_id or recognize you already have one
    
  •   Upload mappings (or later create mappings through an integrated USAGI) to standard concepts
    
  •   Maintain rules (can’t map to deprecated code, etc.) 
    

Once you have all this, you export your new concepts together with the Standardized Vocabularies from ATHENA.
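To make the feature list a bit more tangible, here is a rough sketch of two of those pieces - auto-creating concept_ids in the local range and enforcing the “no mapping to deprecated concepts” rule. The 2-billion offset follows the OMOP convention for local concept_ids; the function names and in-memory structures are just assumptions for illustration.

```python
# Sketch of two of the proposed features: auto-creating concept_ids in the
# local range and refusing mappings that point at deprecated target concepts.
import itertools

LOCAL_ID_START = 2_000_000_000
_next_local_id = itertools.count(LOCAL_ID_START)  # a real tool would persist this counter

def get_or_create_concept_id(concept_code, existing):
    """Reuse an already-assigned concept_id for this code, or mint a new local one."""
    if concept_code in existing:
        return existing[concept_code]
    new_id = next(_next_local_id)
    existing[concept_code] = new_id
    return new_id

def validate_mapping(source_code, target_concept):
    """Rule example: you can't map to a deprecated or non-standard target concept."""
    if target_concept.get("invalid_reason") is not None:
        raise ValueError(f"{source_code}: target {target_concept['concept_id']} is deprecated")
    if target_concept.get("standard_concept") != "S":
        raise ValueError(f"{source_code}: target {target_concept['concept_id']} is not standard")
    return True
```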

Thoughts? @mgkahn? @MPhilofsky? @mkwong? @esholle? @daniellameeker? @Mark_Danese?

C


And we need a Greek name for that thing. :smile:

Hi, I think this would be a good place to post and share work in progress, and it would give the vocab team easy access to that work for adoption too. Thanks

Anything that makes mapping easier has to be a good idea. Ptolemy is an obvious choice, but any Greek cartographer would do.


@Christian_Reich,

Awesome! @mgkahn and I love the idea! We have a couple of questions:

  1. Will the tool support the creation of local “standard” concept ids > 2000000000?
  2. Will the tool support the creation of local “classification” concept ids > 2000000000?
  3. Will the tool provide an export file suitable for upload to all the Standardized Vocabulary tables?

@MPhilofsky:

  1. That’s the idea. But the tool will probably put pressure on you to make them non-standard, and map them, using a built-in USAGI.
  2. Same.
  3. Yes. The idea is that when you download from Athena, your own “dogfood” will be included in the zip file (see the sketch below). But nobody else will get it.
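A toy sketch of that export step, assuming the tab-delimited CONCEPT.csv that comes in the Athena download; the column list matches the CONCEPT table, while the function and merge logic are purely illustrative:

```python
# Toy sketch: append locally managed concepts to the CONCEPT.csv from the Athena
# download, so the combined file can be loaded into the CDM vocabulary tables.
import csv

CONCEPT_COLUMNS = [
    "concept_id", "concept_name", "domain_id", "vocabulary_id",
    "concept_class_id", "standard_concept", "concept_code",
    "valid_start_date", "valid_end_date", "invalid_reason",
]

def append_local_concepts(athena_concept_csv, local_concepts):
    """local_concepts: iterable of dicts keyed by CONCEPT_COLUMNS."""
    with open(athena_concept_csv, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CONCEPT_COLUMNS, delimiter="\t")
        for row in local_concepts:
            assert int(row["concept_id"]) >= 2_000_000_000, "local IDs must stay in the local range"
            writer.writerow(row)
```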

And then we need some way to share them, because all these silly lab test names are probably repeated across institutions.

Hi, I want to learn something about the Drug domain. I always get unpredictable results when I map my concepts. Is there a way to have Usagi match the source code string to the concept code string, instead of semantically similar terms? My procedure source terms are producing high match scores to similar procedures, but not the exact CPT source procedure. Going line by line to pick out the exact match would take forever.

Hi,

In this case you don’t need Usagi; you could just find the concept with the minimal Levenshtein distance.
But even that will not work, because, let’s say, “Aspirin 50 MG Oral Tablet” and “Aspirin 10 MG Oral Tablet” are very close but different concepts, while
“Aspirin 50 MG Oral Tablet” and “Aspirin 50 MG Oral coated tablet” represent the same drug but have a bigger distance than the pair above.
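To make that concrete, here is a minimal edit-distance check (plain Python, no external libraries) showing that the wrong-strength product actually comes out as the closer string:

```python
# Minimal Levenshtein distance, showing why raw string distance misleads for drugs.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

source = "Aspirin 50 MG Oral Tablet"
print(levenshtein(source, "Aspirin 10 MG Oral Tablet"))         # 1 -- different drug, tiny distance
print(levenshtein(source, "Aspirin 50 MG Oral coated tablet"))  # 8 -- same drug, larger distance
```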

We built an algorithm that extracts the logical attributes that make up a Drug concept:
Ingredient, Dose Form, Dosage, Brand Name, Quantity, Supplier. Then we map each attribute separately to the standard attributes.
This way we get an accurate drug mapping.
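A sketch of that attribute-based idea; the regular expression and attribute names below are only illustrative - real source data needs much more robust parsing than this:

```python
# Illustrative attribute extraction for a drug name like
# "Aspirin 50 MG Oral Tablet [Halfprin]".
import re

DRUG_PATTERN = re.compile(
    r"^(?P<ingredient>.+?)\s+(?P<dosage>\d+(\.\d+)?\s*MG)\s+(?P<dose_form>.+?)"
    r"(\s+\[(?P<brand_name>.+?)\])?$"
)

def extract_attributes(drug_name):
    m = DRUG_PATTERN.match(drug_name)
    return m.groupdict() if m else {}

print(extract_attributes("Aspirin 50 MG Oral Tablet [Halfprin]"))
# {'ingredient': 'Aspirin', 'dosage': '50 MG', 'dose_form': 'Oral Tablet', 'brand_name': 'Halfprin'}
```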

Thank you very much! I am so excited about your reply! But I am sorry, I still have a problem to solve. If I map them separately to the standard attributes, do I only need one of them to be the source term each time, and do I still need to use Usagi to do it or not?

Well, the main point of the mapping process is to define those attributes,
and then you just use simple name matching against the concept_synonym names.
Sometimes, yes, you really need a fuzzy matching algorithm (e.g. Usagi) or semantic analysis, but most cases will match by name equality.
For example:
Aspirin 50 MG Oral Tablet [Halfprin]
Aspirin -> 1112807 Aspirin
Oral Tablet -> 19082573 Oral Tablet
Halfprin -> 19068001 Halfprin
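The lookup itself can then be plain equality matching against concept and synonym names. The concept_ids below are the ones quoted above; the in-memory dictionary is just a stand-in for querying the CONCEPT and CONCEPT_SYNONYM tables:

```python
# Exact-name lookup of extracted attributes against (a stand-in for) the
# CONCEPT / CONCEPT_SYNONYM names; fuzzy matching is only the fallback.
NAME_TO_STANDARD = {
    # lower-cased concept/synonym name -> standard concept_id
    "aspirin": 1112807,        # Ingredient
    "oral tablet": 19082573,   # Dose Form
    "halfprin": 19068001,      # Brand Name
}

def map_attribute(name):
    """Return the standard concept_id for an attribute name, or None to hand off to fuzzy matching."""
    if name is None:
        return None
    return NAME_TO_STANDARD.get(name.strip().lower())

attributes = {"ingredient": "Aspirin", "dose_form": "Oral Tablet", "brand_name": "Halfprin"}
print({k: map_attribute(v) for k, v in attributes.items()})
# {'ingredient': 1112807, 'dose_form': 19082573, 'brand_name': 19068001}
```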

Thank you again! Does Usagi also use the algorithm you describe? My source terms are in Chinese; do you know what the difference between them is?

Oh, I forgot to say that my concept names are also in Chinese.

No, Usagi doesn’t do that.

Of course, you need to translate your concepts into English.

Here is an explanation of how to create the tables:
http://www.ohdsi.org/web/wiki/doku.php?id=documentation:international_drugs

And here is a script that builds these tables together, matching the concepts.

Hi, I still don’t understand why drug mapping can’t be accurate enough. Can you tell me the reason in detail? I am Chinese, so my drugs are also in Chinese, and I have only been in this field for one month. Thank you very much!

You can basically map your drugs to RxNorm/RxNorm Extension in any way that you like. You may use Usagi, but it’s usually pretty biased when it comes to drugs, and you also need to translate the names of the drugs first to use it. The second option is to map them manually (which is quite time-consuming, and you’ll have to do it over and over again when the next refresh of your vocabulary comes). Or you may use the standard approach with scripts that find the standard counterparts for your drugs. This approach requires the creation of intermediate tables where everything needs to be in English, though. Bonuses: you get a reproducible and more or less automated process; the result is more reliable; and you can contribute the vocabulary to the OMOP vocabularies set, so it will be available on Athena and will capture all the details from your source vocabulary (like Chinese brand names or manufacturers, even if they don’t exist in current RxNorm).
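To give a feel for what “intermediate tables in English” means, here is a heavily simplified, hypothetical staging row for one source drug; the real staging tables are described in the international drugs documentation linked above and carry more fields:

```python
# Hypothetical, simplified example of an intermediate ("staging") row for one
# source drug, with the name already translated to English. Every value here is
# made up; the real staging tables have additional columns.
drug_concept_stage_row = {
    "concept_code": "CN-000123",                                    # code from the local source vocabulary
    "concept_name": "Aspirin 50 MG Oral Tablet [SomeLocalBrand]",   # English translation of the source name
    "vocabulary_id": "My_Chinese_Drugs",                            # hypothetical local vocabulary_id
    "concept_class_id": "Drug Product",
    "domain_id": "Drug",
}
```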
