
ETL from Unmapped Sources

@Mark_Danese:

I understand you are trying to be multilingual, just like the automated voice when you call your health insurance: “para Español, oprima dos”. But really, all the relevant communication happens in English around here. Because it’s hard enough already, and making everything bilingual, or trilingual in your case, is an enormous burden, either for the conversions, or for the tools, or for the analyst who has to know all the idiosyncrasies.

But this is actually not about OMOP vs. the others. It is about speaking proper OMOP, to be able to communicate with the others, which requires saying things in a commonly understood vocabulary. Only then are the queries truly interoperable.

"And why we build software that is data model agnostic.” It seems to me that something like a relational database management system would be truly agnostic, but the Generalized Data Model (GDM) is yet another data model that you have built optimized for ETL. When someone in the future decides to create an Improved Generalized Data Model that is not backwards compatible with GDM (because it uses a network or OO architecture instead of hierarchical), there will be more religious wars. And at some point you want to use the data to do something, and with 100,000,000s patients each with 10,000s of variables, the physical clustering of the data and the indices do matter and you have to make a decision how to actually store it.

@hripcsak You are correct – there is always a difference between the logical model (i.e., how the parts relate to one another) and the physical model (i.e., how the data is stored). The trick in all of this is to figure out when one is moving data around to organize it, and when one is translating the meaning of the data. Moving the data is more backwards compatible. Translating the meaning is less backwards compatible, depending on how it is done.

For GDM we optimized it for flexibility and don’t actually specify hierarchical vs. relational. We try to present it both ways, for exactly this reason. And we have played with hierarchical versions of OMOP too, for speed reasons. As you point out, the real issue is how to get things to run on a wide variety of data sources in a reasonable time. The alignment between the data model and the database storage is very important for data with hundreds of millions of people.

I am sorry if I have taken us off track here. I think the discussion was about whether it is ok to do the mapping later and my response was that the OMOP model has a life of its own outside the network and OHDSI tools, and this life doesn’t require all of the mappings. And, in my opinion, that is ok.

1 Like

I will. Let me tell you a little about our business because it influences our decisions.

Our/University of Colorado’s reason for using the OMOP CDM is a little different than most active forum users’. We use the OMOP CDM as our primary data warehouse and deliver (just starting this phase now) datasets to our customers (providers, managers in the hospital system doing QA/QI, PhD students, professors, and all our data collaborators). We do incremental loads from the source through the pipeline to OMOP every day. We do not do the analytics. If we couldn’t use the OMOP CDM to deliver datasets to our customers, maintaining OMOP would be prohibitive. All these factors give us a different perspective.

We did this mostly in parallel with the goal of mapping codes first. The Person table is easy, the Visit table took more work, but only because the source has a different view of a visit than the OHDSI community. It took a couple days of using Usagi to go from 86% to 99% of drugs administered to a patient being mapped to a standard concept, which was easy because drug names are fairly standardized. Conditions and Procedures are tough because the terminology varies from source names to target concepts. And social history (tobacco, alcohol, drugs) is a beast. We are also working on other data that belongs in the Observation table. This data is mostly unmapped, but also not used in network studies (pain, scoring systems, ventilator settings, etc.) @Christian_Reich.

Organizing the data is a lot of work and very error-prone. As @mgkahn stated, we take very large spreadsheets of local codes and put them into Usagi one domain at a time. Then the output is altered to fit the Concept and Concept Relationship tables (with @Christian_Reich’s help here), which is also an error-prone step. Next is uploading them to the necessary tables and testing for accuracy. All of this is very time consuming. Yes, a lot of work, but the time savings, especially with using the hierarchies, make the effort worth it, I hope :wink: We haven’t put this into production.
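As a rough illustration of that “alter the Usagi output to fit the Concept and Concept Relationship tables” step, here is a minimal SQL sketch. The staging table, its columns, and the 'UCHealth Local' vocabulary_id are assumptions for illustration, not the actual Usagi export layout:

```sql
-- Assumes the reviewed Usagi mappings were loaded into a staging table usagi_export
-- with source_code, source_code_description, and target_concept_id columns (illustrative names).
INSERT INTO concept (concept_id, concept_name, domain_id, vocabulary_id, concept_class_id,
                     standard_concept, concept_code, valid_start_date, valid_end_date)
SELECT 2000000000 + ROW_NUMBER() OVER (ORDER BY source_code),  -- local concepts live in the 2-billion range
       source_code_description,
       'Observation',                                          -- domain chosen during review
       'UCHealth Local',                                       -- hypothetical local vocabulary_id
       'Local Code',
       NULL,                                                   -- non-standard; the standard answer is the mapped target
       source_code,
       DATE '1970-01-01', DATE '2099-12-31'
FROM usagi_export;

-- 'Maps to' rows pointing each local concept at its reviewed standard target
INSERT INTO concept_relationship (concept_id_1, concept_id_2, relationship_id,
                                  valid_start_date, valid_end_date)
SELECT c.concept_id, u.target_concept_id, 'Maps to',
       DATE '1970-01-01', DATE '2099-12-31'
FROM usagi_export u
JOIN concept c
  ON c.concept_code = u.source_code
 AND c.vocabulary_id = 'UCHealth Local';
```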

During the ETL, I would put only the necessary data into the source_value column: not everything, just what is going to be used. I suggest putting in the local code and the name separated with a delimiter. You will want to get on the standard side of the model as quickly as possible to leverage the vocabulary. The power of the OMOP CDM is in the vocabularies! Leveraging one concept and its hierarchy is much easier than using a string search to find all relevant concepts and then listing out each concept (error-prone).
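To make the hierarchy point concrete, here is a small sketch of the two approaches; the diabetes concept_id is used purely for illustration:

```sql
-- One ancestor concept and everything underneath it, via the vocabulary hierarchy
SELECT co.person_id, co.condition_concept_id
FROM condition_occurrence co
JOIN concept_ancestor ca
  ON ca.descendant_concept_id = co.condition_concept_id
WHERE ca.ancestor_concept_id = 201820;   -- SNOMED 'Diabetes mellitus' (example concept_id)

-- The brittle alternative: string-searching source values
SELECT person_id, condition_concept_id
FROM condition_occurrence
WHERE condition_source_value LIKE '%diabet%';  -- misses local codes whose names don't contain the string
```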

Put the data where it belongs or the tools won’t work. But if it doesn’t belong in any other domain, it goes into the Observation table and SQL will pull it out.

I’m happy to share or elaborate, @TCKeen!

Yes, we skipped the “ugly” Fact Relationship table and created a Care Site Hierarchy table instead. It’s connected to the OMOP Care Site table via a FK. It follows the OMOP Care Site table structure, except it has 6 layers/fields: system, region, site, operational group, facility, and department. Same structure, just expanded. And it’s very similar to, if not exactly the same as, our source system’s representation of care sites.
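For anyone curious what that could look like, here is a minimal DDL sketch; the table and column names are assumptions based on the description above, not the actual implementation:

```sql
-- Hypothetical care_site_hierarchy table: one row per care site, six hierarchy fields,
-- linked back to the standard OMOP care_site table by foreign key.
CREATE TABLE care_site_hierarchy (
  care_site_id           BIGINT NOT NULL REFERENCES care_site (care_site_id),
  system_name            VARCHAR(255),
  region_name            VARCHAR(255),
  site_name              VARCHAR(255),
  operational_group_name VARCHAR(255),
  facility_name          VARCHAR(255),
  department_name        VARCHAR(255),
  care_site_source_value VARCHAR(50)
);
```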

1 Like
  1. Here’s the thing, we are having to create custom, “standard” concepts to represent all of the things/ideas/data that do not have a standard representation from a standard vocabulary. My latest example is pain scores. We have ~650 unique representations for pain in our source. Our customers want all that data to study who has pain, whether people are being asked about their pain, what their pain score is, how they are treated, how effective the treatment is, and so on. Some of these map to standard concepts (0-10 pain score, pain duration, pain characteristics) and others don’t (pain intervention, response to relief measures, different pain scales). So, I have taken the liberty to create custom, standard concepts :open_mouth:. These are concepts that would otherwise be mapped to concept_id = 0. I have also created hierarchies in the Concept Ancestor table for these now-standard concepts to make life easier for the query writers (see the sketch after this list). One concept and all its descendants, or 650 individual concepts? And that 650 is an increasing number. If these things had a representation from a widely accepted vocabulary, I would bring it to the OHDSI vocabulary folks, but they don’t. Pain is all over the news and the literature, but lacks standardization. However, that doesn’t stop the researchers from wanting to research it or the QA/QI folks from wanting to know what’s happening in our house.

  2. This also keeps me awake at night. I am looking forward to a solution that doesn’t involve me remapping codes every time a vocabulary is refreshed.
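Here is the kind of thing those rows could look like; the concept_ids, names, and local vocabulary_id are hypothetical, shown only to illustrate the 2-billion-range convention and the ancestor trick:

```sql
-- A site-defined 'standard' concept for something with no standard representation
INSERT INTO concept (concept_id, concept_name, domain_id, vocabulary_id, concept_class_id,
                     standard_concept, concept_code, valid_start_date, valid_end_date)
VALUES (2000001000, 'Pain intervention - local', 'Observation', 'UCHealth Local',
        'Local Code', 'S', 'PAIN-INT-001', DATE '1970-01-01', DATE '2099-12-31');

-- Hang it under a local grouping concept, so queries ask for one ancestor instead of ~650 leaves
INSERT INTO concept_ancestor (ancestor_concept_id, descendant_concept_id,
                              min_levels_of_separation, max_levels_of_separation)
VALUES (2000000999,   -- hypothetical 'Pain assessment and management' grouper
        2000001000, 1, 1);

-- Query writers then need only the ancestor
SELECT o.person_id, o.observation_concept_id, o.value_as_number
FROM observation o
JOIN concept_ancestor ca ON ca.descendant_concept_id = o.observation_concept_id
WHERE ca.ancestor_concept_id = 2000000999;
```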

1 Like

From a practical point of view - financially and opportunistically - I have also been taking @mgkahn’s approach. Typically there are a number of locally funded projects that can be served effectively by _source_concept_id first, _concept_id mapping later. I think it is a long-term objective to move everyone’s thinking toward the OMOP model to support/enable multi-site studies. But until there is significant financial pull, I for one will be primarily driven by local needs first. Another pragmatic issue for us is that we want to be able to conduct cohort discovery utilizing detailed medical device observations and measurements, which for some studies is much more accurate than utilizing summarized (data-reduction) OMOP standard vocabularies, particularly when existing predictive analytics are involved. This means creating local medical device vocabularies first (utilizing measurement_source_concept_id), getting the data into the OMOP database, and using it. Then we work on proposing a new vocabulary to the OHDSI community, which I still need to find the time to do. If I had the financial backing and time, I’d go with Christian’s model.
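For what it’s worth, the “use the local vocabulary before the standard mapping exists” pattern in query form might look something like this; the local concept_ids are hypothetical device-setting concepts in the 2-billion range:

```sql
-- Cohort discovery against local device concepts via measurement_source_concept_id,
-- before any measurement_concept_id mapping to a standard vocabulary has been done.
SELECT m.person_id, m.measurement_datetime, m.value_as_number, m.unit_source_value
FROM measurement m
WHERE m.measurement_source_concept_id IN (2000002001, 2000002002)  -- e.g. local ventilator-setting concepts
  AND m.measurement_datetime >= DATE '2018-01-01';
```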

Friends:

This issue is now boiling over. Folks have trouble with their source data: there is no way to organize non-standard vocabularies, incorporate them as concepts, and map them to standard ones.

We said we’d come up with a proposal to fix this problem. Here is one:

We want to build a website for anybody to manage their data, and essentially do what the vocabulary team does on a large scale for the Standardized Vocabularies. Something like:

  •   Upload of spreadsheets, with or without concept_codes, including deprecations
  •   Manage the vocabulary_id
  •   If necessary, auto-create the concept_code or recognize you already have one
  •   Auto-create the concept_id or recognize you already have one
  •   Upload mappings (or later create mappings through an integrated USAGI) to standard concepts
  •   Maintain rules (can’t map to deprecated codes, etc.)

Once you have all this, you export your new concepts together with the Standardized Vocabularies from ATHENA.
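A sketch of how such an exported bundle might be consumed on the receiving end, assuming PostgreSQL and that the local export follows the same tab-delimited layout as the Athena files (paths and the local vocabulary_id are placeholders):

```sql
-- PostgreSQL-flavoured sketch; Athena vocabulary files are tab-delimited,
-- and QUOTE E'\b' effectively disables CSV quoting while loading them.
COPY concept FROM '/vocab/athena/CONCEPT.csv'      WITH (FORMAT csv, DELIMITER E'\t', HEADER true, QUOTE E'\b');
COPY concept FROM '/vocab/local/CONCEPT_LOCAL.csv' WITH (FORMAT csv, DELIMITER E'\t', HEADER true, QUOTE E'\b');

-- Sanity check: locally managed concepts should stay out of the OHDSI-managed id range
SELECT COUNT(*) AS misplaced_local_concepts
FROM concept
WHERE vocabulary_id = 'UCHealth Local'
  AND concept_id < 2000000000;   -- expect 0
```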

Thoughts? @mgkahn? @MPhilofsky? @mkwong? @esholle? @daniellameeker? @Mark_Danese?

C

2 Likes

And we need a Greek name for that thing. :smile:

Hi, I think this would be a good start. Posting and sharing work in progress would also give the vocab team easy access to it for adoption. Thanks

Anything that makes mapping easier has to be a good idea. Ptolemy is an obvious choice, but any Greek cartographer would do.

1 Like

@Christian_Reich,

Awesome! @mgkahn and I love the idea! We have a couple of questions:

  1. Will the tool support the creation of local “standard” concept ids > 2000000000?
  2. Will the tool support the creation of local “classification” concept ids > 2000000000?
  3. Will the tool provide an export file suitable for upload to all the Standardized Vocabulary tables?

@MPhilofsky:

  1. That’s the idea. But the tool will probably put pressure on you to make them non-standard, and map them, using a built-in USAGI.
  2. Same.
  3. Yes. The idea is that when you download from Athena your dogfood will be included in the zip file. But nobody else will get it.

And then we need some way to share them, because all these silly lab test names are probably often repeated across the institutions.

Hi, I want to learn something about the Drug domain. I always get some unpredictable results when I map my concepts. Is there a way to have Usagi match the source code string to the concept code string, instead of semantically similar terms? My procedure source terms are producing high match scores to similar procedures, but not the exact CPT source procedure. Going line by line to pick out the exact match would take forever.

Hi,

In this case you don’t need Usagi; you just find the concept with the minimal Levenshtein distance.
But even this will not work, because, let’s say, “Aspirin 50 MG Oral Tablet” and “Aspirin 10 MG Oral Tablet” are very close but different concepts, while
“Aspirin 50 MG Oral Tablet” and “Aspirin 50 MG Oral coated tablet” represent the same drug but have a bigger distance than the example above.
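A quick way to see that effect, assuming a PostgreSQL database with the fuzzystrmatch extension (other DBMSs have different or no built-in edit-distance functions):

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Different drugs, distance of only 1
SELECT levenshtein('Aspirin 50 MG Oral Tablet', 'Aspirin 10 MG Oral Tablet');

-- Same drug, distance of 8 (7 inserted characters plus the case change)
SELECT levenshtein('Aspirin 50 MG Oral Tablet', 'Aspirin 50 MG Oral coated tablet');
```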

We built an algorithm that extracts the logical attributes that make up the Drug concept:
Ingredient, Dose Form, Dosage, Brand Name, Quantity, Supplier. Then we map them separately to the standard attributes.
This way we get an accurate drug mapping.
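As a very rough sketch of that extraction step (PostgreSQL regexp syntax; the patterns are simplified assumptions, and real source names need many more rules):

```sql
WITH src AS (
  SELECT 'Aspirin 50 MG Oral Tablet [Halfprin]'::text AS drug_name
)
SELECT
  (regexp_match(drug_name, '^([A-Za-z ]+?) \d'))[1]   AS ingredient,  -- 'Aspirin'
  (regexp_match(drug_name, '(\d+(\.\d+)?) MG'))[1]    AS dosage_mg,   -- '50'
  (regexp_match(drug_name, 'MG (.*?)( \[|$)'))[1]     AS dose_form,   -- 'Oral Tablet'
  (regexp_match(drug_name, '\[([^\]]+)\]'))[1]        AS brand_name   -- 'Halfprin'
FROM src;
```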

Thank you very much! I am so excited about your reply! But I am sorry that I still have some problems to solve. If I map them separately to the standard attributes, do I only need one of them as the source term each time, and do I still need to use Usagi to do it or not?

Well, the main point in the mapping process is to define those attributes,
and then you just use simple name matching against concept_synonym_name.
Sometimes, yes, you really need a fuzzy matching algorithm, e.g. Usagi, or semantic analysis, but most of the cases will match by name equality.
For example:
Aspirin 50 MG Oral Tablet [Halfprin]
Aspirin ->1112807 Aspirin
Oral Tablet -> 19082573 Oral Tablet
Halfprin -> 19068001 Halfprin
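A minimal sketch of that name-equality lookup, assuming the extracted attributes sit in a staging table called drug_attribute_stage (an illustrative name):

```sql
-- Exact-name matches against the primary concept name
SELECT s.attribute_text, c.concept_id, c.concept_name, c.concept_class_id
FROM drug_attribute_stage s
JOIN concept c
  ON LOWER(c.concept_name) = LOWER(s.attribute_text)
 AND c.vocabulary_id = 'RxNorm'
 AND c.concept_class_id IN ('Ingredient', 'Dose Form', 'Brand Name');

-- Fall back to synonyms when the primary name does not match exactly
SELECT s.attribute_text, cs.concept_id
FROM drug_attribute_stage s
JOIN concept_synonym cs
  ON LOWER(cs.concept_synonym_name) = LOWER(s.attribute_text);
```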

Thank you again! Does Usagi also use the algorithm you describe? My source terms are in Chinese. Does that make a difference?

Oh, I forgot to say my concept names are also in Chinese.

No, Usagi doesn’t do that.

Of course, you need to translate your concepts into English.

Here is an explanation of how to create the tables:
http://www.ohdsi.org/web/wiki/doku.php?id=documentation:international_drugs

Here is a script that builds these tables together, matching the concepts:
