
ETL from Unmapped Sources

tl;dr

Are there any approaches to mapping source data to recognised concepts iteratively, in parallel with (or after) building the ETL to OMOP?

What is it like to map a large (uncoded) dataset by hand with Usagi?

Hi Everyone, I work in a teaching hospital in the UK, and am in need of some advice about the OMOP CDM.

We are just in the process of deciding which data model to use for a new clinical data warehouse, which is coming as part of a programme of investment in IT. The first stage of the process has been to pull all of the data we need (ADT, Pharmacy etc.) into a NoSQL staging area. From there, we want to ETL the data into something mere mortals like me can access…

We’ve been considering a few options. Two main contenders have emerged. One has been to load all of the data into OMOP. The other is to use some sort of derivative of i2b2, with a dimensional structure and a single EAV fact table (with a rudimentary conceptual taxonomy).

In making our decision between the two, OMOP is our preferred model for interoperability and tooling. Nevertheless, we have some reservations about the amount of work required to complete the semantic portion of the ETL. In particular, our project team is drawn from one specialty, while our data set spans several adjacent departments too - so we are concerned that mapping concepts from (predominantly uncoded) source data will be unfeasibly time-consuming.

Are there any tricks we can use to gradually build this semantic work (for adjacent specialties) into our ETL?

From what I understand, each OMOP domain contains columns for the source values of each data item — but the source concept still has to be attached to a recognised (if not standard) concept. Is there a way we could tweak this to accept our local dictionaries (in the interim)? Along similar lines, if we were to default to exploiting the EAV structure of the observation table for facts we struggled to initially pass into the other domains - would that be heretical?

Or from a different angle, can anyone share experiences of mapping (and then verifying) large datasets by hand with Usagi?

We have a paper under consideration on work we did mapping some of the Framingham study research data sets to OMOP using Usagi as a proof-of-concept for integrating clinical research data sets with EHR data using OMOP. It required some manual adjudication of the mapping, but worked quite well overall.

This is somewhat different from what you’re suggesting, though - Usagi associated each of the uncoded Framingham variables with an OMOP standard concept. If I’m reading you correctly, you’re proposing adding a custom vocabulary to OMOP (“Is there a way we could tweak this to accept our local dictionaries?”). I don’t see any technical barriers to doing it this way, but it raises the question of “why bother doing it in OMOP in the first place?” Why not just create a custom, localized, stripped-down data model, since anyone using the CDW would have to learn the localized vocabulary anyway?

If I'm understanding you correctly - and please correct me if I'm not! - it seems that it might make sense to run Usagi against your local concepts and then do a spot-check of the mappings to assess how much work will need to be done to make it something in which you have a reasonable degree of confidence. To be a bit more concrete, you could map all N of your concepts with Usagi, do a manual review of 100 of them, and if 95 of them have mappings that you're satisfied with, you'd then know that you'll probably only have about 0.05 × N to manually reassign.
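For illustration, a minimal sketch of that spot-check arithmetic might look like the following; the record shape and the `is_acceptable` helper are made up for the example, standing in for a human reviewer's judgement:

```python
import random

# A rough sketch of the spot-check idea above: auto-map everything with Usagi,
# manually review a random sample, and extrapolate the remaining workload.

def is_acceptable(mapping):
    # Placeholder for the manual adjudication decision on one sampled mapping.
    return mapping.get("reviewer_ok", False)

def estimate_manual_work(mappings, sample_size=100, seed=42):
    """Estimate how many of N auto-mapped concepts will likely need manual reassignment."""
    random.seed(seed)
    sample = random.sample(mappings, min(sample_size, len(mappings)))
    rejected = sum(1 for m in sample if not is_acceptable(m))
    error_rate = rejected / len(sample)        # e.g. 5 rejected out of 100 -> 0.05
    return round(error_rate * len(mappings))   # roughly 0.05 * N left to fix by hand
```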

Evan, I really appreciate your taking the time to respond - many thanks. I will keep an eye out for your paper!

I completely take your point on the first question ('why bother?'). For just that reason, our medium-term goal would be to map the entire dataset to OMOP standard concepts.

I suppose the complicating factor is that for a relatively short window (the next 2 months or so) we have much greater access to software devs than we otherwise ever would. As a result, I am trying to understand whether the concept-mapping will be the rate-limiting factor - or whether that semantic work can be filled in after the bulk of the software development is done. Is there anything you would suggest?

Your idea re: Usagi is a good one. Thank you. We have had some similar thoughts floating around - but doing that in a systematic way is smart.

We have a similar issue – we like to separate out the vocabulary and table mapping from the data tables, and access/improve the mappings as needed. We built a data model that does this, available as a preprint while we are waiting for review (link below).

We designed it as a “waypoint” in the ETL process to go from raw data to either OMOP or Sentinel. If it would help, we are happy to talk through it with you. We use it internally, and while I am sure it doesn’t solve every use case, it may well fit yours.


We (Odysseus team) have this experience.

This approach works well if you are comfortable knowing that about 5% of your mappings are wrong.
If you need to have them 100% correctly mapped, you still have to look through the whole data set manually to find that 5% and fix them.
So if you need mappings you can rely on, you may empirically define the lowest match_score at which the mappings are still correct, and take only the mappings with that match_score or higher.
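For illustration, a minimal sketch of that thresholding step, assuming a Usagi CSV export with a "matchScore" column (the column and file names may differ in your Usagi version):

```python
import pandas as pd

# Keep only mappings at or above a match_score you have verified by hand,
# and route the rest to manual review.

THRESHOLD = 0.8  # found empirically by reviewing where mappings start to go wrong

usagi = pd.read_csv("usagi_export.csv")
auto_accept = usagi[usagi["matchScore"] >= THRESHOLD]
needs_review = usagi[usagi["matchScore"] < THRESHOLD]

print(f"auto-accepted: {len(auto_accept)}, flagged for manual review: {len(needs_review)}")
```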

Another thing is that you need to correctly define the filters you will use in Usagi, i.e. domain, concept_class_id, vocabulary_id.
In most cases the source itself tells you which domain the concepts should belong to, and once you know the standard vocabularies used with that domain you can play around with the concept_class_id and vocabulary_id filters.

Sometimes Usagi gives totally unpredictable results, for example in the Drug domain, so additional algorithms should be developed.

@TCKeen, let me know if you want to chat about this.

Not only TL;DR but also YMMV

Colorado has taken a different philosophy. @MPhilofsky is our local expert who can provide more details.

We move very quickly to getting source data at least into source_values so that we can work in the OMOP ecosystem. We create custom source_concept_ids with only very light vocabulary harmonization so that we do not lose the original data granularity. We do harmonize source_concept_ids that are obviously like-with-like but not a lot. At this step, there are no existing mappings from our custom source concepts to standard concepts so we have a ton of standard concept_ids = 0 (unmapped). But we at least can start using the OMOP structures, albeit using SOURCE concepts, even though these are not (yet) aligned with STANDARD concepts used by the OHDSI community. This is my answer to @esholle’s question about “why bother”.
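For illustration only, a rough sketch of that "source concepts now, standard mapping later" step might look like the following; the file, column names, and vocabulary_id are invented, and the 2-billion-and-up concept_id range follows the usual community convention for local concepts:

```python
import pandas as pd

LOCAL_ID_OFFSET = 2_000_000_000

local_terms = pd.read_csv("local_dictionary.csv")   # e.g. columns: code, description

# Custom source concepts carry the original granularity but stay non-standard.
custom_concepts = pd.DataFrame({
    "concept_id": LOCAL_ID_OFFSET + local_terms.index,
    "concept_name": local_terms["description"],
    "domain_id": "Observation",        # or whichever domain the source implies
    "vocabulary_id": "LOCAL_DICT",     # hypothetical local vocabulary_id
    "concept_class_id": "Undefined",
    "standard_concept": None,          # non-standard: these live on the source side
    "concept_code": local_terms["code"],
})

# In the event tables, *_source_concept_id points at the custom concept while
# the standard *_concept_id stays 0 until mapping catches up.
observation_rows = pd.DataFrame({
    "observation_source_value": local_terms["code"],
    "observation_source_concept_id": custom_concepts["concept_id"],
    "observation_concept_id": 0,
})
```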

We have no qualms violating the OMOP missive of “thou shalt not have any concept hierarchies (concept relationships) with non-standard concepts” (seek every opportunity to annoy Christian…! :wink: ) The easiest example is care sites. We have custom hierarchies of our custom care sites that allow us to group care sites by region (north region, central region, south region) or by specialities (cardiology clinics, ortho clinics, pediatric clinics).

We then begin the slogging work of mapping our custom source_concept_ids to concept_ids either based on frequencies or analytic use cases.

We had to do some trickery with USAGI to make this work but I don’t know the details. Again, @MPhilofsky knows all…

Interesting. Would you be willing to share your care site work? I know the framework/structure is something most medical centers would want to emulate. Maybe there is some way to marry standard practice and custom codes.

Not good: You cannot contribute to community Network Studies. You can only sit in your own stew.

This heresy alone and the previous statement: You know you will be frying a couple of eternities in OMOP hell, @mgkahn!!

The whole point is to be done with this data munging by the time analytics starts. And if you are in a standardized tool or method like ATLAS or CohortMethod you cannot start custom mapping. It’s like attaching the wheels when you are already sitting in the car and have accelerated to 30 mph. Get on with it! :smile:

That’s actually not a problem. It’s just impossible to make that standard. It’s always local. And if, as I suppose, you are using care_site_id for the hierarchy, it would be totally fine. Are you?

So this is where I push back on the scent of orthodoxy…

Most of our customers do not seek to engage in OHDSI community studies. They only have a local need. There is a “price” to pay in yet another layer of mapping from local source_concept_ids into OHDSI-network concept_ids (we have Epic Chronicles -> Epic Clarity -> Epic Caboodle -> OMOP). I know there are very cogent arguments that the OMOP terminology mapping is analytically helpful, especially for the large number of analytically mindless versions of “I don’t know”. For those who have a multi-institutional study in mind, they can leverage the standard side of the model. Those seeking only local data can construct queries in source_concept_id terms that make sense, admittedly only locally. Those who wake up one day seeking to move from local to network studies can convert their query from local-only source_concept_ids to OHDSI network standard concept_ids in minutes and they can see the impact of moving from local codes to network codes.

My objective is to get people thinking in the OMOP framework as quickly as possible and to leverage use cases to drive our resource-constrained terminology mapping efforts. To wait until mapping is “all done” (which really equals “never” given the enormous long tails of idiosyncratic, or idiotic, codes in our source systems) is to delay ANY use of the model. Folks can use the current tools with the current high-value / high-frequency mapped terminology and if there are critical terms missing, we can elevate them in priority.

To be clear, the model I am describing has not yet been fully adopted. But it is the approach I am pushing my group to adopt over the next few months.


@mgkahn I can confirm that this is what we do too. We only use source vocabularies because it is easier for us, and our clients, to think in terms of the vocabularies they are used to.


I expected no less. :smile:

All understood. But here is the thing: The value of a network is proportional to the square of the number of participating nodes (Metcalfe’s law). And by “participating” I mean being able to work together on a problem. And for that you need to be on the same harmonized model, otherwise we are back in (i) data munging, (ii) non-reproducibility, and (iii) “it just won’t happen” country.

But here is the thing that gets us out of the religious wars: How can we help? What would be the #1 (or #2) thing you’d require to get over the hurdle? You can still keep your internal jobs, no problem.

Same whip being cracked, Mark. :slight_smile:

Again, I need to defer to @MPhilofsky, the person here who looks at very large spreadsheets of unique local codes that need mapping every day, to respond to how to move beyond just USAGI for mapping support. There is so much local context (how is this term used in clinical practice/workflows? how do our investigators want to use this term in their research queries? when is it “safe” to declare this term semantically close enough to be mapped to an existing source_concept_id or concept_id, letting the source_value maintain the ground-truth source value and a new keystore table we are implementing provide the way back to the exact tuple in the source if need be?). I honestly do not know how she keeps all of these competing features of mapping in her head.

I can also say that changes to valid codes across vocabulary refreshes, especially codes that become invalid and disappear, are an issue we haven’t completely come to grips with. Glad to see the conversations about persisting deprecated concept_ids (if I’m following those conversations correctly).

I understand, and I accept the whipping.

But I will make the point that there is another way. OMOP is simply a “view” on the raw data. So are Sentinel, and PCORnet. These views are only useful for making analyses easier – by trying to express research ideas in a uniform manner. If you think about it this way, then there is a more generalized approach where you simply organize the data properly one time, and then you can “view” the data in the OMOP, Sentinel, PCORnet, etc manner for analyses. In other words, there is a lot of value to separating the data model from the analysis model – which are conflated in OMOP, Sentinel, and PCORnet. By doing this, we avoid the religious wars about which network to participate in, and it widens the landscape for possible collaborations. That is true harmonization, and this is why we don’t limit ourselves to OMOP. And why we created the data model I just described. And why we build software that is data model agnostic.


@Mark_Danese:

I understand you are trying to be multilingual, just like the automated voice when you call your health insurance (“para Español, oprima dos”). But really, all the relevant communication happens in English around here. Because it’s hard enough already, and to get everything bilingual, or trilingual in your case, is an enormous burden, either for the conversions, or for the tools, or for the analyst who has to know all the idiosyncrasies.

But this is actually not about OMOP vs the others. It is about speaking proper OMOP, to be able to communicate with the others, which requires saying things in a commonly understood vocabulary. Only then are the queries truly interoperable.

"And why we build software that is data model agnostic.” It seems to me that something like a relational database management system would be truly agnostic, but the Generalized Data Model (GDM) is yet another data model that you have built optimized for ETL. When someone in the future decides to create an Improved Generalized Data Model that is not backwards compatible with GDM (because it uses a network or OO architecture instead of hierarchical), there will be more religious wars. And at some point you want to use the data to do something, and with 100,000,000s patients each with 10,000s of variables, the physical clustering of the data and the indices do matter and you have to make a decision how to actually store it.

@hripcsa You are correct – there is always a difference between the logical model (i.e., how the parts relate to one another) and the physical model (i.e., how the data is stored). The trick in all of this is to figure out when one is moving data around to organize it, and when one is translating the meaning of the data. Moving the data is more backwards compatible. Translating the meaning is less backwards compatible, depending how it is done.

For GDM we optimized it for flexibility and don’t actually specify hierarchical vs. relational. We try and present it both ways, for exactly this reason. And we have played with hierarchical versions of OMOP too, for speed reasons. As you point out, the real issue is how to get things to run on a wide variety of data sources in a reasonable time. The alignment between the data model and the database storage is very important for data with hundreds of millions of people.

I am sorry if I have taken us off track here. I think the discussion was about whether it is ok to do the mapping later and my response was that the OMOP model has a life of its own outside the network and OHDSI tools, and this life doesn’t require all of the mappings. And, in my opinion, that is ok.


I will. Let me tell you a little about our business because it influences our decisions.

Our/University of Colorado’s reason for using the OMOP CDM is a little different than most active forum users’. We use the OMOP CDM as our primary data warehouse and deliver (just starting this phase now) datasets to our customers (providers, managers in the hospital system doing QA/QI, PhD students, professors, and all our data collaborators). We do incremental loads from the source through the pipeline to OMOP every day. We do not do the analytics. If we couldn’t use the OMOP CDM to deliver datasets to our customers, maintaining OMOP would be prohibitive. All these factors give us a different perspective.

We did this mostly in parallel, with the goal of mapping codes first. The Person table is easy, and the Visit table took more work, but only because the source has a different view of a visit than the OHDSI community. It took a couple of days of using Usagi to go from 86% to 99% of drugs administered to a patient being mapped to a standard concept, which was easy because drug names are fairly standardized. Conditions and Procedures are tough because the terminology varies from source names to target concepts. And social history (tobacco, alcohol, drugs) is a beast. We are also working on other data that belongs in the Observation table. This data is mostly unmapped, but also not used in network studies (pain, scoring systems, ventilator settings, etc.) @Christian_Reich.

Organizing the data is a lot of work and very error-prone. As @mgkahn stated, we take very large spreadsheets of local codes and put them into Usagi one domain at a time. Then the output is altered to fit the Concept and Concept Relationship tables (@Christian_Reich help here) - also an error-prone step. Next is uploading them to the necessary tables and testing for accuracy. All of this is very time consuming. Yes, it is a lot of work, but the time savings, especially from using the hierarchies, make the effort worth it, I hope :wink: We haven’t put this into production.
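As a rough illustration of that reshaping step, something like the following could take an approved Usagi export and produce Concept Relationship “Maps to” rows; the export column names (“sourceConceptId”, “conceptId”, “mappingStatus”) and file names are assumptions and may not match your Usagi version exactly:

```python
import pandas as pd

usagi = pd.read_csv("usagi_export.csv")
approved = usagi[usagi["mappingStatus"] == "APPROVED"]

concept_relationship = pd.DataFrame({
    "concept_id_1": approved["sourceConceptId"],   # custom local source concept
    "concept_id_2": approved["conceptId"],         # standard target concept
    "relationship_id": "Maps to",
    "valid_start_date": pd.Timestamp.today().normalize(),
    "valid_end_date": pd.Timestamp("2099-12-31"),
    "invalid_reason": None,
})

concept_relationship.to_csv("concept_relationship_append.csv", index=False)
```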

During the ETL, I would put only the necessary data into the source_value column - not everything, just what is going to be used. I suggest putting in the local code and the name, separated with a delimiter. You will want to get on the standard side of the model as quickly as possible to leverage the vocabulary. The power of the OMOP CDM is in the vocabularies! Leveraging one concept and its hierarchy is much easier than using a string search to find all relevant concepts and then listing out each concept (error-prone).
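To make the hierarchy point concrete, here is a toy sketch of pulling records through one ancestor concept via Concept Ancestor rather than string-searching names; all IDs below are placeholders:

```python
import pandas as pd

concept_ancestor = pd.DataFrame({
    "ancestor_concept_id":   [21600001, 21600001],
    "descendant_concept_id": [19019073, 1112807],
})
drug_exposure = pd.DataFrame({
    "person_id":       [1, 2, 3],
    "drug_concept_id": [19019073, 1112807, 999999],
})

ANCESTOR = 21600001  # the one concept you actually care about

descendants = concept_ancestor.loc[
    concept_ancestor["ancestor_concept_id"] == ANCESTOR, "descendant_concept_id"
]
hits = drug_exposure[drug_exposure["drug_concept_id"].isin(descendants)]
print(hits)  # persons 1 and 2; no string matching, no hand-maintained code list
```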

Put the data where it belongs or the tools won’t work. But if it doesn’t belong in any other domain, it goes into the Observation table and SQL will pull it out.

I’m happy to share or elaborate, @TCKeen!

Yes, we skipped the “ugly” Fact Relationship table and created a Care Site Hierarchy table instead. It’s connected to the OMOP Care Site table via FK. It follows the OMOP Care Site table structure, except it has 6 layers/fields: system, region, site, operational group, facility, and department. Same structure, just expanded. And it’s very similar to, if not exactly the same as, our source system’s representation of care sites.
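For readers trying to picture it, a hypothetical layout of such a table might look like this (all names and values are invented placeholders):

```python
import pandas as pd

# One row per department-level care site, keyed back to OMOP care_site.
care_site_hierarchy = pd.DataFrame([{
    "care_site_id": 101,                     # FK to the OMOP care_site table
    "system": "Example Health System",
    "region": "North",
    "site": "Main Campus",
    "operational_group": "Ambulatory",
    "facility": "Heart and Vascular Center",
    "department": "Cardiology Clinic",
}])
```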

1. Here’s the thing: we are having to create custom, “standard” concepts to represent all of the things/ideas/data that do not have a standard representation from a standard vocabulary (see the sketch after this list). My latest example is pain scores. We have ~650 unique representations for pain in our source. Our customers want all that data to study who has pain, whether people are being asked about their pain, what their pain score is, how they are treated, how effective the treatment is, and so on. Some of these map to standard concepts (0-10 pain score, pain duration, pain characteristics) and others don’t (pain intervention, response to relief measures, different pain scales). So, I have taken the liberty to create custom, standard concepts :open_mouth:. These are concepts that would otherwise be mapped to concept_id = 0. I have also created hierarchies in the Concept Ancestor table for these now-standard concepts to make life easier for the query writers. One concept and all its descendants, or 650 individual concepts? And that 650 is an increasing number. If these things had a representation from a widely accepted vocabulary, I would bring it to the OHDSI vocabulary folks, but they don’t. Pain is all over the news and the literature, but it lacks standardization. However, that doesn’t stop the researchers from wanting to research it or the QA/QI folks from wanting to know what’s happening in our house.

2. This also keeps me awake at night. I am looking forward to a solution that doesn’t involve me remapping codes every time a vocabulary is refreshed.
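As a purely illustrative sketch of the custom pain concepts and hand-built Concept Ancestor rows mentioned in item 1; all IDs and names are invented placeholders in the local (2-billion-and-up) concept_id range:

```python
import pandas as pd

PAIN_ROOT = 2_000_050_000  # hypothetical local "Pain assessment" parent concept

pain_children = pd.DataFrame({
    "concept_id": [2_000_050_001, 2_000_050_002, 2_000_050_003],
    "concept_name": ["Pain intervention", "Response to relief measures", "Local pain scale"],
    "standard_concept": "S",   # locally declared "standard", not an OHDSI standard
})

# One ancestor row per descendant, plus the usual self-referencing row for the
# root, lets a query grab the whole family through a single concept_id.
concept_ancestor_rows = pd.concat([
    pd.DataFrame({
        "ancestor_concept_id": PAIN_ROOT,
        "descendant_concept_id": pain_children["concept_id"],
        "min_levels_of_separation": 1,
        "max_levels_of_separation": 1,
    }),
    pd.DataFrame([{
        "ancestor_concept_id": PAIN_ROOT,
        "descendant_concept_id": PAIN_ROOT,
        "min_levels_of_separation": 0,
        "max_levels_of_separation": 0,
    }]),
], ignore_index=True)
```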
