Open source CDM ETL systems

arthur.goldberg · July 31, 2023, 7:37pm

Hi OHDSI

One of the biggest challenges a medical center faces when creating an OMOP CDM is building and regularly running an ETL of the CDM’s inputs, which often comprise an EHR and multiple other sources such as radiology, lab systems, etc.

Is there a list of high quality CDM ETL systems that have been open sourced, and could be a good starting point?

Regards
Arthur

Frank · July 31, 2023, 8:35pm

Here is one example:

https://github.com/OHDSI/Perseus

jmethot · July 31, 2023, 9:40pm

If the EHR is Epic, @roger.carlson has generously shared his ETL code from Clarity tables to OMOP CDM. Epic prohibits public exposure of the Clarity schema, so that code is shared via Epic UserWeb, which is available to staff of Epic customers. Roger’s latest version is here: userweb.epic.com/Thread/119330/OMOP-The-Spectrum-Code-Reboot/.

arthur.goldberg · August 1, 2023, 1:15pm

Thank you @Frank and @jmethot . These are very helpful.
Arthur

gregk · August 1, 2023, 5:33pm

hi @arthur.goldberg

Unfortunately, you will not find a lot of open source code for EHR to OMOP conversions. There are multiple reasons for that:

EHR vendors restrict publication of their internal schemas (Epic, for example, as John mentioned).
Epic is a highly customizable system, so actual re-use is pretty low. An Epic to OMOP conversion is almost a new project every time.
Most EHR conversions are not just a single EHR to OMOP; these projects typically involve multiple other data sets (claims, registry etc…).
In most EHR conversions, the raw data is not sourced directly from the EHR itself but rather from some intermediate internal DW schema (for the reasons above), which varies from organization to organization. So, publishing something like this would not even make sense.

Hopefully, with FHIR to OMOP initiative a more re-usable approach to EHR to OMOP ETL is coming, but it is not here yet…

Odysseus has done dozens of EHR to OMOP conversions (Epic, Cerner, GE, Allscripts, eCW etc…) and has a lot of experience in that space. If you need any help, please feel free to reach out (gregory.klebanov@odysseusinc.com).

btw, Perseus is not an “EHR to OMOP” codebase - it is a nice graphical ETL tool (like Informatica, Talend etc…) that can be used to create OMOP ETL projects. But if you are looking for an actual EHR to OMOP example, I am not sure how much help it would be for you; it is just a shell.

arthur.goldberg · August 3, 2023, 5:28pm

Thanks for your extensive reply @gregk .

Your points are well taken, comprehensive and highly credible, given your position as CEO at Odysseus and Odysseus’ experience and expertise.

But your overall conclusion that it’s hard (perhaps impossible) to build reusable ETL code and data for EHR to OMOP conversions makes me wonder how firms that do many conversions of EHR and data from other sources, like image repositories, to another clinical data warehouse manage to achieve economies of scale.

For example, companies like TriNetX and Truveta each have dozens of deployments. While I know nothing about their internals, if I were one of their CEOs or CTOs I would devote considerable effort to minimizing the amount of customization needed for each new deployment. Don’t they face all of the challenges you list above that make it hard to build reusable conversion tools? Perhaps people who know about these firms will say that despite their efforts they also need to devote a large amount of expert effort to customize each deployment.

Regards
Arthur

Frank · August 3, 2023, 6:28pm

I suppose we would have to define what we mean by an actual example. The ETL Synthea project is an actual example of an ETL conversion between a native person-centric data source and the OMOP CDM. The ETL Lambda Builder open source code is another example; it was developed as an abstracted approach to doing CDM conversions that was configurable for different data sources. This was done largely for the reasons @arthur.goldberg enumerated, which I found to be very valid. Its use is very limited because of the technical decisions that were made to run it in our infrastructure at the time, over a decade ago. Perseus is more of an open framework and tool, but it has been built by an organization that has done many CDM conversions as well.
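To make the idea of a conversion that is "configurable for different data sources" concrete, here is a minimal sketch in Python. It is not how ETL Lambda Builder or Perseus actually work; the site names, source column names, and the `to_person` helper are all hypothetical. Only the OMOP person field names and the standard gender concept IDs (8507 for male, 8532 for female, 0 for unmapped) come from the CDM itself.

```python
from datetime import date

# Hypothetical per-source configuration: which source column feeds which
# OMOP person field, and how each site's local gender codes map to
# standard OMOP gender concepts (8507 = MALE, 8532 = FEMALE).
SOURCE_CONFIGS = {
    "site_a": {
        "columns": {"person_source_value": "mrn", "birth_date": "dob", "gender": "sex"},
        "gender_map": {"M": 8507, "F": 8532},
    },
    "site_b": {
        "columns": {"person_source_value": "patient_id", "birth_date": "birth_dt", "gender": "gender_cd"},
        "gender_map": {"1": 8507, "2": 8532},
    },
}


def to_person(row: dict, config: dict, person_id: int) -> dict:
    """Map one source row to a subset of OMOP person fields using a config."""
    cols = config["columns"]
    birth = date.fromisoformat(row[cols["birth_date"]])
    gender_raw = row[cols["gender"]]
    return {
        "person_id": person_id,
        # Unmapped local codes fall back to concept 0, the OMOP convention.
        "gender_concept_id": config["gender_map"].get(gender_raw, 0),
        "year_of_birth": birth.year,
        "month_of_birth": birth.month,
        "day_of_birth": birth.day,
        "person_source_value": row[cols["person_source_value"]],
        "gender_source_value": gender_raw,
    }


# The same function handles two differently shaped (made-up) sources.
site_a_row = {"mrn": "A-001", "dob": "1980-02-14", "sex": "F"}
site_b_row = {"patient_id": "77", "birth_dt": "1975-11-03", "gender_cd": "1"}

person_a = to_person(site_a_row, SOURCE_CONFIGS["site_a"], person_id=1)
person_b = to_person(site_b_row, SOURCE_CONFIGS["site_b"], person_id=2)
```

The transformation logic stays in one place and only the configuration changes per source, which is the kind of re-use being discussed. In a real conversion the configuration grows quickly, which is part of why flexibility matters so much.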

No framework is perfect, and every environment and data set has its own unique features, so flexibility is paramount and performance is critical for the efficiency of the data operations pipelines we are all trying to establish. Ultimately, I think the data owners and their expertise and knowledge of their data sources are more valuable than any framework or consultancy, and the choice of tools for your conversion should be focused on an approach that can evolve as the data landscape changes, as it does, hourly.

gregk · August 4, 2023, 10:21am

Dear Arthur,

You raised a few important additional questions here.

TriNetX and Truveta - both commercial, VC funded organizations - are investing heavily in converting EHR data to build their platform/network. And for both, there is probably a decent cost and effort associated with each EHR conversion. I am sure that they have developed a good and effective internal (not open source) framework and ETL accelerators that help them perform those ETLs in the most effective manner. I do not believe either of those organizations uses OMOP as its target ETL schema, but rather something that they developed internally and that is proprietary to them.

Similarly to both of these, Odysseus is also in the business of building analytics platforms (ARACHNE, Prometheus, ATLAS) and distributed networks. Just recently, we announced the release of a new version of the distributed analytics platform (Prometheus), and we are proactively involved in building several federated data networks for both commercial use (Odysseus as well as customer specific) and open science research (EHDEN, IMI Pioneer, OHDSI and others). The difference is that Odysseus adopted open source OMOP as our target analytics standard and ETL schema. Internally, we also developed an effective ETL and DQ platform/framework (Argo and Statius) that allows us to re-use the ETL code across projects and do conversions into OMOP in an effective and quality manner. Yet, in most cases it is not an out-of-the-box plug and use - it is still quite a bit of an undertaking for each ETL due to the reasons I described above.

Why does it matter? There are a couple of important points to remember:

OMOP is a way to standardize analytics data. For that, there needs to be a data harmonization process from raw (aka non-standardized) data, which is the whole purpose of the ETL. If an ETL were highly re-usable, that would mean the source data is already both clean (data quality) and standardized, which - unfortunately - is not the case today.
The first and most important scope question of any ETL project is how much data will be converted. The less data is converted, the easier it is to re-use the ETL code, but that also limits the number of analytics use cases. The more data, the less re-usable and more costly the ETL, but the number of use cases expands. OMOP allows for a very broad set of use cases, so the scoping of the initial conversion is very important or the budget will quickly be in trouble.
Over the years, Odysseus developed an internal use-case driven scoping approach that we offer to our customers and data partners. It allows us to be very specific about what data will be converted and thus achieve an ETL cost reduction and better ETL re-use. We call them “tiers” - “essential”, “standard”, “integrated”.
Similarly, I believe that @clairblacketer (Clair), who is leading the OHDSI Data Quality WG, is helping to build the OHDSI network and identify a more targeted approach to what minimum required data should be converted to support OHDSI research. That would also help with higher ETL re-use and scoping.

I didn’t say it is impossible to develop a re-usable ETL. I said that today you will not find a highly re-usable open source ETL code base and listed the reasons why. To develop a highly re-usable ETL, the source (raw) data needs to be better standardized - which is not the case today. However, if you are re-using Epic to OMOP ETL code for another Epic site - yes, of course there will be decent re-use. At the same time, trying to re-use an Epic ETL on eClinicalWorks or Cerner is not going to work too well because the source schemas are very different.

My personal prediction is that at some point in the near future many EHR system vendors will recognize OMOP as the de facto data analytics schema and will simply develop an out-of-the-box plugin that allows OMOP generation directly from their systems.

btw, @MPhilofsky (Melanie) is running an EHR to OMOP WG; that is probably something that could also help you in your undertaking.

MPhilofsky · August 7, 2023, 4:48pm

Hello @arthur.goldberg and welcome to the OHDSI journey

The Healthcare Systems Interest Group (HSIG) supports healthcare systems on their OHDSI journey. The HSIG is located here on OHDSI’s MS Teams environment. We meet every other Monday at 10am Eastern Time.

Thomas_White · August 8, 2023, 1:15pm

Truveta has an internal data model that is largely based upon FHIR. Each data contributor is responsible for building their own ETL to support daily incremental updates to that target model. Truveta then does additional enrichment, linking, and de-identification. Several of the major data contributors have been working for over a year to convince Truveta to also support an OMOP view of the data (either for data submission or for end-user analytics).

funaesthesia · August 11, 2023, 9:09am

We have worked a lot with Python and Django for ETL systems. Django is used as the ORM layer and can connect to Oracle, PostgreSQL, MySQL, MariaDB, and SQLite databases natively, making it ideal if you have to integrate different systems in a hospital infrastructure like ours. From what I have seen so far, Python is badly underused for such tools. A lot of them are written in R or Java and have a steep learning curve.

I guess a good way to start would be to bring together the community and develop a tool written in Python using data frame libraries (pandas, polars, etc.). One could develop a frontend down the road.
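As a rough illustration of what a data frame based step might look like, here is a minimal pandas sketch that maps source diagnosis rows to a few OMOP condition_occurrence fields. It is only a sketch under stated assumptions: the source table, its column names, and the tiny `code_to_concept` lookup are made up for the example. In a real ETL the lookup would come from the OMOP vocabulary tables, and the concept IDs shown here should be verified against them.

```python
import pandas as pd

# Made-up source extract: one row per recorded diagnosis.
diagnoses = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "icd10": ["E11.9", "I10", "Z99.999"],
    "dx_date": ["2023-01-05", "2023-02-10", "2023-03-15"],
})

# Stand-in for a source-code-to-standard-concept lookup. The IDs are
# illustrative and should be checked against the OMOP vocabulary.
code_to_concept = pd.DataFrame({
    "icd10": ["E11.9", "I10"],
    "condition_concept_id": [201826, 320128],
})

# Left join keeps every source row, including codes with no mapping.
merged = diagnoses.merge(code_to_concept, on="icd10", how="left")

condition_occurrence = pd.DataFrame({
    "condition_occurrence_id": list(range(1, len(merged) + 1)),
    "person_id": merged["patient_id"],
    # Unmapped codes get concept 0, following the OMOP convention.
    "condition_concept_id": merged["condition_concept_id"].fillna(0).astype(int),
    "condition_start_date": pd.to_datetime(merged["dx_date"]).dt.date,
    # Keep the original code so unmapped rows can be reviewed later.
    "condition_source_value": merged["icd10"],
})
```

The left join is deliberate: an inner join would silently drop rows whose codes have no mapping, whereas keeping them with concept 0 makes mapping gaps visible in data quality checks.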

pdejaege00 · September 12, 2023, 1:44am

A CLI ETL tool that works on Google Cloud: https://github.com/RADar-AZDelta/Rabbit-in-a-Blender