Dear Arthur,
You raised a few important additional questions here.
TriNetX and Truveta - both commercial, VC-funded organizations - are investing heavily in converting EHR data to build their platforms/networks. And for both, there is probably a decent cost and effort associated with each EHR conversion. I am sure they have developed an effective internal (not open source) framework and ETL accelerators that help them perform those ETLs in the most efficient manner. I do not believe either of those organizations uses OMOP as their target ETL schema; rather, they use something they developed internally that is proprietary to them.
Like both of these, Odysseus is also in the business of building analytics platforms (ARACHNE, Prometheus, ATLAS) and distributed networks. Just recently, we announced the release of a new version of our distributed analytics platform (Prometheus), and we are actively involved in building several federated data networks for both commercial use (Odysseus as well as customer-specific) and open science research (EHDEN, IMI Pioneer, OHDSI and others). The difference is that Odysseus adopted the open-source OMOP standard as our target analytics and ETL schema. Internally, we also developed an effective ETL and DQ platform/framework (Argo and Statius) that allows us to re-use ETL code across projects and perform conversions into OMOP effectively and with high quality. Yet, in most cases it is not an out-of-the-box, plug-and-play affair - each ETL is still quite a bit of an undertaking, for the reasons I described above.
Why does it matter? There are a couple of important points to remember:
- OMOP is a way to standardize data for analytics. That requires a harmonization process that transforms raw (i.e., non-standardized) data, which is the whole purpose of the ETL; see the mapping sketch after this list. If the ETL were highly re-usable, that would mean the source data is already both clean (data quality) and standardized, which - unfortunately - is not the case today.
- The first and most important scoping question of any ETL project is how much data will be converted. The less data being converted, the easier it is to re-use the ETL code, but that also limits the number of analytics use cases. The more data, the less re-usable and more costly the ETL becomes, but the number of use cases expands. OMOP allows for a very broad set of use cases, so scoping the initial conversion carefully is very important or the budget will quickly be in trouble.
- Over the years, Odysseus developed an internal, use-case-driven scoping approach that we offer to our customers and data partners. It allows us to be very specific about what data will be converted, and thus to reduce ETL cost and achieve better ETL re-use. We call these "tiers": "essential", "standard" and "integrated".
- Similarly, I believe @clairblacketer (Clair), who leads the OHDSI Data Quality WG, is helping to build the OHDSI network and identify a more targeted approach to the minimum required data that should be converted to support OHDSI research. That would also help with ETL scoping and re-use.
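To make the harmonization point concrete, here is a minimal sketch in Python of what sits at the heart of every OMOP ETL: translating source vocabulary codes into OMOP standard concept IDs and reshaping a raw row into a CDM record. In a real conversion the mappings come from the OMOP vocabulary tables (CONCEPT, CONCEPT_RELATIONSHIP); the inline lookup and the raw field names below are hypothetical stand-ins, not our actual Argo/Statius code.

```python
# Minimal sketch of the harmonization step at the heart of an OMOP ETL.
# The dict stands in for the OMOP vocabulary tables (CONCEPT /
# CONCEPT_RELATIONSHIP); entries below are illustrative only.

# (source vocabulary, source code) -> OMOP standard concept_id
SOURCE_TO_STANDARD = {
    ("ICD10CM", "E11.9"): 201826,  # Type 2 diabetes mellitus (SNOMED)
    ("ICD10CM", "I10"):   320128,  # Essential hypertension (SNOMED)
}

def to_condition_occurrence(raw_row: dict) -> dict:
    """Reshape one raw diagnosis row into an OMOP CONDITION_OCCURRENCE record.

    The raw field names are hypothetical; every source system names them
    differently, which is exactly why the extract side is hard to re-use.
    """
    key = (raw_row["code_system"], raw_row["dx_code"])
    return {
        "person_id": raw_row["patient_id"],
        "condition_concept_id": SOURCE_TO_STANDARD.get(key, 0),  # 0 = no match
        "condition_start_date": raw_row["dx_date"],
        "condition_source_value": raw_row["dx_code"],
    }

print(to_condition_occurrence(
    {"patient_id": 42, "code_system": "ICD10CM",
     "dx_code": "E11.9", "dx_date": "2021-03-15"}))
```

Even this toy example only works once someone has built and validated the code mappings and knows where the raw fields live - that is where the per-source effort goes.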
I didn't say it is impossible to develop a re-usable ETL. I said that today you will not find a highly re-usable open-source ETL code base, and I listed the reasons why. To develop a highly re-usable ETL, the source (raw) data needs to be better standardized, which is not the case today. However, if you are re-using Epic-to-OMOP ETL code on another Epic site - yes, of course there will be decent re-use. At the same time, trying to re-use an Epic ETL on eClinicalWorks or Cerner data is not going to work well, because the source schemas are very different.
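A hedged sketch of that split, again in Python: the OMOP-facing load logic can be shared across sources, but each vendor needs its own extractor because the source schemas differ. All table and field names here are hypothetical, purely to illustrate the pattern.

```python
# Sketch: a shared OMOP loader fed by vendor-specific extractors.
# All table and field names are hypothetical.
from typing import Iterable, Protocol

class Extractor(Protocol):
    def diagnoses(self) -> Iterable[dict]:
        """Yield rows in a common intermediate shape:
        {patient_id, dx_code, code_system, dx_date}."""
        ...

class EpicExtractor:
    """Reads Epic-style source tables (hypothetical names)."""
    def diagnoses(self):
        # e.g. SELECT pat_id, dx_icd10, contact_date FROM pat_enc_dx ...
        yield {"patient_id": 42, "dx_code": "E11.9",
               "code_system": "ICD10CM", "dx_date": "2021-03-15"}

class CernerExtractor:
    """Reads Cerner-style source tables (hypothetical names)."""
    def diagnoses(self):
        # e.g. SELECT person_id, source_identifier, diag_dt_tm FROM diagnosis ...
        yield {"patient_id": 42, "dx_code": "E11.9",
               "code_system": "ICD10CM", "dx_date": "2021-03-15"}

def load_to_omop(extractor: Extractor) -> None:
    """The shared, re-usable half: intermediate shape -> OMOP CDM tables."""
    for row in extractor.diagnoses():
        print("CONDITION_OCCURRENCE <-", row)  # stand-in for a real CDM insert

# Only the extractor changes per vendor; the loader is reused as-is.
load_to_omop(EpicExtractor())
load_to_omop(CernerExtractor())
```

The shared half is what ETL frameworks try to capture; the extractor half is what keeps each new conversion from being free.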
My personal prediction is that at some point in the near future, many EHR system vendors will recognize OMOP as the de-facto data analytics schema and will simply develop an out-of-the-box plugin that generates OMOP directly from their systems.
btw, @MPhilofsky (Melanie) is running an EHR-to-OMOP WG; that is probably something that could also help you in your undertaking.