
Lessons learned and advice needed for ETL (of Medicaid (or other) data)

At NLM, we got funded to convert data in CMS VRDC CCW (see CMS Virtual Research Data Center (VRDC) | ResDAC Data Dictionary) into OMOP.

We are doing it in SQL (Spark dialect; Spark inside Databricks).

We are trying to learn from other ETLs as much as we can.
And also from the ETL course on EHDEN Academy (the tutorial's GitHub repo is available there).

But we could still use some advice from folks who did their “in house” or any other ETL into OMOP.

Our input data is very, very large, so we need a robust ETL. We are struggling with correctly inferring visits from claims alone (claim headers). Given our platform, I don’t think we can run ETL-CDMBuilder. Well, I am not sure we fully understand how that software works (it is .NET; advice welcomed).
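For context, our current thinking for visit inference is a gaps-and-islands collapse of claim date ranges. A minimal Spark SQL sketch, where table and column names (`claim_header`, `bene_id`, `clm_from_dt`, `clm_thru_dt`) are placeholders rather than the actual CCW names:

```sql
-- Collapse overlapping/adjacent claim headers per beneficiary into visits.
WITH ordered AS (
  SELECT bene_id, clm_from_dt, clm_thru_dt,
         MAX(clm_thru_dt) OVER (
           PARTITION BY bene_id
           ORDER BY clm_from_dt
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
         ) AS prev_max_thru          -- latest end date seen so far
  FROM claim_header
),
flagged AS (
  SELECT *,
         -- a new visit starts when this claim begins after the prior
         -- claims' coverage (allowing a 1-day adjacency gap)
         CASE WHEN prev_max_thru IS NULL
                OR clm_from_dt > DATE_ADD(prev_max_thru, 1)
              THEN 1 ELSE 0 END AS new_visit_flag
  FROM flagged_source
),
grouped AS (
  SELECT *,
         SUM(new_visit_flag) OVER (
           PARTITION BY bene_id ORDER BY clm_from_dt
         ) AS visit_grp              -- running count = visit group id
  FROM flagged
)
SELECT bene_id,
       MIN(clm_from_dt) AS visit_start_date,
       MAX(clm_thru_dt) AS visit_end_date
FROM grouped
GROUP BY bene_id, visit_grp;
```

(Here `flagged_source` should read `ordered`; the pattern is the standard gaps-and-islands idiom.) Whether adjacent-day claims should merge into one visit is a convention decision we have not settled.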

ETL advice we used so far:
For example, for Truven CCAE, we only have some artefacts (not the COMPLETE code), and they are two years old. See ETL-CDMBuilder/man at master · OHDSI/ETL-CDMBuilder · GitHub
(Also, how do we know which to use, TRUVEN_CCAE_MDCR or

We know about this Korean ETL (linking just one example table, not the whole repo): ETL---Korean-NSC/040.Person.sql at master · OHDSI/ETL---Korean-NSC · GitHub

For UKBB, only the mapping will be public, I think (not the actual SQL code of the ETL).

Relevant forum posts are at Search results for 'etl' - OHDSI Forums

I will also reach out to folks who I know have CCAE and probably have some version of their ETL. Note that we want friendly advice and pointers to public resources, not a “quote from a vendor” assuring us that, for a price, they can help.

Cannot provide ETL code, but we have done numerous ETLs on Redshift. All our ETLs are written in SQL; we then use a procedural language to call the SQL scripts and test for errors. I suggest that for development you create a small, representative one-to-five-percent sample of your source data set; it will make development and testing much easier. Creating the sample usually involves randomly selecting a sample of patients and then getting all the records for those patients.
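The sampling approach above can be sketched like this (table names are illustrative, and the hash-based selection is just one deterministic way to take roughly 1% of patients):

```sql
-- Pick ~1% of patients deterministically, then pull ALL their records.
CREATE TABLE dev.sample_persons AS
SELECT person_source_id
FROM src.person_src
WHERE ABS(HASH(person_source_id)) % 100 = 0;

CREATE TABLE dev.sample_claims AS
SELECT c.*
FROM src.claims c
JOIN dev.sample_persons s
  ON c.person_source_id = s.person_source_id;
```

A hash-based sample has the advantage over `RAND()` that reruns select the same patients, so development results are reproducible.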
Other suggestions: you may try using a single staging table for medical events; I think this is similar to the STEM table in White Rabbit. It is discussed here: https://www.ohdsi.org/2019-us-symposium-showcase-55, with some code samples in sandbox/CDMv5 at master · OHDSI/sandbox · GitHub. Another aid to simplify vocabulary mapping is to create a ‘source to concept’ table for the vocabularies of interest. This custom STCM table has the source concept id, the ‘mapped to’ concept id, and the ‘value as concept id’ all in a single select. Same git directory as above.
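A hedged sketch of such a single-select STCM lookup (see the linked sandbox directory for the real code; the table name and vocabulary filter here are illustrative):

```sql
-- One row per source code: source concept, 'Maps to' target,
-- and 'Maps to value' concept, resolved in a single select.
CREATE TABLE stcm_custom AS
SELECT sc.concept_code AS source_code,
       sc.concept_id   AS source_concept_id,
       tc.concept_id   AS target_concept_id,
       tc.domain_id    AS target_domain_id,
       vc.concept_id   AS value_as_concept_id
FROM concept sc
LEFT JOIN concept_relationship cr
       ON cr.concept_id_1 = sc.concept_id
      AND cr.relationship_id = 'Maps to'
      AND cr.invalid_reason IS NULL
LEFT JOIN concept tc
       ON tc.concept_id = cr.concept_id_2
LEFT JOIN concept_relationship crv
       ON crv.concept_id_1 = sc.concept_id
      AND crv.relationship_id = 'Maps to value'
      AND crv.invalid_reason IS NULL
LEFT JOIN concept vc
       ON vc.concept_id = crv.concept_id_2
WHERE sc.vocabulary_id IN ('ICD10CM');  -- vocabularies of interest
```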


Hi Vojtech,

With larger conversions, our best practice at Odysseus is to break the ETL down into parts to facilitate the joins and the logic for determining where the source data will end up in the CDM. I will explain at a high level how this works, and I hope it is helpful.

Pass 1) Staging - for the scoped source data that will be part of the conversion, isolate this data into its own set of tables, 1:1 from the data source. These can live in a “stg” schema, for example.

Pass 2) Cleansing/Filtering - apply data cleansing rules to the source data where applicable to CDM conventions. In some cases this may need to be performed later, for different rules identified in later steps. Filtering out data that is not wanted or not to OMOP convention is also a consideration at this step.

Pass 3) Dimension - in most cases, the dimension table logic is straightforward: concepts can be mapped via case/decode logic, or joined to the Athena vocabulary in a simple way to identify most of the concept IDs that will be used.

Pass 4) Clinical Event tables - to get the source data to the right CDM tables, we must pass through several steps and this is where you can start building your ETL process to scale appropriately. Depending on the complexity of the source data, there may be more or less of these steps - but I will attempt to break them down here.

Pass 4a) Accumulate source data per expected domain - write rules for each of the source tables (now in stage) that will populate a holding table. “temp” or “work” is a naming convention that can be used for these. So for example, diagnosis data may be present in 2 or 3 different tables, and possibly amongst different fields (claims data may have a billing code, but you may also want to use data from a problem list field, so it depends). These separate rules should be written as separate ETL statements. The ETL is gathering this data into the domain level “temp” table. Repeat this for all of the CDM domains.
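The accumulation step above might look like this in SQL (the `stg.*` table and column names are placeholders for whatever the source actually provides):

```sql
-- Gather diagnosis-like data from several staged source tables into one
-- domain-level holding table; keep the source table name for provenance.
CREATE TABLE temp_condition AS
SELECT bene_id,
       clm_id,
       dx_code,
       'ICD10CM'           AS source_vocabulary_id,
       clm_from_dt         AS event_date,
       'inpatient_header'  AS src_table
FROM stg.inpatient_header
UNION ALL
SELECT bene_id,
       clm_id,
       dx_code,
       'ICD10CM',
       clm_from_dt,
       'outpatient_header'
FROM stg.outpatient_header;
```

Keeping `src_table` (and any source row key) in the temp table makes the later deduplication and debugging steps much easier.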

Pass 4b) Clean “work” tables - identify additional cleansing rules that need to be applied to the domain level data - naming convention would be “work_clean” or “temp_clean”

Pass 4c) Join the cleansed tables to vocabulary - finally, join the tables to your vocabulary to populate concept_ids (for a final CDM, this would include custom mappings). We can call these tables “work_clean_mapped”, and so forth.
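A sketch of this vocabulary join, storing the target domain alongside the mapped concept so the routing step needs no further vocabulary lookup (input/output table names are illustrative):

```sql
-- Map cleansed domain-level records to standard concepts via 'Maps to',
-- carrying the target domain_id forward for the routing pass.
CREATE TABLE work_clean_mapped AS
SELECT t.*,
       sc.concept_id AS source_concept_id,
       tc.concept_id AS target_concept_id,
       tc.domain_id  AS target_domain_id
FROM temp_condition_clean t
LEFT JOIN concept sc
       ON sc.concept_code  = t.dx_code
      AND sc.vocabulary_id = t.source_vocabulary_id
LEFT JOIN concept_relationship cr
       ON cr.concept_id_1 = sc.concept_id
      AND cr.relationship_id = 'Maps to'
      AND cr.invalid_reason IS NULL
LEFT JOIN concept tc
       ON tc.concept_id = cr.concept_id_2
      AND tc.standard_concept = 'S';
```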

Pass 4d) Populate CDM tables - from the “work_clean_mapped” tables - identify final domains and route records to appropriate CDM clinical event tables - you can make this easier by storing the vocabulary domain in the “work_clean_mapped” so you can eliminate this costly join

Pass 4e) Deduplication - When combining from multiple source tables it is possible for there to be overlap in the records - so a deduplication step is needed to eliminate these multiple records of the same event.
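Deduplication is commonly done with a window function; a sketch, assuming the mapped table from the previous passes (the choice of dedup key and tie-break order is a convention decision, not a rule):

```sql
-- Keep one row per (person, mapped concept, date), preferring a
-- deterministic source-table order when the same event appears twice.
CREATE TABLE condition_dedup AS
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY bene_id, target_concept_id, event_date
           ORDER BY src_table, clm_id
         ) AS rn
  FROM work_clean_mapped
) x
WHERE rn = 1;
```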

Pass 5) Derived CDM tables - there are many shared scripts to use for this available from the community

This process should be iterative with any effort on custom vocabulary mappings for the source data. I also suggest using a sample of the data, so it is more manageable while developing and testing your ETL code.


@Vojtech_Huser We are looking at converting the ResDAC Medicaid data as well (we have 2006-2017, so still need to get the latest years). We would be very interested in working with you all on the ETL development process if that would be of help.




Thank you for all replies here and via email.

We are making progress.

A few points:

  • For DDL translation into the Spark dialect (and to render Achilles), we would benefit from SqlRender support for Spark (@schuemie, @Ajit_Londhe)
  • 2B-range concepts and tables like cte_vocab_map (and whether to use concept_relationship: use ‘Maps to’, or convert directly to the target concept and not use 2B local concepts at all)
  • advice on type concepts (e.g., our observation periods come from insurance records and we don’t see a good type concept for that; there is an EHR type, which we are using currently)
  • “stem” table examples (like Don provided above) and past ETLs using them are very helpful
  • the size of the input data and solving the challenges related to it (even with Spark, it may be a problem)
  • a new OHDSI repo just for ETLs

We are nearing a solution for providing a Databricks test cluster to @schuemie . Hoping to have it in the next 2 weeks. That will help us merge the Spark code into SqlRender.

Use this concept_id = 32881

If you don’t create 2B concepts and concept_relationships, the mappings won’t be visible in Atlas.
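A minimal sketch of registering a local code in the 2-billion range with a ‘Maps to’ row, so Atlas can surface the mapping. The concept_id, names, and the example target (201826, type 2 diabetes mellitus) are illustrative only:

```sql
-- Hypothetical local concept in the >2B range reserved for site-local use.
INSERT INTO concept
  (concept_id, concept_name, domain_id, vocabulary_id, concept_class_id,
   concept_code, valid_start_date, valid_end_date)
VALUES
  (2000000001, 'Local plan code X123', 'Condition', 'LOCAL_SRC', 'Undefined',
   'X123', DATE '1970-01-01', DATE '2099-12-31');

-- Link it to a standard target concept so Atlas shows the mapping.
INSERT INTO concept_relationship
  (concept_id_1, concept_id_2, relationship_id,
   valid_start_date, valid_end_date)
VALUES
  (2000000001, 201826, 'Maps to', DATE '1970-01-01', DATE '2099-12-31');
```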

Just to update this thread: we continue to work on the OMOP ETL at NLM/NIH. It is a multi-year plan, so see my next post here on the forum for a further update.