Best Approach for Combining Harmonized Data from Two ETLs

manoshatzak · October 15, 2024, 2:12pm

Hello everyone!

I’ve successfully completed ETL processes for two datasets from two different hospitals, with the goal of harmonizing them for joint analysis. For the ETL procedures, I used the open-source repository provided by The Hyve (https://github.com/thehyve/ohdsi-etl-prias).

Now, I’m wondering what the best approach is to combine the results of these two ETLs. Should I merge all harmonized data into a single schema, or is it better to maintain two separate schemas for each dataset and perform the same research question in two steps?

I would appreciate any advice or best practices on how to proceed with this!

TPhillips · November 22, 2024, 6:33pm

As with most questions, it depends on your use-case. We have done it both ways. There are pros/cons to each approach.

If you want to look all the data the same, regardless of source, then merging them together makes sense, and possibly using the _type_concept_id to identify the different sources. The difficulty is “merging” the data and removing duplicated data, or not, depending on your need. How do you identify a patient as the same patient across sources, etc. PS. That is a very large topic. For example, an EHR may provide a drug administration as an NDC, but the claim system uses a HCPCS JCode. Do you make this 1 record, or 2 records?

If you want to use the data as source specific, you may want to treat them separately. Then if you later want them together, you face the same challenges above.

manoshatzak · December 11, 2024, 8:24am

Thank you very much for your reply!!