OHDSI Home | Forums | Wiki | Github

Best Approach for Combining Harmonized Data from Two ETLs

Hello everyone!

I’ve successfully completed ETL processes for two datasets from two different hospitals, with the goal of harmonizing them for joint analysis. For the ETL procedures, I used the open-source repository provided by The Hyve (https://github.com/thehyve/ohdsi-etl-prias).

Now, I’m wondering what the best approach is to combine the results of these two ETLs. Should I merge all harmonized data into a single schema, or is it better to maintain two separate schemas for each dataset and perform the same research question in two steps?

I would appreciate any advice or best practices on how to proceed with this!

As with most questions, it depends on your use-case. We have done it both ways. There are pros/cons to each approach.

If you want to look all the data the same, regardless of source, then merging them together makes sense, and possibly using the _type_concept_id to identify the different sources. The difficulty is “merging” the data and removing duplicated data, or not, depending on your need. How do you identify a patient as the same patient across sources, etc. PS. That is a very large topic. For example, an EHR may provide a drug administration as an NDC, but the claim system uses a HCPCS JCode. Do you make this 1 record, or 2 records?

If you want to use the data as source specific, you may want to treat them separately. Then if you later want them together, you face the same challenges above.

1 Like

Thank you very much for your reply!!

t