I’ve successfully completed ETL processes for two datasets from two different hospitals, with the goal of harmonizing them for joint analysis. For the ETL procedures, I used the open-source repository provided by The Hyve (https://github.com/thehyve/ohdsi-etl-prias).
Now, I’m wondering what the best approach is to combine the results of these two ETLs. Should I merge all harmonized data into a single schema, or is it better to maintain two separate schemas for each dataset and perform the same research question in two steps?
I would appreciate any advice or best practices on how to proceed with this!
As with most questions, it depends on your use-case. We have done it both ways. There are pros/cons to each approach.
If you want to look all the data the same, regardless of source, then merging them together makes sense, and possibly using the _type_concept_id to identify the different sources. The difficulty is “merging” the data and removing duplicated data, or not, depending on your need. How do you identify a patient as the same patient across sources, etc. PS. That is a very large topic. For example, an EHR may provide a drug administration as an NDC, but the claim system uses a HCPCS JCode. Do you make this 1 record, or 2 records?
If you want to use the data as source specific, you may want to treat them separately. Then if you later want them together, you face the same challenges above.