SQL schemas to support multiple OMOP CDMs

HNY all!

My team is working to install and deploy OHDSI tools for multiple OMOP v5 CDMs (the usual suspects… Optum, Truven). A debate has ensued over how to support multiple data sets in the SQL schema.

My one colleague says, “If the end user wants to access both datasets at the same time or through the same URL, both datasets need to be loaded into the same set of tables in a single schema, i.e., all patient data from both Truven and Optum need to be loaded into the same OMOP PERSON table. If the end user wants to access both datasets separately, they should be loaded into different schemas, i.e. different sets of tables, and then they will be accessed from separate URLs for OHDSI tools (installing two sets of OHDSI tools).”

I’ll admit I’ve been combing the physical documentation and scratching my head over how to work through this. I just know this isn’t quite right. The OMOP CDM is a data mart by design. I would not collapse all the data into the same OMOP PERSON table; I would retain each instance separately. I would also guess that it’d be very odd to have to reinstall the suite of tools for every single data mart. I understand running new Achilles JSON files per data set, but a full ATLAS installation for each data mart seems excessive.

I read an interesting comment by [quote=“Rijnbeek, post:1, topic:1902”]
What I was thinking about is a central server that hosts the CDM and vocabularies (read-only) and add a writable scratch schema there (as is done in JnJ).
[/quote]

So I’m wondering, can someone direct me and my team to the right installation documentation to support running tools on multiple data marts? I’m certain JnJ (@Patrick_Ryan and @Frank) has mastered this but I’ll admit, I’m grasping at straws at how to get my ETL team to understand how to implement this in practice. Looking for some breadcrumbs on how to lead us down a more sound path.

Many thanks!!
Kristin


@krfeeney:

Happy New Year to you too.

We generally have a problem with “if the end user wants”. What is the use case? I’d never make any such decision without understanding what problem they actually want to solve.

Does the “end user” want to pool data to get better sample size? If so, you can always create a view by unioning the same tables from different schemas. But the problem with Truven and Optum is the data are not really that additive. You have the same patients in both databases with different enrollment periods. Otherwise Truven + Optum + PharMetrics would be greater than the entire population of the country. And you cannot de-duplicate them unless you have explicit permission of the data providers, because it requires re-identification of patients.
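If pooling really is the use case, the view idea above can be sketched in SQL. This is a hypothetical example: the schema names `truven` and `optum`, the `pooled` schema, and the `person_id` offset are all assumptions, and only a few PERSON columns are shown. Note the offset only prevents key collisions between sources; it does nothing to de-duplicate patients enrolled in both databases.

```sql
-- Sketch: a pooled view over the same table in two vendor schemas.
-- Schema names and the person_id offset are illustrative assumptions.
CREATE SCHEMA IF NOT EXISTS pooled;

CREATE VIEW pooled.person AS
SELECT person_id + 0         AS person_id, gender_concept_id, year_of_birth
FROM   truven.person
UNION ALL
SELECT person_id + 100000000 AS person_id, gender_concept_id, year_of_birth
FROM   optum.person;
```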

Bottom line: Everybody I know keeps the data sources in separate schemas.

Hi, @krfeeney,

The WebAPI/Atlas tools support multiple CDMs being referenced by a single WebAPI instance. What you should do is tell your ETL team to populate the CDM tables in their own, vendor-specific CDM schema. Your colleague is incorrect that you must push all the data into a single CDM schema; in fact, internally we make an explicit rule that we do not mix source data together into the same CDM instance.

You haven’t said what DB platform you are working on, so I’ll offer an RDBMS-agnostic solution for you.

First, we’ll pretend that you’re on a PostgreSQL database named CDM_DB (for JDBC URL examples) and you can create separate schemas for each vendor: vendor1, vendor2, vendor3.
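As a minimal sketch (using the database and schema names assumed above; the host in the JDBC URL is a placeholder):

```sql
-- Sketch: one schema per data vendor inside the CDM_DB database.
-- Clients would connect with a JDBC URL like
--   jdbc:postgresql://<host>:5432/CDM_DB
CREATE SCHEMA vendor1;
CREATE SCHEMA vendor2;
CREATE SCHEMA vendor3;
```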

You would need to execute the OMOP CDM DDL 3 times (once for each vendor above), but modify the DDL to create the tables specifically in the vendor1, vendor2, and vendor3 schemas. Note: these schemas include the subset of tables for the CDM Vocabulary, so you will need to create and load the Vocabulary tables for each CDM.
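On PostgreSQL specifically, one way to avoid hand-editing every CREATE TABLE statement is to set the `search_path` before running the stock DDL script, so unqualified table names land in the target vendor schema. A sketch (the PERSON columns shown are abbreviated for illustration):

```sql
-- Sketch, assuming PostgreSQL: unqualified CREATE TABLE statements
-- go into the first schema on the search_path.
SET search_path TO vendor1;

-- An unqualified statement from the CDM DDL, e.g.:
CREATE TABLE person (
  person_id         BIGINT  NOT NULL,
  gender_concept_id INTEGER NOT NULL,
  year_of_birth     INTEGER NOT NULL
  -- ...remaining OMOP v5 PERSON columns...
);
-- ...is created as vendor1.person. Repeat for vendor2 and vendor3.
```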

When you set up WebAPI, you will need to create another database (we’ll call it OHDSI_DB) with a schema in it (we’ll call it ohdsi) that will automatically get tables created in it when WebAPI starts up. We can come back to the details of how to set this up, but for now I’d just like to end on the note that you should plan on separate schemas to store the CDM tables, one schema for each CDM ETL’d from a vendor.
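For later reference, WebAPI is typically pointed at each CDM schema via rows in its SOURCE and SOURCE_DAIMON tables. The following is a hedged sketch based on common OHDSI WebAPI conventions; the connection string, source key, and the `vendor1_results` schema are illustrative assumptions, so check the WebAPI documentation for your version before running anything like it:

```sql
-- Sketch: registering one CDM with WebAPI (values are illustrative).
INSERT INTO ohdsi.source
  (source_id, source_name, source_key, source_connection, source_dialect)
VALUES
  (1, 'Vendor 1 CDM', 'VENDOR1',
   'jdbc:postgresql://<host>:5432/CDM_DB', 'postgresql');

-- daimon_type: 0 = CDM tables, 1 = Vocabulary tables, 2 = Results tables
INSERT INTO ohdsi.source_daimon
  (source_daimon_id, source_id, daimon_type, table_qualifier, priority)
VALUES
  (1, 1, 0, 'vendor1', 1),
  (2, 1, 1, 'vendor1', 1),
  (3, 1, 2, 'vendor1_results', 1);
```

Each additional vendor gets its own SOURCE row and daimon rows pointing at its own schemas, which is how one WebAPI/Atlas instance serves multiple CDMs.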

After you have this set up, and you have concrete details about what your environment will look like (which platform you’re using, your CDM database name, the different CDM schemas in it), I can follow up with environment-specific instructions to get you going.

-Chris


Thank you both, @Christian_Reich and @Chris_Knoll!

I appreciate the sanity check. My team is working through the DDLs for the data sets. They’re quietly moving forward. Chris, I’ll definitely be in touch with more questions as things come together.

Thanks again!!
Kristin

@Chris_Knoll - took a bit of time but we’re finally in a place to continue this conversation. My colleague and I are testing our ATLAS deployment and think there are a few places where our schema mappings are causing bugs in the WebAPI. Any chance we could exchange a few screenshots via email with details of our backend configuration, or do a quick call to walk through it? We appreciate the extra set of eyes!

Sure, I can also set up a WebEx and we can share screens that way, and give a bit more real-time feedback. Let me know what your availability is (please private message me details on how to reach you) and I’ll coordinate a meeting.

-Chris
