
Database platform support revisited

(Martijn Schuemie) #1

A constant theme in OHDSI appears to be the desire to support more and more database platforms, as you can see for example in this thread, this thread, and this thread. I just received an e-mail from an enthusiastic contributor interested in adding Snowflake.

The current state is that we have a core of well-supported platforms, including SQL Server, Oracle, PostgreSQL, and PDW. To my knowledge, all our tools run smoothly on these platforms, and for all except PDW we have unit tests with continuous integration in place to ensure (to some extent) that it stays that way.

Then there are other platforms such as Impala and BigQuery that many in the community have contributed to, but which I am not sure are currently truly supported by the entire OHDSI toolstack.

As you can read in the threads I referenced, I’m in favor of limiting the number of platforms we support, instead focusing on delivering maximum reliability on the platforms we do support. Others disagree, and appear to be willing to provide support for other platforms. However, this has the danger of creating a patchwork of support across our tools. For example, I’m unsure CohortMethod runs on BigQuery or Impala, and if it doesn’t, I would not be able to debug it.

If I can’t have my way (supporting only a few platforms well), then at least we should be transparent about the extent to which we support each platform. For this purpose I’ve set up a strawman Wiki page that tries to list what is required to support a database platform, and attaches names to each tool-platform combination. The idea is that each named person is responsible for ensuring that the specific tool runs on the specific platform.

Let me know where you stand. Is supporting many platforms a bad idea in general? Are you willing to put down your name in one of the empty cells? Could the ATLAS/WebAPI team help clarify what they think is required to support their software?

(Gregory Klebanov) #2

Hey @schuemie,

Here is what we discussed with Patrick, or at least in the context of ATLAS.

  1. There are “core” OHDSI databases. These are supported by the OHDSI community - meaning features are developed, tested, and supported against them by the community core team. I think you correctly listed them as SQL Server, Oracle, PostgreSQL, and PDW. In all fairness, I am not sure how PDW (or Oracle or SQL Server) made this list, since those are commercial databases that require a license. Or is it based on community adoption? If so, it would be nice to capture the list of community users, so that we at least know this is true.

  2. There are other databases that are evolving as part of ATLAS adoption. The model here is simple - organizations or commercial vendors can offer to provide support for those, at least as long as there is a need. For example, Odysseus is currently providing support for Redshift, Impala, and Google BigQuery since we have commercial customers using these. I also believe James Wiggins and Amazon provide ongoing support for Redshift as well, so perhaps Redshift should move into the first (“core”) category.

That raises two important questions, both of which have been raised multiple times before.

  1. To be able to provide development and support as part of the community, there needs to be community infrastructure. Similarly to you, Odysseus has no access to three of the four core databases since they require a commercial license, yet we are part of the “core” team doing development and support for ATLAS. I know that as part of the ATLAS 2.8 work, Lee has been working to set up a truly shared infrastructure, but I am not sure what the latest status of this is. Also, PDW would definitely not fit into that either way.

  2. When we say “provide support”, do we mean ATLAS or the broader OHDSI ecosystem? So far Odysseus has primarily focused on ATLAS, along with some fixes for the PLE and PLP packages. Martijn just mentioned the CohortMethod package, and there are certainly other packages that are important for OHDSI as well. Not only does this significantly expand the scope, but we would also need to decide what makes up the list of “gold” OHDSI packages and tools that need to be supported - otherwise it could be a universe of things.

I think this certainly has its challenges, but the truth of the matter is that a need for various databases is inevitable. OHDSI is a worldwide community; we cannot require all members to use PostgreSQL (which will not even scale beyond small databases) or guide them to buy SQL Server or Oracle licenses. In any mature organization there are corporate standards, and those exist for a reason. At the same time, I do believe that if some organization wants to add a database they use for OMOP hosting, they should either provide support for it themselves or find a commercial vendor who will (and is able to).

(Stephen Pace) #3

Hi, speaking as the enthusiastic contributor mentioned by Martijn, I recognize the difficulty of supporting additional platforms. At my previous company, we had a product with millions of lines of code, and each new port added a huge amount of testing that went well beyond the initial effort of doing the port. That said, it really comes down to how easy you want it to be for companies to adopt this model going forward in a rapidly changing database landscape.

We are at one of those generational shifts in technology, with the majority of new workloads going to the Cloud. Some companies tell us they are “Cloud first” and never plan to buy another server; some are closing their data centers altogether. The three biggest Cloud vendors in the US are AWS, Azure, and GCP. That means the database momentum for new workloads is towards Amazon Redshift, Azure SQL Data Warehouse (recently renamed Azure Synapse Analytics), Google BigQuery, and Snowflake (which spans all three of those platforms: AWS, Azure, and GCP). This movement to the Cloud has also hurt the Hadoop vendors (MapR almost went bust; Cloudera and Hortonworks merged into a much smaller entity) as well as the established MPP platforms (IBM dropped support for Netezza and fumbled its move to the Cloud, as did Teradata and, to a lesser extent, Oracle).

Access to the platforms for development and testing purposes is important. If full support is something you would consider, Snowflake has some partner programs that I can look to enable for you. (To start with, we have trial instances that anyone can spin up, giving you full access for 30 days preloaded with $400 in credit that you could start using today.) We can also help with porting SqlRender’s replacementPatterns.csv, and we have a JDBC driver for the database connections. Maybe you could consider three tiers of support: “Core” (everything), “Base” (core schemas but missing some modules), and “Beta - In Progress”, so that those who want to jump in and help have a Github branch they can work on with a clear set of tasks to move from Beta to Base to Core.

I came to this because a large insurer expressed some interest, so I started to look into it. I was unaware that someone had already ported the schema to Snowflake last August (though the check-in wasn’t accepted), and then Martijn pointed out that support involves a lot more moving pieces, which made more sense of the situation. I’m happy to help out with our part, especially if it helps customers clear a few of the hurdles to implementing.