Database platform support revisited

schuemie · November 25, 2019, 1:14pm

A constant theme in OHDSI appears to be the desire to support more and more database platforms, as you can see for example in this thread, this thread, and this thread. I just received an e-mail from an enthusiastic contributor interested in adding Snowflake.

The current state is that we have a core of well-supported platforms, including SQL Server, Oracle, PostgreSQL, and PDW. To my knowledge, all our tools run smoothly on these platforms, and for all except PDW we have unit tests with continuous integration in place to ensure (to some extent) that it stays that way.

Then there are other platforms such as Impala and BigQuery that many in the community have contributed to, but which I am not sure are currently truly supported by the entire OHDSI toolstack.

As you can read in the threads I referenced, I’m in favor of limiting the number of platforms we support, instead focusing on delivering maximum reliability on the platforms we do support. Others disagree, and appear to be willing to provide support for other platforms. However, this has the danger of creating a patchwork of support across our tools. For example, I’m unsure CohortMethod runs on BigQuery or Impala, and if it doesn’t, I would not be able to debug it.

If I can’t have my way (supporting only a few platforms well), then at least we should be transparent about the extent to which we support each platform. For this purposed I’ve set up a strawman Wiki page that tries to list what is required to support a database platform, and has names attached to each tool-platform combination. The idea is that that person is responsible for ensuring the specific tool runs on the specific platform.

Let me know where you stand. Is supporting many platforms a bad idea in general? Are you willing to put down your name in one of the empty cells? Could the ATLAS/WebAPI team help clarify what they think is required to support their software?

gregk · November 25, 2019, 2:17pm

Hey @schuemie,

Here is what we discussed with Patrick, or at least in the context of ATLAS.

There are “core” OHDSI databases. Those are being supported by OHDSI community - meaning features are developed, tested and supported against them by the OHDSI community core team. I think you correctly listed them as SQL Server , Oracle , PostgreSQL , and PDW. In all fairness, I am not sure how PDW (or Oracle or SQL Server) made this list since those are commercial database that require license. Or is based on community adoption - then would be nice to capture community users, so that we know this is true - at least.
There are other databases that are evolving as a part of ATLAS adoption. The model here is simple - organizations or commercial vendors can propose to provide support for those, at least until there is a need. For example, Odysseus is currently providing support for RedShift, Impala and Google BigQuery since we have commercial customers using these. i also believe James Wiggins and Amazon do provide ongoing support for RedShift as well and we might have moved RedShift into the first category (“core”).

That raises two important questions - also raised multiple times.

To be able to provide development and support as a part of community - there needs to be community infrastructure. Similarly to you, Odysseus has no access to 3 out of 4 core database since they require a commercial license yet we are a part of the “core” team doing development and support for ATLAS. I know that as a part of ATLAS 2.8 work, Lee has been working to set up a truly shared infrastructure but not sure what the latest status of this is. Also, PWD would definitely not fit into that either way
When we mean “provide support”, do we mean ATLAS or a broad OHDSI ecosystem. Right now Odysseus was primary focused on ATLAS and doing some fixing for PLE and PLP packages. Martijn just mentioned the CohortMethod package - and there are certainly other packages that are important for OHDSI as well. Not only this is significantly expanding a scope but we would also need to decide what makes up a list of “gold” OHDSI packages and tools that need to be supported - otherwise it could be a universe of things.

I think it certainly has its challenges but the truth of the matter is - having a need for various database is inevitable. We cannot expect that all OHDSI members - being a worldwide community - would require all members to use PostgreS (not even going to scale beyond small size database) or guide them to buy SQL Server or Oracle licenses. In any mature organization, there are organizational corporate standards and those exist for a reason. At the same time, I do believe if some organization wants to add a database they use for OMOP hosting - they either should provide support for that or find a commercial vendor who will (and able to).

Stephen_Pace · November 26, 2019, 2:08am

Hi, speaking as the enthusiastic contributor mentioned by Martijn, I recognize the difficulty of supporting additional platforms. At my previous company, we had a product with millions of lines of code, and each new port added a huge amount of testing that went well beyond the initial effort to do the port. That said, it really comes down to how easy you want companies to be able to adopt this model moving forward in a rapidly changing database landscape.

We are at one of those generational shifts in technology, with the majority of new workloads going to the Cloud. We have some companies telling us they are “Cloud first” and never plan to buy another server. Some are closing their data centers altogether. The three biggest Cloud vendors in the US are AWS, Azure, and GCP. That means the database momentum for new workloads is towards Amazon Redshift, Azure Data Warehouse (recently renamed Azure Synapse Analytics), Google BigQuery, and Snowflake (which spans all three of those platforms, AWS, Azure, and GCP). This movement to the Cloud has also hurt the Hadoop vendors (MAPR almost went bust, Cloudera and Hortonworks merged into a much smaller entity) as well as the established MPP platforms (IBM dropped support for Netezza, fumbled their move to the Cloud, as did Teradata and–to a lessor extent–Oracle as well).

Access to the platforms for development and testing purposes is important. If full support is something you would consider, Snowflake has some partner programs that I can look to enable for you. (Initially we have trial instances that anyone can spin up that give you full access for 30 days preloaded with $400 in credit that you could start using today.) We can also help with porting the SqlRender/replacementPatterns.csv and we have a JDBC driver for the database connections. Maybe you could consider three tiers of support. “Core” (everything), “Base” (core schemas but missing some modules), and “Beta - In Progress” so that those who want to jump in and help support have a Github branch they can work on with a clear set of tasks to move from Beta to Base to Core.

I came to this because a large insurer expressed some interest so I started to look into it. I was unaware someone had already ported over the schema to Snowflake last August (but the check-in wasn’t accepted) and then Martijn pointed out that support has a lot more moving pieces to it so that made more sense. I’m happy to help out with our bit, especially if it helps customers jump over a few hurdles implementing.

schuemie · April 6, 2020, 11:51am

Yes, I’m opening this discussion again.

I have spend many hours these past weeks in providing support for BigQuery. I could have spend that time on adding new features (e.g. a Shiny app for SelfControlledCaseSeries, and removing our dependency on ff), or generating evidence for patients. And I know I’m not the only one.

As OHDSI we really must decide: do we want to either

support as many platforms as possible, or
improve the functionality and reliability of our software

these two options are mutually exclusive in my opinion. My proposal is that we commit to a few platforms, and I do not really care much which ones. And the we fully commit to those, meaning for example we also make extensive testing facilities available to developers.

@MauraBeaton: could you put this on the agenda for this week’s Steering Committee call?

gregk · April 6, 2020, 4:43pm

Hi @schuemie,

First of all - thank you for your help last week, definitely appreciated. Last week was certainly an interesting one that would be worth doing reflections on about what worked, did not and what can we do to improve.

From a perspective of making a choice between the two choices you highlighted, I would support

support as many platforms as possible, AND
improve the functionality and reliability of our software

I simply believe that if OHDSI wants to be inclusive - and not exclusive - and increase data partner participation. we need to support a wider range of databases that are convenient to its core members. However, I do agree we should do better next time and personally would focus on these three areas:

Better testing
Better execution
Better visibility on work done across components and closer collaboration and planning across component owners

We provide support for BigQuery and our first experience was not exactly what I would call a “smooth run” :(…

However - as a result of working through COVID-19 study last week - I see a number of gaps in our process, including our internal understanding of a wider process. For example, we do intensive testing for ATLAS on BigQuery but a few issues surfaced outside of ATLAS related to R, especially when trying to use these packages on new types and size of of data.

Also, quite a few issues that we were fighting with were defects in R, R environment incompatibility etc.

And yes, we did discover some unexpected BQ limits when trying to execute packages - we will be working with Google on understanding these better.

So, I think we need to discuss how we can do these better:

If we know about activities in other packages ahead of time - we would test it and make sure BigQuery (or other database we support) work. For example, it would be wonderful if we have a wider monthly OHDSI eco system planning where we discuss these, assess impact and plan cross package activities.
there are a lot of history in code R packages - SQL Renderer as one example. Yes, there are some hiccups that we experienced due to not clearly understanding certain design intentions - oracleTempSchema is one. Not sure what the best solution is here but I feel that establishing a closer community collaboration would help us to avoid the misunderstanding and misuse of these.
I believe we need to finally standardize on our execution environment. As an idea, we could use ARACHNE Data Node - which includes a clean R execution environment with packaged OHDSI libraries - to both test packages while under developed and use that environment to execute across data providers. This way we do not have to fight with R and other environment issues in cases if developer uses one version but it is being executed on another version.
We need to come up with a better test data set / sets. Not all of us have access to a wide range of data directly. And SynPUF does not really help here.
In ATLAS, we can generate PLE and PLP skeletons. At least it gives us a consistent code base. But there is no way to do the same for the rest of analysis. If R is still a way we want to continue to execute studies going forward, I would propose to look at how we can build ATLAS to generate R code for all types of analyses supported by ATLAS. Where we have ATLAS - we should promote to use it, where we do not - at least it will be our standard code base.

So, yes - I think we can / will do better next time, but I believe we should focus on understanding where the gaps were and seeing how we can prevent these issues from happening next time by putting more effort in improving our joint testing, coordination and execution. Let’s discuss these in our next steering call

Vojtech_Huser · April 6, 2020, 10:11pm

Another option is to try to increase the number of developers tasked with SqlRender support. If a set of users needs support for db platform X - they have to contribute development hours to SqlRender (+ OHDSI-verse) for platform X.

Ajit_Londhe · April 7, 2020, 7:57pm

It’s a balancing act for sure. For Amgen to be able to adopt OHDSI tools, I needed to add Spark to SqlRender, but I also fully appreciated @schuemie’s earlier post about wanting to limit the number of supported dialects. It’s a burden to provide good solutions for 1 dialect let alone multiple. Spark, for instance, has a lot of annoying SQL shortcomings that I’ve had to get creative with solving, and I wouldn’t want @schuemie to have to get into the nitty gritty of them because he shouldn’t have to.

I’m wondering about the following model:

OHDSI officially backs the most widely-used dialects. The selection of dialects can be data-driven, e.g. based on OHDSI sites and industry trends. This selection can be revisited annually.
For niche dialects (like Spark today), we identify owner(s) that hold the responsibility of that dialect’s users to engineer solutions and provide support, not OHDSI.

and exactly 1 of the following (oh god, I’m writing this post like an Atlas cohort definition)

We would have prominent disclaimers in the SqlRender repo about the relative quality of those dialects compared to the official dialects. For the other software package repos, we could use an issue template with that disclaimer to try to prevent people from posting issues about dialect there.
We simply do not merge these dialects into master, but instead, it’s up to sites to maintain their own branches of affected tools: SqlRender, DatabaseConnector, WebAPI, Atlas.

Patrick_Ryan · April 7, 2020, 8:32pm

@schuemie : I think you are raising an important topic, which we’ve discussed several times and is worth continued re-evaluation on a regular basis.

We are trying to do something extraordinarily ambitious here in OHDSI: create a community of developers who can collaborate to build an open-source analytical ecosystem which can be deployed across a range of different healthcare environments by a diversity of stakeholder groups with disparate IT/informatics/statistical skills, and ensure that the analytics ecosystem is sufficiently capable and performant such that observational data contained within these environments can be analyzed within this ecosystem to design and execute characterization, estimation, and prediction studies which generate reliable evidence that promotes better health decisions and better care. And if that wasn’t already hard enough, we increased the degree of difficulty by another order of magnitude by also trying to accommodate multiple different technology stacks which have varying capabilities/strengths/limitations and which are annoyingly inconsistent in their approach to processing ‘standard syntax’ like SQL. And if THAT wasn’t hard enough, we’ve tied one arm behind our back by encouraging community development to take place but not providing the community a centralized environment for testing that can be applied across all of the tech stacks that we would like to support. So, we find ourselves in a constant game of whack-a-mole, where a fix that someone contributes which may work in one environment may induce a break in another environment. I agree the current approach we are taking now as a community is sub-optimal and not sustainable. Just as our community is maturing in the scientific best practices we employ to collaboratively more generate reliable evidence, our community needs to mature in its collaborative approach to open-source development. This isn’t about the individual contributions from any one person or organization, I greatly appreciate the support that everyone is trying to provide; this about how we move forward as one community by having a shared approach to development, testing, and deployment so that together we can advance OHDSI’s mission as efficiently as possible.

As I see it, the question isn’t whether we support a small or large number of database platforms, but rather the question is ‘HOW do we support database platforms as a community?’.

I see two minimum requirements for a database platform to be declared ‘supported’ within the OHDSI community:

The platform provides the technical capability to perform a defined set of functions that are deemed necessary to conduct observational analyses and generate reliable evidence. Declaring that we support multiple platforms creates an explicit constraint that we are limited to ‘the lowest common denominator’ in terms of platform functionality. We need to set our standards of what the ‘lowest acceptable capability’ is. So, going back a bit into some of our initial ecosystem design decisions: OHDSI is building a toolkit that uses SQL as a database extraction layer and R as an analytics layer (with JDBC providing the connection between the database environment and R environment). Within SQL, there are a set of functions that we determined to be essential to enabling the types of analyses we are focused on with OHDSI: SELECT, CREATE, INSERT, DELETE, UPDATE operations, aggregate functions like COUNT, SUM, MIN, MAX and AVG, windowing functions like ROW_NUMBER OVER (PARTITION BY X ORDER BY Y), date-based arithmetic, ability to create TEMP tables and alias them multiple times within a subquery, JOINing tables using inequalities, support for multiple schemas, etc. It was for this technical reason that we made the conscious decision early on not to support environments that didn’t meet these minimum standards: at the time, this was the basis for explicitly not supporting mySQL and SAS.

As a general premise, I would assert that we should never sacrifice our goal of reliable evidence generation or limit our analytical capabilities beneath some minimum standard just to accommodate a broader array of database platforms, but we should also be flexible to recognize that we may determine in the future that we have to change our ‘minimum standard’ or that database platforms may become more feature-complete, such that decisions of which platforms should and should not be potentially eligible may change over time.

The platform, and domain expertise about the platform, needs to be readily accessible to the developer community such that the same battery of tests can be performed across all ‘supported platforms’ and validated to pass prior to analysis packages being released to the community. More ideally and likely the behavior we need to consider moving forward, is that all developers making individual contributions, such as bug fixes or feature enhancements, will first test their contribution across the ‘supported platforms’ before making a pull request, so that all testing doesn’t fall to the package owner.

Its #2 where I know we have had several discussions within the community over the past many months but for which we probably need renewed focus to translate those ideas into action. @anthonysena @Frank @Chris_Knoll @gregk @lee_evans and others had been initially discussing an improved approach to testing in the context of ATLAS development, with a target to focus on this as we initiate development on ATLAS 3.0 after ATLAS 2.8 is released. We can likely decouple the discussion on ATLAS development progress from the broader community needs, outlined below.

We need to have centralized testing environment where instances of the ‘supported platforms’ can exist with some test data in OMOP CDM format. I know @lee_evans has made considerable progress on the design of this, and the integration with Jenkins to allow for testing across multiple platforms. We probably need to formalize this a bit more, define the platforms currently available (or soon to be added) and the instructions for using the test environment in development. Supporting a platform within the OHDSI infrastructure requires resources: beyond Lee’s time and effort, there is the compute and platform licensing expenses. Here, I hope our friends at the various platform providers, such as AWS/RedShift, Google/BigQuery, and Microsoft, may be able to help us as I suspect they’d share our desire to have their platforms be in the centrally supported set within the OHDSI ecosystem.
We need to improve our practices in writing test cases to provide greater code coverage, so that each new feature added is accompanied with some series of tests to ensure the new feature is working as desired. We likely need to go back and pay off some of the technical debt of not having sufficient test code coverage for the existing set of tools. I know @jennareps and @Chris_Knoll have been hard at work in improving their test code coverage in the components they are leading, and I think that can serve as a good model to try to emulate elsewhere.
We need people in the community committed to being the named ‘platform expert support’ individuals with sufficient platform domain expertise that when issues arise that are platform-specific, they can be quickly resolved such that we don’t hold back progress for other community members who are using other platforms. I don’t know who within our community wants to raise their hand to be our resident expert in MSSQLServer, Oracle, PostgresQL, RedShift, BigQuery, Netezza, Spark, etc., but I would assert we need at least one name tied to each platform if we really want that platform to be sustainably supported. We can’t expect all developers in the community to have to be knowledgeable about all platforms (it’s hard enough for one developer to develop competency in just one platform!). If we have a test environment with all ‘supported platforms’ in place and a robust set of test cases that can run across those platforms, then when we find an issue that arises in only one platform, we should be able to collaborate with the ‘platform expert’ to find a platform-specific solution (and then re-test to ensure that any new solution posited doesn’t break on the other platforms).
More generally, as an OHDSI developer community, I think we need to align our approach to software development if we are going to try to be more explicit in ‘platform support’. Right now, it feels like we have a subgroup of developers focused on analytics development in R and a different subgroup of developed focused on web app development, but we really need to break down these silos and agree to basic principles for the process of software development, testing, and deployment. I know I started a discussion on the forums about this last year, but that discussion ended up being mainly about governance rather than implementation and I accept responsibility for not driving this through from discussion to action more thoroughly. Most of the stuff we need to sort out is truly technology-agnostic, but we need to agree as a community to adhere to guidelines and best practices for things like how we use GitHub, how we post issues and make pull requests, how we test new code, how we issue releases, etc. @schuemie drafted developer guidelines for the OHDSI Methods Library, and I this could likely serve as the useful starting point to reach consensus on the technical aspects of development.

I recognize there’s a real tension here between ‘doing the science’ and ‘doing the work to enable the science to be done’, and a clear urgency to do both as best we can. I also recognize that the OHDSI research community and data network is growing quite quickly, and the accelerating interest in using the OMOP CDM and OHDSI toolkit to generate real-world evidence is far outpacing the growth of our OHDSI developer community. We currently have a very limited set of resources (people and money) to support the maintenance and development of the OHDSI analytics ecosystem, so we have to be respectful of the time and energy that has gone into this foundational work and the efforts that will be required moving forward. If this pandemic has taught us anything thusfar, its that we can be creative in finding more effective ways to work together: even more virtual but even more collaboratively, particularly when matters of public health are on the line. Since our shared goal is to improve health by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care, I think its our responsibility to find a path forward that builds on the success of what we have already created together and provides a sustainable approach to development for the future ahead.

szarfman · April 8, 2020, 12:09am

Imagine if we would have a health care system whereas the data would be generated, collected, and transported to the end users in a totally interoperable way.

All the data acrobatics that we are currently forced to do would be history. We would seamlessly be able to automate the generation of analytical outputs without the need to constantly having to reconfigure the data and the tools.

Analyses could be tailored at improving comprehension. Costs in money, work, and time would be reduced, as well of medical errors.

The OHDSI community is showing that we need to move in that direction.

The knowledge gained in building an interoperable common data model needs to be extended, enhanced, and tested so we can use an extensively enhanced common data model to collect medical data in an interoperable way, without relying in data acrobatics for analysis.

schuemie · April 8, 2020, 6:56am

Thanks @Patrick_Ryan. Totally agree with everything you said.

I’d like to emphasize the importance of having a proper testing environment in place. I would argue we should as a rule not provide support for any platform for which the tool developers (like me) do not have access to a test instance.

This not only would save people like me a lot of trouble, it would also mean that in a network study (like our COVID-19 studies) data partners would experience far fewer problems when executing the analysis code on their data.

roger.carlson · June 2, 2022, 2:57pm

So, two and a half years later, what has become of support for Snowflake?

gregk · June 2, 2022, 7:56pm

@roger.carlson …quick update. Odysseus worked with Snowflake to enable OHDSI support. The work has been completed and we are in the final testing stage. The official announcement is coming out here on forums very soon - stay tuned

Adam_Black · August 30, 2022, 7:41pm

It looks like there are some PRs open for review regarding snowflake support opened by @alex-odysseus. I didn’t realize the conversation around this started in 2019.

github.com/OHDSI/WebAPI

Snowflake dialect support

OHDSI:master ← OHDSI:snowflake

opened 04:30PM - 15 Aug 22 UTC

alex-odysseus

+33 -16

Addresses https://github.com/OHDSI/WebAPI/issues/2069 It can be merged only a…fter releasing of the corresponding libraries: - DatabaseConnector (@schuemie please support to merge https://github.com/odysseusinc/DatabaseConnector/tree/snowflake into https://github.com/OHDSI/DatabaseConnector/tree/develop and release after) - SqlRender (@schuemie please support to merge https://github.com/odysseusinc/SqlRender/tree/snowflake into https://github.com/OHDSI/SqlRender/tree/develop and release after) - ACHILLES (@fdefalco please support to merge https://github.com/OHDSI/Achilles/tree/snowflake into https://github.com/OHDSI/Achilles/tree/develop and release after) - CommonDataModel (@clairblacketer please support to merge https://github.com/odysseusinc/CommonDataModel/tree/snowflake into https://github.com/OHDSI/CommonDataModel/tree/main) - ARACHNE Commons (https://github.com/OHDSI/ArachneCommons/tree/snowflake_webapi to be merged to https://github.com/OHDSI/ArachneCommons/tree/1.17.x and released after)

I’d like to bring up an important principle in software design that I think applies to this discussion. It is called the open-closed principle. It says that software components (like SqlRender) should be open for extension but closed for modification. Applied to this discussion it means that the core components of OHDSI tools should allow community members and organizations to extend the functionality in a way that does not require any changes to the core software. For example vsode is popular software application because of the huge number of extensions available for it. Every new extension adds value to the core set of functionality provided by vscode and doesn’t need to be maintained by the vscode developers.

My vision for “OHDSI Community Developed CORE Software” would be to embrace the open-closed principle and allow the community to extend it without modification of the core codebase. Extensions should be able add value to the core without introducing any changes or maintenance burden to the core. Adding/changing database platforms seems like a predictable way in which OHDSI software is likely to be extended/changed.

Chris_Knoll · August 30, 2022, 9:43pm

From a sql render perspective, it just means implementing some mechanism of ‘hooks’ which would allow new dialects to be added as needed. I think it’s safe to say that the csv file that translate from mssql to a target dialect are distinct-by-dialects within the file, so it seems simple enough to just split the csv into dialcect-specific translation rules, and the core of sql render just accepts a dialect specification as an input and it will perform the appropriate translation. If there is a new dialect, then sql render will accept the dialect translation specification (via the ‘hook’) and one dialect doesn’t need to know about any other.

There are a couple of concerns:

sql render is more than a translation csv file: there are places in the code with IF…THEN…ELSE statements that handle dialect-specific behavior (I believe). So, a certain amount of abstraction/scaffolding will need to be defined to accomodate those dialect-specific behaviors.
if we go the route of BYOT (bring your own translator), then we will have varied support of the different dialects because they will be maintained by the org that produces the translator. I don’t imagine netezza will have the same level of support as redshift, and this isn’t a statement about value, it’s just a statement about resources. For this reason, we take additions to SqlRender dialects very seriously because we want to treat each supported dialect with the same level of care. This is why you might have heard the suggestion that if we are going to support new dialects, maybe we should retire some others so that the overall maintenance burden is the same.

-Chris

stevepatterson · September 14, 2022, 3:26pm

What is the current status of Snowflake DDL and Support for OMOP?

jmethot · September 14, 2022, 3:49pm

In lieu of a more “official” answer, see this October 18 Snowflake webinar:

Webinar runs noon-3:30pm EST but includes this session:

1:30 PM: Observational Medical Outcomes Partnership (OMOP) & Snowflake

mikecjohn · October 11, 2022, 12:20am

Here is what I have found so far. I’ll keep searching.

SqlRender 1.10.0 has been released. Most important changes are added support for Snowflake, Synapse, and DuckDB.

DatabaseConnector 5.1.0 has been released. Most important changes are support for Snowflake, and updated RedShift drivers.

cc: @stevepatterson @Adam_Black @roger.carlson

mikecjohn · November 9, 2022, 5:05pm

Here is the official announcement.
If your organization does not have a Snowflake account and you are interested in access to a Sandbox for testing, there is a link to sign up for that when it becomes available.