@schuemie : I think you are raising an important topic, one we’ve discussed several times and which is worth re-evaluating on a regular basis.
We are trying to do something extraordinarily ambitious here in OHDSI: create a community of developers who can collaborate to build an open-source analytical ecosystem that can be deployed across a range of different healthcare environments by a diversity of stakeholder groups with disparate IT/informatics/statistical skills, and ensure that the ecosystem is sufficiently capable and performant that the observational data in those environments can be used to design and execute characterization, estimation, and prediction studies which generate reliable evidence that promotes better health decisions and better care. And if that wasn’t already hard enough, we increased the degree of difficulty by another order of magnitude by also trying to accommodate multiple different technology stacks, which have varying capabilities/strengths/limitations and which are annoyingly inconsistent in how they process ‘standard’ syntax like SQL. And if THAT wasn’t hard enough, we’ve tied one arm behind our back by encouraging community development while not providing the community a centralized environment for testing across all of the tech stacks we would like to support. So we find ourselves in a constant game of whack-a-mole, where a fix someone contributes that works in one environment may induce a break in another. I agree the approach we are taking now as a community is sub-optimal and not sustainable. Just as our community is maturing in the scientific best practices we employ to collaboratively generate reliable evidence, it needs to mature in its collaborative approach to open-source development.
This isn’t about the individual contributions from any one person or organization; I greatly appreciate the support that everyone is trying to provide. This is about how we move forward as one community, with a shared approach to development, testing, and deployment, so that together we can advance OHDSI’s mission as efficiently as possible.
As I see it, the question isn’t whether we support a small or large number of database platforms; the question is HOW we support database platforms as a community.
I see two minimum requirements for a database platform to be declared ‘supported’ within the OHDSI community:
- The platform provides the technical capability to perform a defined set of functions deemed necessary to conduct observational analyses and generate reliable evidence. Declaring that we support multiple platforms creates an explicit constraint: we are limited to the ‘lowest common denominator’ of platform functionality, so we need to set a standard for what the ‘lowest acceptable capability’ is. Going back a bit into some of our initial ecosystem design decisions: OHDSI is building a toolkit that uses SQL as the database extraction layer and R as the analytics layer (with JDBC providing the connection between the database environment and the R environment). Within SQL, there is a set of functions we determined to be essential to the types of analyses we are focused on in OHDSI: SELECT, CREATE, INSERT, DELETE, and UPDATE operations; aggregate functions like COUNT, SUM, MIN, MAX, and AVG; windowing functions like ROW_NUMBER() OVER (PARTITION BY x ORDER BY y); date-based arithmetic; the ability to create TEMP tables and alias them multiple times within a subquery; JOINing tables using inequalities; support for multiple schemas; etc. It was for this technical reason that we made the conscious decision early on not to support environments that didn’t meet these minimum standards: at the time, this was the basis for explicitly not supporting MySQL and SAS.
As a general premise, I would assert that we should never sacrifice our goal of reliable evidence generation, or limit our analytical capabilities beneath some minimum standard, just to accommodate a broader array of database platforms. At the same time, we should be flexible enough to recognize that we may determine in the future that our ‘minimum standard’ has to change, or that database platforms may become more feature-complete, so decisions about which platforms should and should not be eligible may change over time.
- The platform, and domain expertise about the platform, needs to be readily accessible to the developer community, so that the same battery of tests can be performed across all ‘supported platforms’ and validated to pass before analysis packages are released to the community. Ideally, and this is likely the behavior we need to adopt moving forward, every developer making an individual contribution, such as a bug fix or feature enhancement, would first test that contribution across the ‘supported platforms’ before making a pull request, so that all of the testing doesn’t fall to the package owner.
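To make the ‘minimum capability’ requirement concrete, here is a minimal sketch of what an automated capability probe could look like. This is purely illustrative: it uses Python and an in-memory SQLite database as a stand-in for a real target platform, the probe queries and names (`CAPABILITY_PROBES`, `probe_platform`, the `obs` table) are hypothetical, and the real OHDSI tooling works through R/JDBC (DatabaseConnector/SqlRender), not this script.

```python
import sqlite3

# Hypothetical probes for a few of the capabilities named above.
# SQLite is only a stand-in here; the queries are illustrations,
# not actual OHDSI test cases.
CAPABILITY_PROBES = {
    "windowing": """
        SELECT person_id,
               ROW_NUMBER() OVER (PARTITION BY person_id
                                  ORDER BY start_date) AS rn
        FROM obs
    """,
    "date_arithmetic": "SELECT date(start_date, '+30 days') FROM obs",
    "inequality_join": """
        SELECT a.person_id
        FROM obs a JOIN obs b
          ON a.person_id = b.person_id AND a.start_date < b.start_date
    """,
    "temp_table": "CREATE TEMP TABLE scratch AS SELECT * FROM obs",
}

def probe_platform(conn):
    """Return {capability: True/False} for one database connection."""
    results = {}
    for name, sql in CAPABILITY_PROBES.items():
        try:
            conn.execute(sql)
            results[name] = True
        except sqlite3.Error:
            results[name] = False
    return results

# Tiny toy table so the probes have something to run against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (person_id INT, start_date TEXT)")
conn.execute("INSERT INTO obs VALUES (1, '2020-01-01'), (1, '2020-02-01')")
report = probe_platform(conn)
print(report)
```

A platform that returns False for any required capability would simply not qualify as ‘supported’, which is the mechanical version of the MySQL/SAS decision described above.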
It’s #2 where I know we have had several discussions within the community over the past many months, but for which we probably need renewed focus to translate those ideas into action. @anthonysena @Frank @Chris_Knoll @gregk @lee_evans and others have been discussing an improved approach to testing in the context of ATLAS development, with a target to focus on this as we initiate development on ATLAS 3.0 after ATLAS 2.8 is released. We can likely decouple the discussion of ATLAS development progress from the broader community needs outlined below.
We need a centralized testing environment where instances of the ‘supported platforms’ can exist with some test data in OMOP CDM format. I know @lee_evans has made considerable progress on the design of this, and on the integration with Jenkins to allow testing across multiple platforms. We probably need to formalize this a bit more: define the platforms currently available (or soon to be added) and the instructions for using the test environment during development. Supporting a platform within the OHDSI infrastructure requires resources: beyond Lee’s time and effort, there are compute and platform licensing expenses. Here, I hope our friends at the various platform providers, such as AWS/Redshift, Google/BigQuery, and Microsoft, may be able to help us, as I suspect they share our desire to have their platforms in the centrally supported set within the OHDSI ecosystem.
We need to improve our practices in writing test cases to provide greater code coverage, so that each new feature is accompanied by a series of tests ensuring it works as desired. We likely also need to go back and pay off some of the technical debt from not having sufficient test coverage of the existing set of tools. I know @jennareps and @Chris_Knoll have been hard at work improving test coverage in the components they are leading, and I think that can serve as a good model to emulate elsewhere.
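As a sketch of the “each new feature ships with its tests” habit, consider dialect translation, where platform inconsistency bites hardest. The toy `translate_dateadd` function and its rewrite rules below are hypothetical simplifications (the real OHDSI component is SqlRender, whose rules differ); the point is only the shape: one feature, one test per supported dialect, plus a test that unknown dialects are rejected rather than silently mistranslated.

```python
import unittest

# Toy stand-in for SQL dialect translation; these rewrite rules are
# simplified illustrations, not SqlRender's actual behavior.
def translate_dateadd(days, column, dialect):
    """Render 'add N days to a date column' for a given dialect."""
    if dialect == "sql server":
        return f"DATEADD(day, {days}, {column})"
    if dialect == "postgresql":
        return f"({column} + {days} * INTERVAL '1 day')"
    if dialect == "oracle":
        return f"({column} + {days})"
    raise ValueError(f"unsupported dialect: {dialect}")

class TestDateAddTranslation(unittest.TestCase):
    """One new feature, one test per supported dialect."""
    def test_sql_server(self):
        self.assertEqual(translate_dateadd(30, "start_date", "sql server"),
                         "DATEADD(day, 30, start_date)")
    def test_postgresql(self):
        self.assertEqual(translate_dateadd(30, "start_date", "postgresql"),
                         "(start_date + 30 * INTERVAL '1 day')")
    def test_oracle(self):
        self.assertEqual(translate_dateadd(30, "start_date", "oracle"),
                         "(start_date + 30)")
    def test_unknown_dialect_rejected(self):
        with self.assertRaises(ValueError):
            translate_dateadd(30, "start_date", "mysql")

# Run the suite programmatically so the result can be inspected.
result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestDateAddTranslation))
```

In the R packages the equivalent would be testthat tests; the discipline, not the framework, is the model worth emulating.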
We need people in the community committed to being named ‘platform expert support’ individuals, with sufficient platform domain expertise that when platform-specific issues arise, they can be quickly resolved without holding back progress for community members who are using other platforms. I don’t know who within our community wants to raise their hand to be our resident expert in SQL Server, Oracle, PostgreSQL, Redshift, BigQuery, Netezza, Spark, etc., but I would assert we need at least one name tied to each platform if we really want that platform to be sustainably supported. We can’t expect every developer in the community to be knowledgeable about all platforms (it’s hard enough for one developer to develop competency in just one platform!). If we have a test environment with all ‘supported platforms’ in place, and a robust set of test cases that can run across those platforms, then when an issue arises on only one platform, we should be able to collaborate with the ‘platform expert’ to find a platform-specific solution (and then re-test to ensure that any new solution doesn’t break the other platforms).
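The mechanics of “one shared battery, run everywhere, failures attributed per platform” can be sketched as below. Everything here is hypothetical scaffolding: both “platforms” are in-memory SQLite databases standing in for real instances (e.g. a PostgreSQL and a Redshift environment), and the names (`PLATFORMS`, `BATTERY`, `run_battery`) are invented for illustration. The useful property is that a platform-specific break shows up under exactly one key, which is the artifact you would hand to that platform’s named expert.

```python
import sqlite3

def make_test_db():
    """Build a tiny stand-in database with toy OMOP-style data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (person_id INT, year_of_birth INT)")
    conn.executemany("INSERT INTO person VALUES (?, ?)",
                     [(1, 1980), (2, 1990)])
    return conn

# Hypothetical registry of supported platforms; in reality these would
# be connections to distinct database systems, not two SQLite copies.
PLATFORMS = {
    "platform_a": make_test_db(),
    "platform_b": make_test_db(),
}

# One shared battery of checks; each returns True on pass.
BATTERY = {
    "row_count": lambda c: c.execute(
        "SELECT COUNT(*) FROM person").fetchone()[0] == 2,
    "aggregate": lambda c: c.execute(
        "SELECT MIN(year_of_birth) FROM person").fetchone()[0] == 1980,
}

def run_battery():
    """Map each platform to the list of checks that failed on it."""
    failures = {}
    for platform, conn in PLATFORMS.items():
        failures[platform] = [name for name, check in BATTERY.items()
                              if not check(conn)]
    return failures

failures = run_battery()
print(failures)
```

When a fix is proposed for the failing platform, re-running `run_battery()` is the cheap way to confirm the fix didn’t regress the others, which is exactly the whack-a-mole loop we are trying to close.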
More generally, as an OHDSI developer community, I think we need to align our approach to software development if we are going to be more explicit about ‘platform support’. Right now, it feels like we have one subgroup of developers focused on analytics development in R and a different subgroup of developers focused on web app development, but we really need to break down these silos and agree on basic principles for the process of software development, testing, and deployment. I know I started a discussion on the forums about this last year, but that discussion ended up being mainly about governance rather than implementation, and I accept responsibility for not driving it through from discussion to action more thoroughly. Most of what we need to sort out is truly technology-agnostic, but we need to agree as a community to adhere to guidelines and best practices for things like how we use GitHub, how we post issues and make pull requests, how we test new code, how we issue releases, etc. @schuemie drafted developer guidelines for the OHDSI Methods Library, and I think this could serve as a useful starting point for reaching consensus on the technical aspects of development.
I recognize there’s a real tension here between ‘doing the science’ and ‘doing the work to enable the science to be done’, and a clear urgency to do both as best we can. I also recognize that the OHDSI research community and data network are growing quite quickly, and the accelerating interest in using the OMOP CDM and OHDSI toolkit to generate real-world evidence is far outpacing the growth of our OHDSI developer community. We currently have a very limited set of resources (people and money) to support the maintenance and development of the OHDSI analytics ecosystem, so we have to be respectful of the time and energy that has gone into this foundational work and the efforts that will be required moving forward. If this pandemic has taught us anything thus far, it’s that we can be creative in finding more effective ways to work together: even more virtual, but even more collaborative, particularly when matters of public health are on the line. Since our shared goal is to improve health by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care, I think it’s our responsibility to find a path forward that builds on the success of what we have already created together and provides a sustainable approach to development for the future ahead.