Database platform support revisited

Hi @schuemie,

First of all - thank you for your help last week, definitely appreciated. Last week was certainly an interesting one, and it would be worth reflecting on what worked, what did not, and what we can do to improve.

Of the two options you highlighted, I would support both:

  • support as many platforms as possible, AND
  • improve the functionality and reliability of our software

I simply believe that if OHDSI wants to be inclusive - and not exclusive - and increase data partner participation, we need to support a wider range of databases that are convenient for its core members. However, I do agree we should do better next time, and personally I would focus on three areas:

  • Better testing
  • Better execution
  • Better visibility into work done across components, and closer collaboration and planning among component owners

We provide support for BigQuery, and our first experience was not exactly what I would call a “smooth run” :(…

However - as a result of working through the COVID-19 study last week - I see a number of gaps in our process, including in our own understanding of the wider process. For example, we do intensive testing of ATLAS on BigQuery, but a few issues surfaced outside of ATLAS, related to the R packages, especially when trying to use those packages on new types and sizes of data.

Also, quite a few of the issues we were fighting were defects in R code, R environment incompatibilities, etc.

And yes, we did discover some unexpected BigQuery limits when trying to execute the packages - we will be working with Google on understanding these better.

So, I think we need to discuss how we can do better in these areas:

  • If we knew about activities in other packages ahead of time, we could test them and make sure BigQuery (or any other database we support) works. For example, it would be wonderful to have a wider monthly OHDSI ecosystem planning meeting where we discuss these activities, assess impact, and plan cross-package work.

  • There is a lot of history in the code of the R packages - SqlRender being one example. Yes, we experienced some hiccups due to not clearly understanding certain design intentions - oracleTempSchema is one (see the sketch after this list). I am not sure what the best solution is here, but I feel that closer community collaboration would help us avoid misunderstanding and misusing these.

  • I believe we need to finally standardize our execution environment. As an idea, we could use the ARACHNE Data Node - which includes a clean R execution environment with packaged OHDSI libraries - both to test packages while they are under development and to execute them across data providers. This way we do not have to fight R and other environment issues when a developer builds against one version but the study is executed on another (see the version-pinning sketch after this list).

  • We need to come up with better test data set(s). Not all of us have direct access to a wide range of data, and SynPUF does not really help here.

  • In ATLAS, we can generate PLE and PLP study skeletons, which at least gives us a consistent code base, but there is no way to do the same for the other types of analyses. If R is still how we want to execute studies going forward, I would propose looking at how ATLAS could generate R code for all the analysis types it supports. Where we have ATLAS, we should promote using it; where we do not, it would at least give us a standard code base.
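
On the oracleTempSchema point above, here is only a minimal sketch of how I understand the design intent (the SQL, the schema names, and even the exact parameter name may differ across SqlRender versions; all names below are made up for illustration). The point is that the setting is about temp-table emulation on platforms like Oracle and BigQuery, not anything Oracle-specific:

```r
library(SqlRender)

# OHDSI SQL with a parameter and a temp table; names are illustrative only.
sql <- "SELECT person_id INTO #study_cohort FROM @cdm_schema.person;"
sql <- render(sql, cdm_schema = "my_cdm")

# On platforms without true temp tables (Oracle, BigQuery, ...),
# translate() emulates them as real tables in a scratch schema.
# oracleTempSchema names that scratch schema - it is not Oracle-only.
translate(sql,
          targetDialect = "bigquery",
          oracleTempSchema = "my_scratch_schema")
```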
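
And on standardizing the execution environment: independent of the ARACHNE Data Node (which ships its own packaged environment), here is a minimal sketch of the version-pinning idea using renv - the package choice is just an assumption for illustration, not how ARACHNE itself works:

```r
# Sketch only: pin the exact R package versions a study was developed
# against, so every data partner executes with the same versions.
install.packages("renv")

renv::init()      # create a project-local library for the study code
renv::snapshot()  # record exact package versions in renv.lock

# At the data partner site, restore the recorded versions before running:
renv::restore()
```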

So, yes - I think we can and will do better next time, but I believe we should focus on understanding where the gaps were and on preventing these issues from recurring by putting more effort into improving our joint testing, coordination, and execution. Let’s discuss this in our next steering call.