
Strategus design discussion

Thanks for the feedback, Christian.

you may show real-life examples of result schemas and their versions and how they differ and break.

This is a good point and will be helpful for developers implementing the package.

For example, you could change the result schema for an incidence report from generic to specifying the time interval, e.g. monthly. That breaks the format.

In this context it is up to the developer to make sure conversions are compatible and design any data transformations. This will always be a limitation and requires careful consideration when making design choices.

Or, the incidence calculator changes the way it handles the numerator or denominator

If there are fundamental changes to the output of the source package, this does create problems that may not be possible to overcome. I fully expect there will still be situations where results in a given format are no longer compatible. The only way to mitigate this is to make careful design choices such that calculations are performed from data that is never removed.

However, it is assumed that this package will never have access to the original source patient-level data. In this context I will try to think of ways to declare that a given results model is deprecated and can never be upgraded. This would still be a big improvement: in CohortDiagnostics, for example, it is simply assumed that previous versions are no longer compatible, even though they may still work.
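To sketch what "deprecated and can never be upgraded" could look like in practice, here is a minimal, hypothetical R helper (the function name and the deprecation list are invented for illustration; this is not part of ResultModelManager):

```r
# Hypothetical helper: a results model version is upgradable only if it
# is not on an explicit deprecation list and does not come after the
# target version. Version strings are compared with base R's
# package_version().
isUpgradable <- function(fromVersion, toVersion,
                         deprecated = c("0.1.0", "0.2.0")) {
  if (fromVersion %in% deprecated) {
    return(FALSE)  # schema predates migration support; never upgradable
  }
  package_version(fromVersion) <= package_version(toVersion)
}

isUpgradable("0.1.0", "1.0.0")  # FALSE: deprecated, can never be upgraded
isUpgradable("0.5.0", "1.0.0")  # TRUE: a migration path can be attempted
```

The point of the explicit list is that a consumer of old results gets a clear "this can never be upgraded" answer instead of a silent failure.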

Who is “azimov”, or why is this not part of the Strategus overall application?

Azimov is my personal GitHub… I didn’t want to create an OHDSI package until the initial specifications had been created.

(really Christian’s odious grammar pickiness: plural of composite nouns is formed at the end. So, it should really be called “ResultModelManager”, even if there are many results)

This is correct; the final OHDSI package will fix this (if we keep the same name).

Thank you all for the synchronous discussion today and for the posts on this thread. I found it helpful to have a chance to discuss the design, debate some of the tradeoffs that are being made, and listen to the perspectives of others in the community. I’ll paraphrase one of my favorite exchanges.

“OHDSI should define the interfaces and be supportive of any implementation” (i.e. Just as it does not matter what software you used to map source data to the CDM it shouldn’t really matter what software you used to map CDM data to results as long as it conforms to some set of standards.)

“Good interfaces are developed in tandem with their implementation so OHDSI should be creating the official implementation.”

I agree with both and I think those perspectives highlight an important tension in OHDSI.

I don’t want to stand in the way of Strategus development and will wholeheartedly support the design decisions of the other developers if there is general consensus. However, I do feel strongly that Strategus should be one way of implementing an OHDSI network study and not the only way to implement an OHDSI network study.

There were two definitions that I tried to draw attention to today:

An execution environment is a virtual machine or docker container with R, Rtools, python, (possibly anaconda), Java installed and an R package cellar that contains the tar.gz files for all Hades R package dependencies.

An OHDSI distribution is a complete versioned set of OHDSI software needed to run OHDSI studies.

Thanks for a great discussion. I’ll keep working on a spec and try to capture the requirements discussed today.

Thanks @Adam_Black ! I too found it a good discussion, with lots of food for thought. To be clear: I do understand your concerns about the extra layer of complexity we’re introducing. I just don’t understand what the alternative is.

Something I think is important to mention is that the current approach to modularization isn’t just about standardizing the input. I agree we could have a more generic approach to creating inputs, maybe even having a 1-to-1 relationship between the JSON and R function calls (which is what we already have in many modules). But it is also about standardizing the output, and specifically about generating output that fits in a relational database. The output of most R functions does not fit in a relational database. So Strategus isn’t just about orchestrating the execution of various R functions in HADES packages, it is also about building a coherent and consistent evidence base from the outputs.
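As a toy illustration of that last point (all column names are invented, not the actual Strategus result model): the nested output of an R analysis gets flattened into a plain table that can live in a relational database and travel as a CSV file.

```r
# Toy example: flatten a nested R result into a relational table.
# Column names are illustrative only, not the real Strategus schema.
fit <- list(estimate = 1.8, ci = list(lb = 1.2, ub = 2.7))

results <- data.frame(
  analysis_id = 1L,
  estimate    = fit$estimate,
  ci_95_lb    = fit$ci$lb,
  ci_95_ub    = fit$ci$ub
)

# CSV as the exchange format keeps results human-reviewable before sharing
path <- file.path(tempdir(), "cm_result.csv")
write.csv(results, path, row.names = FALSE)
```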


I think output standardization will be an important contribution. Could this be thought of as an OHDSI results common data model? I’m somewhat aware of the common evidence model but don’t know much of the details.

My primary objection to the Strategus design is that it allows for conflicting dependencies within a single study, which is an unnecessary complexity. The alternative to one renv lock file per module (Option 1) is to use both Options 2 and 3. I’ll try to explain.

Suppose today we deem the current versions of all Hades packages to be Hades v1.0. We set up RStudio Package Manager and load it with all the packages needed to use all of Hades. We create a static/frozen url that will forever point to the complete set of packages included in Hades v1.0 and all the required dependent packages.

If all our packages and dependencies were in CRAN (which would be ideal in my opinion) we could use this URL https://packagemanager.rstudio.com/all/2022-08-09+Y3JhbiwyOjQ1MjYyMTU7RERFQkQ0Qjg
(@Chris_Knoll Just treat the date in the URL as an arbitrary character string :slight_smile: ) If we can’t get them on CRAN I’d suggest setting up an OHDSI instance of RStudio Package Manager.

When you create a study you have the option of building your study off of the frozen/static repository URL or the “latest” repository URL.

Open sites with access to the package manager should be able to run any study built off of the “latest” repository URL. Closed sites can only run studies built off of one of the previous “frozen” URLs.

When a closed site “installs” OHDSI, it creates an execution environment, which could be a Docker container or Linux VM, containing, among other things, a package cellar. The cellar contains the tarballs of all versions of all R packages in the package manager up to the latest distribution and thus acts as a local copy of the entire package manager.
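For what it’s worth, renv already has a hook for exactly this kind of cellar; a minimal sketch, assuming a hypothetical cellar path:

```r
# Sketch: pointing renv at a local package cellar (the path is
# hypothetical). When RENV_PATHS_CELLAR is set, renv::restore() looks
# for tarballs such as <cellar>/CohortMethod_4.1.0.tar.gz before trying
# any remote repository, so a closed site can restore a study's lock
# file entirely offline.
Sys.setenv(RENV_PATHS_CELLAR = "/opt/ohdsi/cellar")
```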

My suggestion would be to make use of ExecutionEngine, which provides a REST endpoint for R or SQL code to be executed inside the execution environment. Regardless, the study execution consists of just three steps:

  1. Create a new R session and call renv::restore with the study lock file
  2. Run the R code (Could be any R code)
  3. Shutdown the R session
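Those three steps could be sketched roughly as follows, using `callr` to get a fresh R session (the directory layout and script name are hypothetical):

```r
# Sketch of the three-step execution, assuming the study folder contains
# an renv.lock file and a Study.R script (names are hypothetical).
runStudy <- function(studyDir) {
  callr::r(                      # step 1: spawn a brand-new R session
    function(dir) {
      # restore the exact package versions pinned in the study lock file
      renv::restore(lockfile = file.path(dir, "renv.lock"), prompt = FALSE)
      # step 2: run the study code (could be any R code)
      source(file.path(dir, "Study.R"))
    },
    args = list(dir = studyDir)
  )
  # step 3: the child session shuts down when callr::r() returns
}
```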

An outstanding question in my mind is whether you are planning to integrate Strategus into Atlas/WebAPI or to create a new UI.

I think the most important contribution of Strategus will be the standardization of outputs, which everyone who wants to implement a study can use as guideposts, analogous to the role of the CDM spec in ETL.

More generally, maybe there is an opportunity to define the minimum requirements of an OHDSI study that are agnostic to implementation and allow us to compute on study specifications and results. Harold Lehmann presented some work on this at a previous Hades call.

I’d recommend against this: it leads to what is called a ‘leaky abstraction’. I don’t think we’d want to see JSON properties that are associated with calls to getDbCohortMethodData() or createStudyPopulation(). These are implementation-specific details, and by standardizing only on the inputs/outputs of an overall unit of work, you shield users from the lower-level details, allowing you to change the implementation without breaking the standard (change as in: you can refactor code and clean up your methods without exposing those changes to the JSON specification).

The reason why Strategus modules are so closely mapped to the JSON specifications we’ve built so far is that the responsibility of the module is to map the JSON payload to the set of function calls that are needed from the underlying implementation. Circe is a good example of this in action: while the only public implementation is a spec-to-SQL implementation, there’s no reason why a spec-to-document-store implementation couldn’t translate the cohort expression specs into document-store queries. These Strategus modules are currently spec-to-HADES-packages, but there’s no reason why they couldn’t be implemented as spec-to-SAS-packages. I don’t see much value in HADES trying to find alternative implementations for these specifications, but it does introduce the possibility for other communities to engage with (compete with?) HADES implementations.

I don’t think we can be that ignorant about how this is functioning: being a time-based value (the RStudio Package Manager devs even describe it as a ‘snapshot’), the progression between releases is linear. How would you support the following:

                 o -- Hades 1.1--- Hades 1.2             o - Hades 3.1
                /                                       /
Hades 1.0 ---- o ---- .Hades 2.0 -- o ---- Hades 3.0 - o --- Hades LATEST
                                     \
                                      o --- Hades 2.1 --- Hades 2.2

In case you’re wondering, this is based off the release branching of WebAPI. Hades 1.2 may have been released after 2.0: there may be some need to stick to the 1.x line, and it was released after 2.0 was complete for those who were unable to shift to the 2.x line for some reason (hint: andromeda, hint: study packages changing to renv). renv lock files solve this problem because an renv lock file contains the specific versions of things, regardless of the time interval in which they were released. The primary reason I dislike CRAN’s timestamp-based approach is that we have no control over what’s in CRAN at a given date, so we’re forced to pick up all other version updates when we just want to grab a new one that was released on date X. If OHDSI owned its own RStudio Package Manager instance, we could control all the versions of the libraries, but I still don’t see how we could do ‘side-releases’ (like HADES 1.2 alongside HADES 2.1).

Thanks again for the great conversation, and I hope we can come to a decision that everyone will appreciate.


What I’m proposing would not support this (as a feature, not a bug). It would only support monotonic versioning for Hades R packages. I feel like this is trying to make R work like Java (renv.lock ~ pom.xml). Non-monotonic versioning is a complexity we don’t need to introduce. In your example there are three different versions of Hades (~24+ R packages) being actively maintained, so the current maintenance burden is multiplied by three. Do we really need that?

I’ve been following this discussion closely. Thank you for having it “out in the open”.

I’m not an R user or expert. I am a leader in an Informatics department at a healthcare institution that is just embarking on the OHDSI journey. I hope in a year or two our department is being called on to support numerous investigators participating in OHDSI network studies in oncology.

From that perspective this discussion seems to assume network participants will manage the byzantine R ecosystem including R versions, CRAN libraries and access, etc. I realize one goal of this discussion is to simplify that.

@Adam_Black wrote:

An execution environment is a virtual machine or docker container with R, Rtools, python, (possibly anaconda), Java installed and an R package cellar that contains the tar.gz files for all Hades R package dependencies.

I’m inferring a comma before the “and” and that the expectation is the local site will configure the proper R environment. If the intent is that the docker image contains the study-specific R environment then I’m happy and you can stop reading.

If not, I propose it should. The ultimate simplification is to package both the execution environment and the specific R environment into a single docker image per network study (which obviously might be versioned). The docker image would be parameterized with the CDM connection details and whatever else is necessary to execute against the local CDM and store/return results. The Strategus project would create packaging tools (or at least instructions) for study designers to produce the docker image, and a docker repository to manage and serve the study images.
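A rough sketch of what that parameterization could look like from inside the image (the environment variable names are invented, not any existing standard); the resulting list would ultimately feed something like `DatabaseConnector::createConnectionDetails()`:

```r
# Sketch: read CDM connection details passed into the container as
# environment variables at `docker run` time. Variable names and
# defaults are hypothetical, for illustration only.
cdmSettings <- list(
  dbms     = Sys.getenv("CDM_DBMS",     "postgresql"),
  server   = Sys.getenv("CDM_SERVER",   "localhost/cdm"),
  user     = Sys.getenv("CDM_USER",     "ohdsi"),
  password = Sys.getenv("CDM_PASSWORD", "")
)
# These values would then be handed to DatabaseConnector to open the
# connection and to direct where results are stored/returned.
```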

This would permit unlimited network studies to proceed in parallel without anyone but the study designer(s) thinking about R versions and libraries, including HADES. It has the added benefit of permitting non-R objects in the network study docker image, such as shell or Python scripts, reference data, etc.

One potential obstacle might be R licensing: network sites might be executing R from within an image but without a local license. Is this moot since R itself, and most R libraries, are open source? RStudio is separately licensed but doesn’t seem necessary for scripted study package execution.


Thanks for participating in the discussion @jmethot!

There are different ideas in the community about how to use Docker. Odysseus maintains an R execution environment Docker image freely available to anyone with all the components I described except for the package cellar (because we are still trying to get consensus on what should be in a Hades Distribution). Docker Hub

Some people have experimented with study specific Docker images. Here is an example.

Some organizations can’t use Docker at all for security reasons. Some organizations don’t want to be downloading a new Docker image for every study but would install and update docker images once per release cycle. Other organizations and people prefer the entire study to be Dockerized. That is my understanding of the current state of affairs around Docker use. If we get consensus on a standard for the execution environment I think the community would support multiple implementations using docker, or cloud formation templates, or VMs that would make it easy for sites to install and use.

Although I don’t know how relevant it is to the current debate, please let me share the recent challenges I’ve had.

The HIRA (sort of the CMS of Korea) recently announced that it would let researchers use HIRA’s data for COVID-19 research. For this project, we had to use HIRA’s infrastructure based on Cloudera Data Science Workbench (CDSW) without internet access.

We had challenges below:

  1. The current CDSW in HIRA provides only R 3.5.1 without internet access, and it’s not possible to upgrade R (current HADES basically does not support R version 3; later, HIRA’s CDSW will be upgraded to support R version 4).
  2. We can build a custom Docker image for this project in the CDSW environment (Create a Dockerfile for the New Custom Image), but HIRA doesn’t want to use a new Docker image for every study. HIRA will install only one Docker image for the whole project.

So Chungsoo Kim made the Docker image like below:

We had to find the versions compatible with R 3.5.1 in the OHDSI ecosystem. On top of this Docker image, each study package will be installed and executed.

I think that Strategus is not just about versioning. That’s why I mentioned:

Although I don’t know how relevant it is to the current debate

And I feel like Strategus can be complementary with Docker. I hope the recent challenges I’ve shared can be helpful for further debate.


Thanks all for a great discussion so far!

To hopefully help the discussion, I’d like to distinguish between 3 topics:

  1. Module inputs and outputs
  2. Execution environment specification level: module, study, HADES snapshot
  3. Nature of execution environment: renv lock file vs. Docker

1. Module inputs and outputs

I think inputs should be fully defined, so not something as generic as

  • Function name (e.g. `matchOnPs`)
  • Arguments (e.g. `maxRatio = 100`)

But fully detailing the valid keys and values, so whoever develops the editor doesn’t need deep knowledge of the HADES packages. The input can be a complex tree structure. I created an example JSON Schema here, which I think we can extend to all modules.
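To make that concrete, here is a hypothetical fragment of such a fully specified input (the keys are invented for illustration and are not the actual schema), parsed with `jsonlite`:

```r
# Hypothetical module input: every valid key and value is spelled out in
# the specification, rather than exposing raw function/argument pairs.
# Key names below are illustrative only, not the real Strategus schema.
spec <- jsonlite::fromJSON('{
  "module": "CohortMethodModule",
  "settings": {
    "psMatching": { "maxRatio": 100, "caliper": 0.2 }
  }
}', simplifyVector = FALSE)

spec$settings$psMatching$maxRatio  # 100
```

An editor (or a JSON Schema validator) can then check every key and value against the schema without any knowledge of the underlying HADES functions.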

The outputs should fit in a relational database, but the exchange format is CSV files, because those are human-reviewable to make sure no sensitive data is being shared. An example data output format is here.

One thing we haven’t discussed much is dependencies between modules. For example, most modules require cohorts to be generated first. We currently capture this in the module meta-data, e.g. here, which I prefer over capturing it in the input specifications (where users can make errors).

2. Execution environment specification level

We currently have this implemented at the module level: each module carries an renv.lock file. This provides a lot of flexibility while also providing isolation (I think someone used the term ‘separation of concerns’, which I like). The downside is increased complexity in another way, as @adam_black pointed out, because we introduce a new type of module into the R world.

I’m starting to warm up to @adam_black’s idea of a hybrid between study level and HADES snapshots: Most studies would use specific HADES snapshots (e.g. created twice per year), but if someone needed the latest version of a module you could include a study-specific renv lock file, and sites could decide whether they want to run this or not. This has the advantage that we know for sure the modules work together nicely. It has the disadvantage that it does not have the aforementioned isolation. It also provides less flexibility: if you want to use a new version of one module, you are forced to also use the latest version of other modules in your study. I’m not sure that is necessarily a bad thing.

3. Nature of the execution environment

We’re currently focused on renv lock files to specify the execution environment. These are not ideal, as they cannot enforce the R version or capture other dependencies such as Java and Python. They do have the advantage that they work inside R, so they are much easier to deploy than Docker images. My non-representative survey showed 11% of OHDSI sites cannot use Docker. (My site is one of those, so I’ve never had the pleasure of toying with Docker, which may bias my perception.) A Docker image per study seems problematic because of security reasons and the sheer size of these containers (I think), but @adam_black’s hybrid approach may alleviate those concerns.


Regarding the execution environment specification level:

I set up RStudio Package Manager and loaded Hades and all Hades dependencies (~175 packages total) into it.

This URL describes how to use it.
http://159.223.131.237:4242/client/#/repos/1/overview

It’s a free trial for about a month to see if people in the community find it useful. The Hades packages are automatically updated. The CRAN packages are frozen and updated manually (or could be updated on a schedule using cron).

The goal here is to create a Hades distribution and make it easy to install the entire set of Hades packages and dependencies included in a distribution. It is possible to add pre-compiled binary Hades packages to the repo to speed up installation, but I have not done that yet.


Hi Adam, is it like MRAN (the CRAN time machine) for HADES packages? That’s very cool.

I usually used Docker to install Hades packages across various R versions (3.x/4.x) and internet environments (online/offline), but due to the flexible R ecosystem and numerous dependencies, it was difficult to freeze past packages. This will definitely help.

Cool!

After yet another bad experience with CRAN (I’m unable to get EvidenceSynthesis back on CRAN; it was removed because Cyclops was removed, because a new version of Rcpp has a memory leak on some esoteric operating system), I’m wondering if we should switch to this solution instead.

What would be the cost if we wanted to keep running this?

Hi @Adam_Black I sent you login credentials for the existing OHDSI repo.ohdsi.org Nexus repo, which supports caching R packages, as we previously discussed. Did you have a chance to try it out?

How does the paid RStudio Package Manager solution compare in terms of functionality with the free OHDSI Nexus repo functionality? Are there significant advantages?

Here is a link to the Nexus repo docs on support for R packages

If this mechanism of hosting R packages can support multiple versions in the same R package manager, then that would be a huge benefit…

The cost for a single package repository is $5,000/yr (euro and dollar are about equal), plus the server hosting, which is about $12/month on Digital Ocean and includes 2 TB of data transfer per month, which seems like plenty. I think it would simplify the builds of Odysseus’ execution engine Docker container and @lee_evans’s Broadsea-hades container, and would be helpful for the use of Hades in Darwin. That’s to say, I see many stakeholders who would benefit. There could be a significant licensing discount if this were used for teaching (Roux) or academic research, or purchased by a non-profit org.

I think the OHDSI R packages need to be in some repo, and it would be nice to stabilize the full set so that if we are “installing OHDSI/Hades” at a new offline site we know exactly what to install. `devtools::install_github("ohdsi/Hades")` has never worked very well for me and will install the latest version of all packages, so each person gets a different install depending on when they run it.

Yes it is like a time machine where we can create static snapshots of all Hades packages and their dependencies just like MRAN.

Yea, it seems like Hades will never be on CRAN at this point. The number of packages is growing. More packages depend on CirceR, which contains ~44 MB of compiled Java code, way too big for CRAN but fine for this package manager.

Thanks Lee! I have not had a chance to try out nexus. That would be one free alternative to consider. The advantages of the RStudio package manager are:

  • Easy updates of subsets of CRAN and GitHub. GitHub is 100% automated, and CRAN updates can be scheduled or done manually.
  • Create static URLs that will always point to the same set of packages (a distribution = a static URL)
  • Host pre-compiled binary versions of packages, making installs significantly faster on Mac, Windows, and Linux. (I have not set this up for GitHub packages yet.)
  • Download statistics (Nexus might provide this too, but I’m not sure)
  • Automatic discovery of all the system requirements for the Hades stack (see Install System Prerequisites at http://159.223.131.237:4242/client/#/repos/1/overview)

Yes, it definitely supports multiple versions of each package. You can request a specific version of a single package:

options(repos = c(ohdsi = "http://159.223.131.237:4242/ohdsi/__linux__/jammy/latest"))
remotes::install_version("CohortMethod", "4.1.0")

I’m still experimenting with it and I encourage anyone in the community to try it as well.

Sooooo… not to sidetrack this convo, I’d just like to touch on that example (so that it might not be used against me in the future): the 44 MB comes from a JavaScript emulator that’s referenced in order to do some kind of npm package execution inside the JVM. I’d like to completely remove this behavior from CirceR so that we can rid ourselves of that dependency.

I return you now to your regularly scheduled Strategus design discussion…


Hey @Adam_Black ,

I believe that HADES:PLP uses some Python modules.
RStudio Package Manager also keeps track of Python modules.
Can this feature be used to also keep track of the Python modules used in HADES:PLP?
Or is this not necessary?

(We are also considering RStudio Package Manager for our organisation, and are trying to understand how it handles Python modules.)

I have made some large steps with a ResultModelManager package here and would appreciate more eyes to review this:

For more information on this package I wrote a short project spec here Project specifications · OHDSI/ResultModelManager Wiki · GitHub

There is also some documentation, but I’m updating that today to provide more concrete examples. I have also made this PR (Database migration utilty by azimov · Pull Request #853 · OHDSI/CohortDiagnostics · GitHub) in CohortDiagnostics that gives a fully working example of handling migrations on SQLite and PostgreSQL (with some platform-specific SQLite migrations).

One thing to note is that, for Strategus modules, support for flat files is included. I am somewhat hesitant about this approach, however, as I don’t feel it will provide a great solution for integration testing on different platforms.

I don’t want to derail this thread too much, so please try to post any issues or design changes on the ResultModelManager tracker on GitHub.

Hi Javier,

The package manager can host Python packages in a repo separate from the R packages (so multiple repos are required). This would be helpful if you wanted an isolated package manager instance inside your organization, but I don’t think it is necessary for a public Hades repo because I think all the Python packages are on pip. The issue is that not all Hades R packages are on CRAN or in any other publicly available package manager.

I think the demo I set up is expiring soon. If anyone wants to fund it, I’d provide the admin support if needed. On the other hand, there might be free alternatives to try out, like Nexus.
