Strategus design discussion

SCYou · August 10, 2022, 7:47pm

Although I don’t how it would be relevant with the current debate, please let me share the recent challenges I’ve had.

The HIRA (sort of CMS in Korea) recently announced to let researchers use HIRA’s data for COVID-19 research. For this project, we had to use HIRA’s infrastructure based on Cloudera Data Science Workbench (CDSW) without internet access.

We had challenges below:

Current CDSW in HIRA provides only R 3.5.1 without internet access. It’s not possible to upgrade R (Current HADES basically does not support R version 3; Later, the HIRA’s CDSW will be upgraded to support R version 4)
We can build custom Docker image for this project in CDSW environment(Create a Dockerfile for the New Custom Image), but HIRA doesn’t want to use a new Docker image for every study. The HIRA will install only one Docker image for the whole project.

So Chungsoo Kim made the Docker image like below:

github.com

dr-you-group/CHAPTER/blob/develop/Docker/HiraR3.5/Dockerfile

# Dockerfile

# Specify a Cloudera Data Science Workbench base image
FROM docker.repository.cloudera.com/cdsw/engine:8
RUN rm /etc/apt/sources.list.d/*

# install java develop kit and rJava
RUN apt-get update && \
    apt-get install -y default-jdk
RUN R CMD javareconf
RUN R -e "install.packages('rJava', dependencies = TRUE)"

# install devtools
RUN apt-get update
RUN apt-get install -y \
    build-essential \ 
    libcurl4-gnutls-dev \
    libxml2-dev \
    libssl-dev \
    libgit2-dev \

This file has been truncated. show original

We had to find the versions compatible with R 3.5.1 in OHDSI ecosystems. On top of this Docker image, each study package will be installed and executed.

I think that Strategus is not just about versioning. That’s why I mentioned:

Although I don’t how it would be relevant with the current debate

And I feel like Stategus can be complementary with Docker. I hope the recent challenges I’ve shown can be helpful for further debate.

schuemie · August 11, 2022, 8:37am

Thanks all for a great discussion so far!

To hopefully help the discussion, I’d like to distinguish between 3 topics:

Module in and outputs
Execution environment specification level: module, study, HADES snapshot
Nature of execution environment: renv lock file vs. Docker

1. Module in and outputs

I think inputs should be fully defined, so not something as generic as

Function name (e.g.‘matchOnPs`)
Arguments (eg.‘maxRatio = 100`)

But fully detailing the valid keys and values, so whoever develops the editor doesn’t need deep knowledge of the HADES packages. The input can be a complex tree structure. I created an example JSON Schema here, which I think we can extend to all modules.

The outputs should fit in a relational database, but the exchange format is CSV files, because those are human-reviewable to make sure no sensitive data is being shared. An example data output format is here.

One thing we haven’t discussed much is dependencies between modules. For example, most modules require cohorts to be generated first. We currently capture this in the module meta-data, e.g. here, which I prefer over capturing it in the input specifications (where users can make errors).

2. Execution environment specification level

We currently have this implemented at the module level: each module carries a renv.lock file. This provides a lot of flexibility while also providing isolation (I think someone used the term ‘separation of concerns’ which I like). The downside is increased complexity in another way, as @adam_black pointed out, where we introduce a new type of modules in the R world.

I’m starting to warm up to @adam_black’s idea of a hybrid between study level and HADES snapshots: Most studies would use specific HADES snapshots (e.g. created twice per year), but if someone needed the latest version of a module you could include a study-specific renv lock file, and sites could decide whether they want to run this or not. This has the advantage that we know for sure the modules work together nicely. It has the disadvantage that it does not have the aforementioned isolation. It also provides less flexibility: if you want to use a new version of one module, you are forced to also use the latest version of other modules in your study. I’m not sure that is necessarily a bad thing.

3. Nature of the execution environment

We’re currently focused on renv lock files to specify the execution environment. These are not ideal, as they do not capture the R version, or other dependencies such as Java and Python. They do have the advantage they work inside R, so are much easier to deploy than Docker images. My non-representative survey showed 11% of OHDSI sites cannot use Docker. (My site is one of those, so I’ve never had the pleasure of toying with Docker, which may bias my perception.) A Docker image per study seems problematic because of security reasons and the sheer size of these containers (I think), but @adam_black’s hybrid approach may alleviate those concerns.

Adam_Black · September 7, 2022, 12:28pm

Regarding the execution environment specification level:

I set up RStudio Package Manager and loaded Hades and all Hades dependencies (~175 packages total) into it.

This URL describes how to use it.
http://159.223.131.237:4242/client/#/repos/1/overview

It’s a free trial for about a month to see if people in the community find it useful. The Hades packages are automatically updated. The CRAN packages are frozen and updated manually (or could be updated on a schedule using cron).

The goal here is to create a Hades distribution and make it easy to install the of the entire set of Hades packages and dependencies included in a distribution. It is possible to can add pre-compiled binary Hades packages to the repo to speed up installation but have not done that.

Chungsoo_Kim · September 7, 2022, 12:49pm

Hi Adam, is it like MRAN (cran time machine) for HADES packages? that’s very cool.

I usually used docker to install Hades packages in various R versions (3.x/ 4.x) and Internet environments (online/offline).
but due to the flexible r ecosystem and numerous dependencies, it was difficult to freeze past packages. It will definitely help.

schuemie · September 7, 2022, 2:50pm

Cool!

After yet another bad experience with CRAN (I’m unable to get EvidenceSynthesis back in CRAN, which was removed because Cyclops was removed because a new version of RCPP has a memory leak on some esoteric operating system), I’m wondering if we should switch to this solution instead.

What would be the cost if we wanted to keep running this?

admin · September 7, 2022, 4:53pm

Hi @Adam_Black I sent you login credentials for the existing OHDSI repo.ohdsi.org Nexus repo, which supports caching R packages, as we previously discussed. Did you have a chance to try it out?

How does the paid RStudio Package Manager solution compare in terms of functionality with the free OHDSI Nexus repo functionality? Are there significant advantages?

Here is a link to the Nexus repo docs on support for R packages

Chris_Knoll · September 7, 2022, 6:06pm

If this mechanism of hosting R packages can support multiple versions in the same R package manager, then that would be a huge benefit…

Adam_Black · September 7, 2022, 7:04pm

The cost for a single package repository is $5,000/yr (Euro and dollar are about equal) plus the server hosting which is about $12/month on digital ocean and includes 2TB of data transfer per month which seems like plenty. I think it would simplify the builds of the Odysseus’ execution engine docker container and @lee_evans’s Broadsea-hades container and would be helpful for the use of Hades in Darwin. That’s to say that I see many stakeholders who would benefit. There could be a significant licensing discount for if this were used for teaching (Roux), or academic research, or purchased by a non-profit org.

I think the OHDSI R packages need to be in some repo and it would be nice to stabilize the full set so if we are “installing OHDSI/Hades” at a new offline site we know exactly what to install. `devtools::install_github(‘ohdsi/Hades’) has never worked very well for me and will give a different install the latest version of all packages so each person gets a different install depending on when they run it.

Yes it is like a time machine where we can create static snapshots of all Hades packages and their dependencies just like MRAN.

Yea it seems like Hades will not ever be on CRAN at this point. The number of packages is growing. More packages depend on CirceR which contains ~44MB of compiled Java code, way too big for CRAN but fine for this package manager.

Thanks Lee! I have not had a chance to try out nexus. That would be one free alternative to consider. The advantages of the RStudio package manager are:

Easy updates of subsets of cran and github. Github is 100% automated and cran updates can be scheduled or done manually.
Create static URLs that will always point to the same set of packages (A distribution = a static url)
Host pre-compiled binary versions of packages making installs significantly faster on Mac, Windows, & Linux. (I have not set this up for github packages yet.)
Download statistics (Nexus might provide this too but I’m not sure)
Automatic discovery of all the system requirements for the Hades stack (see Install System Prerequisites http://159.223.131.237:4242/client/#/repos/1/overview)

Yes it definitely does support multiple versions of each of the packages. You can request a specific version of one package.

options(repos = c(ohdsi = "http://159.223.131.237:4242/ohdsi/__linux__/jammy/latest"))
remotes::install_version("CohortMethod", "4.1.0")

I’m still experimenting with it and I encourage anyone in the community to try it as well.

Chris_Knoll · September 8, 2022, 8:20pm

Sooooo…not to sidetrack this convo, I’d just like to touch on that example (so that it might not be used against me in the future): the 44MB comes from a Javascript emulator that’s being referenced in order to do some kind of npm package execution inside the JVM. I’d like to completely remove this behavior from CirceR so that we can rid ourselves of that dependency.

I return you now to your regularly scheduled Strategus design discussion…

Javier · September 9, 2022, 6:44am

Hey @Adam_Black ,

I believe that HADES:PLP uses some python modules.
Rstudio package manager also keeps track of python modules.
Can this feature be use to also keep tack the python modules used in HADES:PLP ??
Or this is not necesary ?

(we are also considering Rstudio package manager fro our organisation, and trying to understand how it handels python modules)

jpegilbert · September 9, 2022, 4:41pm

I have made some large steps with a ResultModelManager package here and would appreciate more eyes to review this:

https://github.com/OHDSI/ResultModelManager

For more information on this package I wrote a short project spec here Project specifications · OHDSI/ResultModelManager Wiki · GitHub

There is also some documentation but I’m updating that today to provide more concrete examples. I have also made this PR (Database migration utilty by azimov · Pull Request #853 · OHDSI/CohortDiagnostics · GitHub) in CohortDiagnostics that gives a fully working example of handling migrations on sqlite and postgres (with some platform specific sqlite migrations)

One thing to note is that, for strategus modules, the support for flat files is included. I am somewhat hesitant about this approach, however, as I don’t feel it will give a great solution for integration testing on different platforms.

I don’t want to derail this thread too much so please try and post any issue or design changes on the ResultModelManager tracker on github.

Adam_Black · September 28, 2022, 4:24pm

Hi Javier,

The package manager can host python packages in a repo separate from R packages (so multiple repos are required). This would be helpful if you wanted an isolated package manager instance inside your organization but I don’t think it is necessary for a public Hades repo because I think all the python packages are in pip. This issue is that not all Hades R packages are on CRAN or in any other publicly available package manager.

I think the demo I set up is expiring soon. If anyone wants to fund it I’d provide the admin support if needed. On the other hand there might be free alternatives to try out like nexus.