
Using renv to handle R package dependencies

The current standard for distributing studies across the OHDSI network is through study R packages. These packages typically require other packages to be installed, both OHDSI packages and packages from CRAN.

One challenge we’re facing more and more is that these requirements may differ per study. Older studies may require an older version of a package, while newer studies need a newer version. Sometimes people are working on several studies at the same time, and need to switch from version to version. It is not always clear which versions are required to rerun a previously executed study, which is bad from a reproducibility perspective.

In the past, to at least record the dependencies, I introduced the OhdsiRTools::insertEnvironmentSnapshotInPackage function, which would add a CSV file documenting all versions installed at the time of the study execution. However, we had no easy way to restore an R environment from such a file.
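
For illustration, the essence of such a snapshot can be captured in a few lines of base R. This is just a minimal sketch of the idea, not the actual OhdsiRTools implementation, and the file name is only an example:

# Record the name and version of every package installed in the current library:
snapshot <- as.data.frame(installed.packages()[, c("Package", "Version")],
                          stringsAsFactors = FALSE)

# Write the snapshot to a CSV file (illustrative file name):
write.csv(snapshot, "environmentSnapshot.csv", row.names = FALSE)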

I therefore also experimented with packrat, a technology that looked promising, but that I never got working (on Windows I was never able to get a set of source packages that could be built without error).

An alternative that is still being explored is Docker, a technology that allows taking a snapshot of the entire working environment, including the operating system. I have not been able to run Docker images myself, but perhaps this will be our solution in the future.

In the meantime, we (me, @Chris_Knoll, and @msuchard) discovered a new technology that looks very promising: renv is the sequel to packrat. It has some important features:

  • Each study package can have its own R library. So different studies can be active simultaneously, each with different dependencies, and you can quickly switch between them without the need to reinstall the dependencies.

  • An R library can be rebuilt from scratch based on a so-called ‘lock file’. This file describes all the packages that must be installed, which precise versions, and where to install them from.

  • Most of the packages are installed from binaries, not from source, thus avoiding the issues I observed with packrat. It is also quite fast (rebuilding an environment using packrat could take many hours).

Although renv has its own functionality for creating lock files, I found it more convenient to write my own using a function I added to OhdsiRTools. The lock files are not that complicated, as you can see in the example file I generated.
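
To give a flavor of the format, here is an abridged excerpt following renv’s lock file schema (the package names and version numbers below are made up for this illustration):

{
  "R": {
    "Version": "3.6.3",
    "Repositories": [
      { "Name": "CRAN", "URL": "https://cloud.r-project.org" }
    ]
  },
  "Packages": {
    "SqlRender": {
      "Package": "SqlRender",
      "Version": "1.6.6",
      "Source": "Repository",
      "Repository": "CRAN"
    },
    "CohortDiagnostics": {
      "Package": "CohortDiagnostics",
      "Version": "1.0.0",
      "Source": "GitHub",
      "RemoteType": "github",
      "RemoteUsername": "OHDSI",
      "RemoteRepo": "CohortDiagnostics",
      "RemoteRef": "v1.0.0"
    }
  }
}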

I found that restoring the environment including the study package itself can be done by simply downloading the lock file. Here’s how one could install the Covid19CohortEvaluation study package including all of its dependencies:

# Install the latest version of renv:
install.packages("renv")

# Start a new project in RStudio (or when not using RStudio, create a new folder and 
# set it as the current working directory). When asked if you want to use renv with the 
# project, answer ‘no’.

# Download the lock file:
download.file("https://raw.githubusercontent.com/ohdsi-studies/Covid19CohortEvaluation/renv/renv.lock", "renv.lock")
  
# Build the local library:
renv::init()
  
# When not in RStudio, you'll need to restart R now

# And you’re done! The study package can now be loaded and used:
library(Covid19CohortEvaluation)
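
One follow-up note: should the local library ever get out of sync with the lock file (for example, after accidentally updating a package), it can be rebuilt at any time:

# Reinstall exactly the versions recorded in renv.lock:
renv::restore()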

I propose we test this approach on a new OHDSI study, to see how well it works.

Thoughts?


Wow, that looks almost too good to be true! Thanks @schuemie @msuchard and @Chris_Knoll for identifying this candidate solution.

I definitely agree we should test out this approach and, if it proves to be robust, recommend we use it throughout our studies. The current COVID studies are causing some R version control problems at some sites, so this may be a really good opportunity; @krfeeney could probably inform where it’d make sense to do this. If we can, I’d like to incorporate this approach into the SCYLLA and CHARYBDIS packaging, @anthonysena and @jweave17.

I was also about to say wow! Even I could do that.


Awesome! I hope that renv can work in an offline environment.
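
If I read the renv documentation correctly, this should be possible as long as the packages can be served from an internal, CRAN-like mirror. A sketch (the option name is worth double-checking against the renv documentation, and the mirror URL is of course a placeholder):

# Have renv use a site-internal mirror instead of the repositories recorded
# in the lock file (hypothetical URL):
options(renv.config.repos.override = c(CRAN = "https://cran-mirror.internal.example"))

# Then restore the environment from the lock file as usual:
renv::restore()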


this looks great, @schuemie! Would definitely make debugging easier (at least in knowing the package versions would be correct and that we don’t have to deal with source installs). Let me know if you need Columbia to test anything out for you.

Hi @schuemie, this looks promising. However, I have seen that OS dependencies are very important. As an example, running packages on macOS is currently not a trivial endeavor, given some dependencies on software outside the R world. I think that Docker is the right move. A Docker image can be run and used from the command line or inside a development environment like PyCharm. Anything that does not self-contain the OS will always have reproducibility issues.

With Odysseus, we have been working on generating such images, and I believe we are at a point where we should be able to package an entire study in the Docker image. Like packaging SCYLLA and distributing that as a Docker image instead. This approach is also being used by the DREAM challenge, backed by AWS :slight_smile: . There are several examples that support the use of a strategy like this one. As a reference, Conda environments also have the same issue across OSes.

@schuemie Can you elaborate on why you have not been successful with Docker? Maybe @Konstantin_Yaroshove can help here. They have several Docker images on Docker Hub already.

@gregk @krfeeney @anthonysena @jweave17 @JamesSWiggins

Friends.

There is another requirement that we need to consider. In many (too many) healthcare, pharma, and payer settings, any internet traffic from R environments is blocked by IT security, as it is considered a security threat. Downloading packages on the fly simply would not work. So we need to ensure that all required packages are pre-packaged.

So, in my mind there are two things here:

  1. Packaging required dependent R libraries into studies for distribution (one possible approach is sketched below)
  2. A pre-built and tested execution environment
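
For #1, one possible approach is to build a local, CRAN-like repository that can be shipped or hosted behind the firewall. A sketch using the miniCRAN package (the package list is illustrative, and this is not an established OHDSI workflow):

# install.packages("miniCRAN")

# Resolve the full dependency tree of some CRAN dependencies of a study:
pkgs <- miniCRAN::pkgDep(c("SqlRender", "DatabaseConnector"), suggests = FALSE)

# Download them into a folder that can be distributed internally:
dir.create("local-repo")
miniCRAN::makeRepo(pkgs, path = "local-repo", type = "win.binary")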

For #2, Odysseus has developed the ARACHNE Execution Engine. This component is already being distributed as a Docker image and packages all relevant OHDSI packages, tested against the latest working versions of PLE and PLP and some other core libraries. In addition to that, the ARACHNE EE is capable of creating a clean execution environment for each run. The ARACHNE Execution Engine can be installed together with the ARACHNE Data Node, which allows someone to submit R or SQL executions and receive results back. My hope is that we will also standardize on how studies and results are packaged. The ARACHNE EE can also be installed with ATLAS.

Should we consider this as a part of a solution here as well?


Hi @gregk,

Some follow-up questions:

Can ARACHNE be self-contained, incorporating the study itself in the same Docker image?
What will the end-user experience be while developing and testing using ARACHNE?
How would someone use RStudio or PyCharm with ARACHNE as part of developing the study package?

This last question is where @schuemie’s solution is best, because it is part of the usual development workflow.

Also, to criticize my own proposal: the technical details of building a Docker image may escape the skill set of the people developing the study packages. That is the reason why I keep pointing to the DREAM challenge, where there is a sandbox environment to test against. The premise is that if the package runs against the sandbox, it will run in everybody’s environments.

Thoughts?

The idea is that ARACHNE Data Node/EE is an execution component.

My proposal was that this can be used by study creators to test packages before they are distributed. Then, on the data provider side, packages are uploaded and executed, and results are returned from the ARACHNE Data Node.

Exactly, ARACHNE Data Node/EE is that “sandbox”, or rather a QC/testing environment. It already exists and can be used as a great start.

RStudio or PyCharm is fine for code development, but you do not want to use RStudio or PyCharm for testing; those are local IDEs. The testing should occur in an isolated, clean, self-contained execution environment outside of local workstations and even shared R servers.

Can we have the best of both?

@schuemie’s approach for initializing an environment can be paired with initializing a Docker environment with the correct versions of the dependencies. Those who are unable to have direct connectivity to the internet can pull the Docker image over a secured connection for internal consumption. The ARACHNE EE can do similar work to set up the execution context.


Exactly. However, while in RStudio or PyCharm, you will do

library(foo)
someFunction()  # a function exported by foo

in your code. You would want to run that piece of code using exactly what ARACHNE has. That is why I have used a Docker container inside PyCharm while developing: this way, I can import from the container itself and not rely on my local machine. The environment on the local machine and the environment for the true test must be the same to ensure everything runs fine.

How do we bring those two worlds together, as @Chris_Knoll was suggesting? My idea is to have a Docker image that serves as your environment; I do not know if ARACHNE can fill that gap.

By the way, this is my longest thread in the forums :slight_smile:

I think the 3+ of us are saying the same thing. These are two Lego blocks of the same solution: #1 is how you package dependencies, and #2 is how someone first tests and then executes in the same consistent environment. These are not contradictory - they are complementary.

For those who have experience with Java, this would not be new: packaging your code, including dependencies, and testing and distributing for a specific version of the JVM.

“What has been will be again, what has been done will be done again; there is nothing new under the sun.”


@jposada Previously, I made a Docker image for OHDSI tools (CohortMethod / PatientLevelPrediction version 3).


Replying to Martijn’s initial post:

Can renv and lock files deal with installing just 64-bit versions of all packages? E.g., I typically install only the 64-bit version of R. If the lock file has both 32-bit and 64-bit versions, will it not fail?
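
As a quick way to check which build of R a session is running:

# Returns 8 on a 64-bit build of R, 4 on a 32-bit build:
.Machine$sizeof.pointer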

Thank you @SCYou! This looks great. Did you ever push it to Docker Hub? I will certainly look at this.

@gregk you are on point here. In that line of thought, it would be:

  1. A Docker image for having all dependencies in a single place
  2. ARACHNE as a way to distribute and execute the study packages

ARACHNE and the Docker image will need to be in sync somehow, right?
Am I summarizing correctly?

Just one more piece in favor of Dockerization: N3C is looking to do the same.

Responding to some of the comments:

  • No, renv will not solve the problem of 32-bit R versus 64-bit Java. It will still be necessary to install the 64-bit version of R only.

  • We did test renv on macOS and Linux, and it seemed to work fine (provided you don’t use a three-year-old Xcode install).

I’m all open to Docker, but would appreciate some instructions on how to get it running on my machine (I failed to get it to run on Windows in the past). Could anyone point me to instructions for installing Docker on Windows and, for example, running @SCYou’s study?

Hello @schuemie,

Installing Docker on Windows has become much easier recently. Docs: https://docs.docker.com/docker-for-windows/install/

The old way was to install Docker Machine (Docker Engine + a Linux VM), which didn’t work on Windows Home at all and sometimes required secret shaman rituals to succeed. With WSL 2, Docker can be installed on the Home edition as well.


I would like some help on this too (like @schuemie). I can get my mind around attaching to a Docker image and getting it installed on Windows (and can confirm that with a recent update of Windows, they made the BIOS settings to enable virtualization and other hurdles obsolete). What I can’t get my mind around is configuring the R environment for new projects, or distributing a configured Docker image. Do we host a Docker repository? I don’t think we just check an image into GitHub (although I understand there’s a Dockerfile which provides the instructions for initializing the context). I apologize that my knowledge on the topic is a bit limited, but I am trying to understand if there is redundancy between renv and Docker or not. My vision of the workflow here is:

Develop your OHDSI study (outside of Docker, just a normal R session environment).
Capture your dependencies with renv (or use @schuemie’s script to build the renv.lock file).
— development done
Get a clean Docker image for R.
Use renv to initialize the R environment.
Capture the Docker state and make it available for others to instantiate.
— deployment done

Here’s where my gap in understanding of Docker is: do you bundle the dependent assets that get executed in the Docker image with the image itself, or is Docker just a series of commands executed via the Dockerfile, within which you perform all the startup work to get the assets installed? I’m basing this question on what I see here.
