Strategus design discussion

schuemie · July 22, 2022, 6:32am

There has been a lot of discussion on Strategus, and specifically on the current design where each module has its own renv lock file. I think this is essential for Strategus to work, and have created the following simple scenario to try to explain why. I fully expect reality to be much more complex (favoring the renv-lock-file-per-module design), but this was the simplest scenario I could design to demonstrate the point.

Introduction to Strategus

For those unaware of what Strategus is: Strategus is both an R package and a new approach to doing studies in OHDSI that we’re developing. Strategus will have specifications in the form of JSON as input, and these specifications can include various modules, such as a CohortGenerator module, a CohortDiagnostics module, a CohortMethod module, etc. Each module will produce output that can be shared across the OHDSI network, uploaded to a results database, and viewed with a (modular) Shiny app. Basically, Strategus makes creating an OHDSI study a matter of putting together pre-defined building blocks instead of writing a custom study R package. For the most part, OHDSI studies can then be executed by sharing JSON, instead of sharing R code, and researchers can define OHDSI studies without having to know R.

Scenario

We have a network of databases Charlie that has contracted to do studies for a regulator. We also have methods researcher Marvin.

1. Charlie starts a study Alpha using modules A 1.0 and B 1.0. This study will need to be repeated every 6 months.
2. After years of growing, module A is finally completely refactored, with breaking changes to the interface in A 2.0.
3. Module B 2.0 is released with an important new feature.
4. Charlie performs study Beta requiring B 2.0.
5. Six months have passed, and Charlie reruns study Alpha.
6. Marvin develops a new module C that needs to be executed with A and B.
7. Marvin finds some sites (some part of Charlie) that are willing to execute methods study Gamma with modules A, B and C.

When using renv lock files per module

Having a lock file per module allows you to have different versions of the same module side-by-side.

Before 1, all sites in Charlie install Strategus with modules A 1.0 and B 1.0.
At 1, all sites run study Alpha. Based on the study specs, Strategus automatically selects modules A 1.0 and B 1.0 to execute.
Before 4, all sites in Charlie install modules A 2.0 and B 2.0, in addition to A 1.0 and B 1.0.
At 4, all sites run study Beta. Strategus automatically selects module B 2.0 to execute.
At 5, all sites run study Alpha again. Strategus automatically selects modules A 1.0 and B 1.0 to execute.
Before 7, Marvin creates a custom module C, with specs for study Gamma that tell Strategus where to download C.
At 7, the sites included in the study run Gamma. Strategus automatically selects the right versions of A and B, and automatically downloads C.
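A minimal sketch of the selection step described above, assuming a site keeps a set of installed (module, version) pairs and a study spec lists the versions it needs. The spec layout and the function name `select_modules` are hypothetical and do not reflect the actual Strategus JSON format.

```python
# Hypothetical sketch: module names, versions, and the spec layout are illustrative only.
installed = {("A", "1.0"), ("B", "1.0"), ("A", "2.0"), ("B", "2.0")}


def select_modules(study_spec, installed_modules):
    """Return the (name, version) pairs a study needs, or raise if one is not installed."""
    required = [(m["module"], m["version"]) for m in study_spec["modules"]]
    missing = [m for m in required if m not in installed_modules]
    if missing:
        raise LookupError(f"Modules not installed: {missing}")
    return required


alpha = {"name": "Alpha", "modules": [{"module": "A", "version": "1.0"},
                                      {"module": "B", "version": "1.0"}]}
beta = {"name": "Beta", "modules": [{"module": "B", "version": "2.0"}]}

print(select_modules(alpha, installed))  # [('A', '1.0'), ('B', '1.0')]
print(select_modules(beta, installed))   # [('B', '2.0')]
```

Because both versions of A and B are installed side by side, rerunning Alpha after Beta resolves to the original versions without any reinstallation.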

When using a single lock file for all of Strategus

Having a single lock file means you cannot have multiple versions of the same module. You can have different Strategus instances, each with its own specific versions of modules. The set of modules included is fixed.

Before 1, all sites in Charlie install Strategus 1.0, containing modules A 1.0 and B 1.0.
At 1, all sites run study Alpha using Strategus 1.0.
Before 4, all sites install Strategus 2.0 containing modules A 2.0 and B 2.0.
At 4, all sites in Charlie run study Beta using Strategus 2.0.
At 5 there’s a problem. The study Alpha specs no longer match the current version of Strategus. We can either try to transform the Alpha specs to A 2.0 and B 2.0, or ask all sites to run the study on Strategus 1.0. Both choices are fraught with issues.
At 7, Marvin is out of luck. His module C isn’t ready yet for inclusion in Strategus, and it never will be because he can’t test it to show it is great.

Please let me know where you think I’ve missed something. Looping in @Adam_Black, @edburn, @anthonysena, @jreps , @Patrick_Ryan

Adam_Black · July 24, 2022, 10:15pm

Thanks for starting this thread @schuemie. I really appreciate the ability to have open discussions with differing points of view which is fundamental to science and an open source software community. It’s one of the things that makes OHDSI great.

To summarize my response in a single sentence: You are missing a specification of the system you are building and the lack of a specification is leading you to an implementation that is more complex than it needs to be.

A specification is a document that describes what the system should do. The description is agnostic about how the system is implemented. A specification could be implemented in any general purpose programming language. A specification should also be minimal in the sense that it should give the minimum set of requirements necessary to accomplish the desired behavior. Everything I’ve read so far about system design recommends that a system have a design spec especially if it might involve distributed or concurrent processes. Terminology used in the specification needs to be clearly defined. The specification should be written before the implementation but the spec does not need to be set in stone. The waterfall model of development apparently does not work so we should think of the spec and the implementation as being developed through an iterative process. What I see is that significant progress is being made on Strategus without a formal specification or much feedback/discussion from HADES developers.

Your description above is a starting point but I think it includes some details that are not necessary and lacks at least one key component, a module repository manager.

System Specification

Let’s define the system under consideration as the sum total of all requirements you described above. The complexity of the system can be loosely defined as the number of units of information that a programmer needs to have in their mind when working on a part of the system. The same system can be modularized in different ways and some modularization schemes will be more complex, and thus harder for developers to work on, than others.

A module is a part of the system that represents a set of responsibilities needed for the system. The system can be partitioned into modules by partitioning responsibilities. Modules can be “written with little knowledge of the code in another module” and “reassembled and replaced without reassembly of the whole system.” [1] Since modules are a partitioning, the responsibilities of one module should not overlap with the responsibilities of another module. A module is a mapping from inputs (e.g. JSON file, data) to outputs (e.g. csv files). Modules have an interface that describes how a module interacts with the rest of the system and an implementation that is internal to the module. Modules have both an implementation and interface that can change over time. The set of responsibilities associated with a module is defined when the system is designed and is expected to remain fairly constant over time.

A module version is a module at a point in time. Whenever a module’s implementation or interface changes, a new module version is created.

A study specification is a plain text file that describes how modules (i.e. building blocks) are put together to make an OHDSI study. Study specifications are the input to Strategus.

A site consists of three things

a place where modules can be installed
an environment for strategus to select modules and run them
One or more databases that are needed as input to the modules

Multiple versions of the same module can be installed at a site. Strategus automatically selects the correct module version to run during study execution.

There is one missing component to the system that I’d like to add. A module repository manager is the shared place where modules are installed from. Sites install modules from the module repository manager.

Study requirements

Let’s describe the study as an ordered set of module versions. (In reality I think this should be a partially ordered set since some modules can be executed concurrently but a complete ordering is simpler for this first pass).

Study Alpha = {A1, B1}

Study Beta = {A1, B2}

Study Gamma = {A1, B2, C1}

A study may not contain two different versions of the same module.

Since modules are partitions of responsibilities I’m not sure why one would need to use two different versions of the same module in a single study.

Modules in a study must be executed sequentially from right to left. For example, in study Alpha = {A1, B1}, module B1 would be executed before module A1.
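The two study requirements above (no two versions of the same module, execution from right to left) can be sketched as follows. This is an illustration only; representing a study as a list of (module, version) tuples and the function names are assumptions made for the example.

```python
def validate_study(study):
    """Reject a study that lists two versions of the same module."""
    names = [name for name, _version in study]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"Multiple versions of the same module: {duplicates}")


def execution_order(study):
    """Modules run from right to left, as stated in the requirement above."""
    validate_study(study)
    return list(reversed(study))


alpha = [("A", "1"), ("B", "1")]
gamma = [("A", "1"), ("B", "2"), ("C", "1")]

print(execution_order(alpha))  # [('B', '1'), ('A', '1')]
print(execution_order(gamma))  # [('C', '1'), ('B', '2'), ('A', '1')]
```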

Module repository manager requirements

The module repository manager must make all module versions available to all sites at all times. Any site can ask the repository manager for a specific module version at any time.
The module repository manager also guarantees that the most recent version of all modules can be used together in a study.
Given a date, the module repository manager can return the complete list of module versions that were “the most recent versions” on that date. For each date the module repository manager returns a list of all modules that were available on that date and guaranteed to work together.
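As a rough sketch of the date lookup, assuming the repository manager keeps a release history of (module, version, release date) records: the history below is invented, and the sketch only picks the latest release per module. It does not check that the returned versions actually work together, which the requirement above also asks for.

```python
from datetime import date

# Hypothetical release history: (module, version, release date). Not real data.
releases = [
    ("A", "1.0", date(2020, 1, 15)),
    ("A", "2.0", date(2022, 3, 1)),
    ("B", "1.0", date(2020, 2, 1)),
    ("B", "2.0", date(2022, 4, 10)),
]


def versions_as_of(releases, as_of):
    """Return {module: version} for the latest release of each module on or before as_of."""
    latest = {}
    for module, version, released in sorted(releases, key=lambda r: r[2]):
        if released <= as_of:
            latest[module] = version  # later releases overwrite earlier ones
    return latest


print(versions_as_of(releases, date(2021, 6, 1)))   # {'A': '1.0', 'B': '1.0'}
print(versions_as_of(releases, date(2022, 3, 15)))  # {'A': '2.0', 'B': '1.0'}
```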

Module requirements

When a module is installed it is installed with all required dependencies. A key question is if modules need to specify their complete list of recursive dependencies or just their direct dependencies. Either way when a module is installed it must be treated as essentially self contained and be able to support its interface. An installed module = a module that has everything it needs to support its interface.

Site requirements

A site should be able to execute any study at any point in time.

Sites are able to install any module version from the module repository manager. Sites may have multiple versions of the same module installed at the same time.

Strategus requirements

Strategus is responsible for the execution of studies at a site. Strategus may or may not have access to the module repository manager but will have access to all installed modules at a site. At the beginning of each study Strategus instantiates an execution environment. In the execution environment all required modules for each study are made available. Since studies cannot contain multiple versions of the same module the execution environment will also not support multiple versions of the same module. After study execution the execution environment is completely removed and does not persist.

The input strategus receives is represented as an ordered set of studies.

{…, Gamma, Beta, Alpha} which is the same as {…, {A1, B2, C1}, {A1, B2}, {A1, B1}}

Strategus can be in one of the following states:

1. Waiting for study
2. Instantiating study execution environment
3. Executing study modules in order
4. Removing study environment

The state transition here is pretty simple 1 → 2 → 3 → 4 → 1

I do think that state 3 might need to be modified if we want to have concurrent execution of modules and we need to consider how an error state might be handled.
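The four states and the 1 → 2 → 3 → 4 → 1 cycle could be written down as a small state machine. This is only a sketch: the enum and function names are made up for illustration, and it deliberately leaves out the error state and concurrent module execution mentioned above.

```python
from enum import Enum


class State(Enum):
    WAITING = 1        # waiting for study
    INSTANTIATING = 2  # instantiating study execution environment
    EXECUTING = 3      # executing study modules in order
    REMOVING = 4       # removing study environment


# The cycle 1 -> 2 -> 3 -> 4 -> 1 described above.
NEXT = {
    State.WAITING: State.INSTANTIATING,
    State.INSTANTIATING: State.EXECUTING,
    State.EXECUTING: State.REMOVING,
    State.REMOVING: State.WAITING,
}


def run_study(start=State.WAITING):
    """Walk one full cycle and return the list of states visited."""
    visited = [start]
    state = NEXT[start]
    while state is not start:
        visited.append(state)
        state = NEXT[state]
    visited.append(state)  # back where we started
    return visited


print([s.name for s in run_study()])
# ['WAITING', 'INSTANTIATING', 'EXECUTING', 'REMOVING', 'WAITING']
```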

This is a first pass at a specification for the system you’re describing that needs iteration and refinement. I think this system could be implemented in R using functions and packages to represent the modules. After all, there are already tons of modules implemented in R using functions and packages. I don’t think there is a need to come up with a new representation of what a module is in R.

Since we want studies to execute with the exact same code every time I think it would make sense for a study to include a complete list of all recursive dependencies (renv.lock file). This is exactly the use case for renv. It looks to me like the current implementation of strategus is using renv to create a whole new implementation of modularization in R where modules include a complete list of their recursive dependencies (example). Your current implementation has both packages and “modules” essentially doubling the number of Hades analytic code repositories. I don’t think modules need to contain a complete list of recursive dependencies for this system to work.

I’ve been critical of using R packages for studies in the past and now I’m advocating their use for modules so I want to clarify my perspective.

Study specification = plain text file(s) that describe what the study is
Modules = R packages and functions that partition the responsibilities of the system

Reference & resources

Parnas’ paper on modules: https://www.win.tue.nl/~wstomv/edu/2ip30/references/criteria_for_modularization.pdf
Leslie Lamport lecture: https://youtu.be/-4Yp3j_jk8Q
TLA+: The TLA+ Home Page
A Philosophy of Software Design by John Ousterhout

Adam_Black · July 28, 2022, 1:13am

I think the problem of how to handle module dependencies is not so different from how to handle vocabulary dependencies. If I replace “module” with “concept set expression” and “module version” with “vocabulary version” we can see the similarities.

Circe-be cohorts depend on concept set expressions which in turn depend on a vocabulary version.

The same concept set expressions might resolve to a different list of concepts when a cohort is generated based on the version of the vocabulary being used.

The same R package might resolve to a different set of installed software when install.packages("Hades") is called based on the version of the package repository being used.

If the vocabulary mappings are improved the same concept set expression can (under certain conditions) automatically pick up the improved mappings without being changed at all.

If one of an R package’s transitive dependencies is improved (say a more efficient algorithm is implemented in a low level library) the R package will automatically pick up that improvement at install time without being changed at all.

With R packages it shouldn’t be too difficult to make multiple versions of a package available at a site. Suppose I have an R script I would like a partner to run. If internet access is available I could simply tell the partner to install packages from this url “https://packagemanager.rstudio.com/all/2022-07-26+Y3JhbiwyOjQ1MjYyMTU7RTk2NzAxM0Y” and they would get the exact same set of packages that I’m using in my environment (all of CRAN as it exists on July 27, 2022). (I realize it’s going to be hard to get all our R code on CRAN which is why I’d like to set up an OHDSI package manager.) This url will produce the same set of software at any point in time in the future. If internet access is not available then multiple package tarballs could be stored in a “package cellar” and renv could be used to restore a study specific R session. Of course there are also other environment dependencies like system libraries, R versions to consider as well.

I think the vocabulary dependency is actually a harder version of this same dependency problem because CDM databases are tightly coupled to a vocab version so you can’t update the vocab without also updating the ETL. A new vocab version means a new version of the CDM data must be created. Multiple vocab versions “installed” requires multiple CDM versions which can quickly require a large amount of storage space and ETL work.

@Chris_Knoll - does the similarity between the software dependency problem and the vocab dependency problem make sense to you?

schuemie · July 29, 2022, 11:59am

Thanks @Adam_Black , I think that is very helpful.

We do have specifications somewhere, originally written when there were just a few people working on Strategus (mostly @anthonysena and me). Me, I’m more a ‘the code is the specifications’ person, although I also admit I’m not a terribly good software developer.

I think your specifications are spot-on. However, I am going to disagree with you on the notion that we can have different interpretation to the same study, similar to how a concept set expression can resolve to different concepts depending on the vocabulary version. For scientific reproducibility, I think it is essential that a study is implemented as an exact set of code. Extending that to concept set expressions and the vocabulary, I strongly believe a study, once complete, should record the version of the vocabulary used at a site to run that study, for maximum reproducibility.

I think there’s an additional requirement that we haven’t expressed very well. Let me try and describe it as ‘the desire to separate installation and execution’. Some OHDSI partners very much want to be able to install Strategus with a set of modules, and freeze that for some time while they execute studies making use only of the installed modules.

You mention that ‘I think it would make sense for a study to include a complete list of all recursive dependencies (renv.lock file)’. I think this goes against this requirement. A lock file could contain dependencies not already installed at a site. I prefer your definition that a study specification is ‘a plain text file that describes how modules (i.e. building blocks) are put together to make an OHDSI study.’ The required dependencies should be implied by the study specifications.

I wasn’t aware you could use the RStudio package manager to reproduce CRAN at a specific point in time! We could actually do something similar on-the-fly for HADES by using the timestamps of the git tags.

Which brings me to the Module Repository Manager. I was thinking this role is already fulfilled by Github. We currently explicitly reference Github in the Strategus analysis specifications (see for example here). The only functionality it currently doesn’t fulfil is, given a date, provide the most recent versions at that date, but as mentioned before I think that should be trivial to implement somewhere. But could you help me understand where that fits in your design? At what point do we need to resolve a date to a set of HADES package versions?

Chris_Knoll · July 30, 2022, 2:01am

Hi, Adam,
I’m sorry I didn’t reply sooner; your post gave me a lot to think about and I wanted to take the time to give a clear understanding of my perspective. I hope that my thoughts align with Martijn’s.

In my mind, I don’t see circe cohorts and concept set expressions as having an explicit dependency on a vocabulary version. So, I feel this is different than saying ‘ggplot 3.4 depends on dplyr version 1.2 or greater’. I see cohort definitions and concept set expressions as just a way of expressing a query. It’s up to the user to decide if it makes sense to apply it on a CDM with one version of the vocabulary or another. The important thing is that if we do produce evidence from a cdm, it’s important to note which cohort definition /concept set expression you used (these could be versioned artifacts) and which version of the vocabulary you used, and maybe even which version of ETL was used to build the CDM. Any variation in these elements could lead to different results when you try to ‘reproduce’ it.

To illustrate this point as ‘it’s just a query’, consider the following

select drug_concept_id, count(*) as n from cdmSchema.drug_era group by drug_concept_id

You run this on one cdm, and you’ll get a result. The result depends on the version of the vocabulary, and the ETL that built the cdm. It depends on the vocabulary because concept_ancestor defines ingredient rollups. It depends on the ETL version because that defines which people got populated in the CDM tables.

So, should the query select drug_concept_id, count(*).... declare a dependency on a particular version of the vocabulary or ETL? I’d say ‘no’, but if I were to present the results of this query, I should capture the CDM ETL version and the vocabulary version in order to reproduce this result later.

I think I’d like to draw a distinction between using dependency versions to indicate ‘things work’ vs. indicating some unique identifier to the state of the artifact (ie: this is concept set v 1.0 (an identifier) vs. this concept set only works if you have vocabulary v5.3.24). If we had something like a versioned release of Phenotype Library, you could say things like ‘this is cohort 1245 (T2DM) from Phenotype Library v3.4’, allowing people to fetch the exact representation to attempt to reproduce your result. Same thing with Strategus modules: you declare which versions you want to use in a study, and if you want to reproduce it later, you know how to reproduce the execution environment by the version of the modules (which in turn embeds an renv.lock file to load the correct dependencies for that version).

As far as the ‘date-based’ package versioning that came out of rstudio/cran, no offense to the R devs and the brilliant work that they have done, but date-based package versioning makes no sense. Never have I filed a git issue where part of the issue was ‘as of what date was this software last working for you?’… it was always ‘what version are you running, and what version was it last working for you’. Another problem with locking the R packages to a date is that if you need a bugfix for one specific package that was released 2 months after your ‘freeze date’, if you advance the timestamp to 2 months later, you don’t just get the fix but you get everything else that was released in those 2 months. That’s why the renv.lock approach for managing the environment dependencies is far superior: you can explicitly choose which packages at which version you want to load into your environment. If you have to fix one dependency, you can control it that level.
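To make the bugfix scenario concrete, here is a toy comparison of the two approaches. The package names, versions, and dates are invented for illustration, and the dictionaries only stand in for a date-based snapshot and a lock file; this is not how renv or the package manager are implemented.

```python
from datetime import date

# Hypothetical release history; package names and dates are invented for illustration.
releases = {
    "pkgA": [("1.0", date(2022, 1, 1)), ("1.1", date(2022, 3, 1))],   # 1.1 is the bugfix
    "pkgB": [("2.0", date(2022, 1, 1)), ("3.0", date(2022, 2, 15))],  # 3.0 is unrelated
}


def snapshot(as_of):
    """Date-based resolution: newest version of every package on or before as_of."""
    return {
        pkg: max((r for r in history if r[1] <= as_of), key=lambda r: r[1])[0]
        for pkg, history in releases.items()
    }


frozen = snapshot(date(2022, 1, 31))    # the original freeze date
advanced = snapshot(date(2022, 3, 31))  # moving the date forward to get the bugfix
pinned = {**frozen, "pkgA": "1.1"}      # lock-file style: change only pkgA

print(frozen)    # {'pkgA': '1.0', 'pkgB': '2.0'}
print(advanced)  # {'pkgA': '1.1', 'pkgB': '3.0'}
print(pinned)    # {'pkgA': '1.1', 'pkgB': '2.0'}
```

Advancing the date pulls in pkgB 3.0 along with the fix, while the pinned set changes only the one package.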

Ok, I think I’ve spammed enough. I am not sure if I answered all of your questions directly, but if there’s something you’d like me to dig into specifically, please let me know and I’ll try to answer.

-Chris

Christian_Reich · July 30, 2022, 7:57am

Simple question from an (innocent) bystander: Aren’t the viewer modules dependent on the output producing modules? Or do you intend to lock in the format of the output indefinitely?

Which gets me to the next question: Are the JSON configuration/instruction/parameter files independently defined and versioned? Or do they live in complete dependency from the module versions?

schuemie · August 1, 2022, 7:12am

The viewer indeed depends on the output of a module. The output models will themselves be versioned, and a module will be responsible for providing migration scripts of its output that can automatically migrate to newer versions as required by the Shiny apps.
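A minimal sketch of what chained output migrations might look like, assuming each module ships one migration function per version step. The version numbers, column names, and the `migrate` helper are hypothetical and are not taken from any existing HADES package.

```python
# Hypothetical migrations for one module's results; versions and field names are made up.
def v1_to_v2(rows):
    # v2 renames the "count" column to "n"
    return [{"cohort_id": r["cohort_id"], "n": r["count"]} for r in rows]


def v2_to_v3(rows):
    # v3 adds a "database_id" column with a placeholder default
    return [{**r, "database_id": "unknown"} for r in rows]


MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}


def migrate(rows, from_version, to_version):
    """Apply migrations one version at a time until to_version is reached."""
    if to_version < from_version:
        raise ValueError("Downgrades are not supported in this sketch")
    for version in range(from_version, to_version):
        rows = MIGRATIONS[version](rows)
    return rows


old = [{"cohort_id": 1, "count": 42}]
print(migrate(old, 1, 3))
# [{'cohort_id': 1, 'n': 42, 'database_id': 'unknown'}]
```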

Are the JSON configuration/instruction/parameter files independently defined and versioned?

We haven’t implemented that yet, but that is definitely in the planning. We’re looking at JSON Schema to fully specify the JSON input, and these inputs will be versioned.

Christian_Reich · August 1, 2022, 8:25am

Sounds good.

Should be part of @Adam_Black’s specs.

Also: the idea of a module that does up and downgrading of the output of different method module versions (à la MS Word) so that newer visualizers can process older version method modules and the other way around is an addition to the “all versions are independent and parallel” logic you are trying to establish. Nothing wrong, but that definitely needs explicit declaration.

schuemie · August 2, 2022, 9:21am

Thinking about what @Adam_Black said some more, I realize we’re trying to support at least two types of research networks (where a network can also just be the local site, like a university or pharma company using their in-house data). The first type I’ll call ‘closed’, where only software that has been previously installed (probably after careful vetting) can be used in studies. The second type is ‘open’, in that it does allow studies to use the latest and greatest, possibly even including custom modules.

Based on @Adam_Black’s comments, I see 3 options for handling dependencies in the Strategus context:

1. Each module specifies its own dependencies (a renv lock file per module).
2. Each study specifies its dependencies (a renv lock file per study).
3. We define ‘Strategus releases’ that include all module versions at that time (renv lock file across all modules).

Each option has pros and cons:

Option 1 pros:

Separates installation from execution, thus supporting closed networks. As long as studies use the module versions already installed at a site, no new software needs to be installed.
As soon as a new module version is released, open networks can use it in a study.
Allows for custom modules, with their own dependencies, to be used in open networks.

Option 1 cons:

Requires separate modules layer where dependencies are specified (i.e. where the renv lock files live)

Option 2 pros:

Allows flexibility in included modules, and could use custom modules
Does not require a separate modules layer. Each HADES package could serve as a module.

Option 2 cons:

Does not separate installation from execution: each study could require installing software. This would probably be unacceptable for closed networks.
Creating the renv lock file for the study will require deep knowledge of what you’re doing, and may be prone to error.

Option 3 pros:

Separates installation from execution, thus supporting closed networks. As long as studies use the installed versions, no software needs to be installed.
Does not require a separate modules layer. Each HADES package could serve as a module.

Option 3 cons:

Does not allow for the option to include custom modules that do need to be downloaded. This would probably be unacceptable for open networks.
Requires an extra layer of releases (at the Strategus level), so could be a long time before a feature needed for a study is available in a Strategus release.
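The closed versus open distinction under option 1 can be illustrated with a small check. This is a sketch under assumptions: the `can_run` function, the `"open"`/`"closed"` labels as strings, and the module versions are all made up for the example.

```python
def can_run(study_modules, installed, network_type):
    """Closed sites run only preinstalled module versions; open sites may download the rest.

    Returns (allowed, missing), where missing lists module versions not yet installed.
    """
    missing = sorted(set(study_modules) - set(installed))
    if not missing:
        return True, []
    return network_type == "open", missing


installed = {("A", "1.0"), ("B", "2.0")}
gamma = [("A", "1.0"), ("B", "2.0"), ("C", "1.0")]  # C is Marvin's custom module

print(can_run(gamma, installed, "closed"))  # (False, [('C', '1.0')])
print(can_run(gamma, installed, "open"))    # (True, [('C', '1.0')])
```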

jpegilbert · August 2, 2022, 3:43pm

@Christian_Reich, I have started some initial design specifications for handling changes to results models to allow any web apps/standard reporting tools to continue to work with new data.

I will likely start a separate thread for this, as it’s slightly outside the scope of the Strategus package.
This has been designed based on requirements I had for CohortDiagnostics, with code currently living in this PR.

We found @jreps also has this requirement and, as we develop Strategus modules, it’s extremely likely that most existing packages will want to make changes to their data models. The aim is that we don’t need to write one-off hacks to support multiple DDLs or results created in other formats.

schuemie · August 3, 2022, 1:33pm

We’ll have a Teams meeting to continue this discussion on August 9, 8:30am - 10:30am Eastern Time. Meeting link is here.

Christian_Reich · August 3, 2022, 4:13pm

Couple points:

I am probably not the one to drive these conversations.
This is very cool. However, it is probably a little hard to read. For example, you may show real-life examples of result schemas and their versions and how they differ and break.
You have two problems at hand: result schema format incompatibility, and result content definition incompatibility. For example, you could change the result schema for an incidence report from generic to specifying the time interval, e.g. monthly. That breaks the format. Or the incidence calculator changes the way it handles the numerator or denominator. That breaks the meaning of the result. Right now, none of that is clear from the text.
Who is “azimov”, or why is this not part of the Strategus overall application?
(really Christian’s odious grammar pickiness: plural of composite nouns is formed at the end. So, it should really be called “ResultModelManager”, even if there are many results)

jpegilbert · August 3, 2022, 5:21pm

Thanks for the feedback, Christian.

you may show real-life examples of result schemas and their versions and how they differ and break.

This is a good point and will be helpful for developers implementing the package.

For example, you could change the result schema for an incidence report from generic to specifying the time interval, e.g. monthly. That breaks the format.

In this context it is up to the developer to make sure conversions are compatible and design any data transformations. This will always be a limitation and requires careful consideration when making design choices.

Or the incidence calculator changes the way it handles the numerator or denominator

If there are fundamental changes to the output of the source package this does create problems that it may not be possible to overcome. I fully expect there will still be situations where results in a given format will no longer be compatible. I think the only way to overcome this is through careful design choices, such that calculations are performed from data that is not removed.

However, it is assumed that this package will never have access to the original source patient level data. In this context I will try and think of ways where we can say that a given results model is now deprecated and can never be upgraded. This is still a big improvement, for example in CohortDiagnostics it is just assumed that previous versions are no longer compatible but they may work.

Who is “azimov”, or why is this not part of the Strategus overall application?

Azimov is my personal github… I didn’t want to create an OHDSI package until the initial specifications had been created.

(really Christian’s odious grammar pickiness: plural of composite nouns is formed at the end. So, it should really be called “ResultModelManager”, even if there are many results)

This is correct, the final ohdsi package will fix this (iff we keep the same name).

Adam_Black · August 10, 2022, 12:08am

Thank you all for the synchronous discussion today and for the posts on this thread. I found it helpful to have a chance to discuss the design, debate some of the tradeoffs that are being made, and listen to the perspectives of others in the community. I’ll paraphrase one of my favorite exchanges.

“OHDSI should define the interfaces and be supportive of any implementation” (i.e. Just as it does not matter what software you used to map source data to the CDM it shouldn’t really matter what software you used to map CDM data to results as long as it conforms to some set of standards.)

“Good interfaces are developed in tandem with their implementation so OHDSI should be creating the official implementation.”

I agree with both and I think those perspectives highlight an important tension in OHDSI.

I don’t want to stand in the way of Strategus development and will wholeheartedly support the design decisions of the other developers if there is general consensus. However, I do feel strongly that Strategus should be one way of implementing an OHDSI network study and not the only way to implement an OHDSI network study.

There were two definitions that I tried to draw attention to today:

An execution environment is a virtual machine or docker container with R, Rtools, python, (possibly anaconda), Java installed and an R package cellar that contains the tar.gz files for all Hades R package dependencies.

An OHDSI distribution is a complete versioned set of OHDSI software needed to run OHDSI studies.

Thanks for a great discussion. I’ll keep working on a spec and try to capture the requirements discussed today.

schuemie · August 10, 2022, 6:48am

Thanks @Adam_Black ! I too found it a good discussion, with lots of food for thought. To be clear: I do understand your concerns about the extra layer of complexity we’re introducing. I just don’t understand what the alternative is.

Something I think is important to mention is that the current approach to modularization isn’t just about standardizing the input. I agree we could have a more generic approach to creating inputs, maybe even having a 1-to-1 relationship between the JSON and R function calls (which is what we already have in many modules). But it is also about standardizing the output, and specifically about generating output that fits in a relational database. The output of most R functions does not fit in a relational database. So Strategus isn’t just about orchestrating the execution of various R functions in HADES packages, it is also about building a coherent and consistent evidence base from the outputs.

Adam_Black · August 10, 2022, 11:59am

I think output standardization will be an important contribution. Could this be thought of as an OHDSI results common data model? I’m somewhat aware of the common evidence model but don’t know much of the details.

My primary objection to the Strategus design is that it allows for conflicting dependencies within a single study, which is an unnecessary complexity. The alternative to one renv lock file per module (Option 1) is to use both options 2 & 3. I’ll try to explain.
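To make “conflicting dependencies within a single study” concrete, here is a minimal sketch in Python (not part of Strategus) that compares the lock files of several modules and reports any package pinned at more than one version. It relies only on the renv.lock layout, a JSON object whose "Packages" entries each carry a "Version" field. The module names, package names and versions are hypothetical.

```python
import json


def load_versions(lock_text):
    """Return {package: version} from the JSON text of an renv.lock file."""
    lock = json.loads(lock_text)
    return {name: entry["Version"] for name, entry in lock.get("Packages", {}).items()}


def find_conflicts(module_locks):
    """Report packages that different modules pin at different versions.

    module_locks maps a module name to the text of its renv.lock file.
    Returns {package: {version: [modules pinning that version]}}.
    """
    seen = {}
    for module, lock_text in module_locks.items():
        for pkg, version in load_versions(lock_text).items():
            seen.setdefault(pkg, {}).setdefault(version, []).append(module)
    return {pkg: versions for pkg, versions in seen.items() if len(versions) > 1}


# Hypothetical lock files for two modules used in the same study.
module_locks = {
    "moduleA": json.dumps({"Packages": {
        "pkgX": {"Package": "pkgX", "Version": "1.0.0"},
        "pkgY": {"Package": "pkgY", "Version": "3.1.0"},
    }}),
    "moduleB": json.dumps({"Packages": {
        "pkgX": {"Package": "pkgX", "Version": "2.0.0"},
        "pkgY": {"Package": "pkgY", "Version": "3.1.0"},
    }}),
}

conflicts = find_conflicts(module_locks)
print(conflicts)
```

With one lock file per study, a check like this would have nothing to report by construction. With one lock file per module, a study can legitimately contain such conflicts.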

Suppose today we deem the current versions of all Hades packages to be Hades v1.0. We set up RStudio Package Manager and load it with all the packages needed to use all of Hades. We create a static/frozen url that will forever point to the complete set of packages included in Hades v1.0 and all the required dependent packages.

If all our packages and dependencies were in CRAN (which would be ideal in my opinion) we could use this URL https://packagemanager.rstudio.com/all/2022-08-09+Y3JhbiwyOjQ1MjYyMTU7RERFQkQ0Qjg
(@Chris_Knoll: just treat the date in the URL as an arbitrary character string.) If we can’t get them on CRAN, I’d suggest setting up an OHDSI instance of RStudio Package Manager.

When you create a study you have the option of building your study off of the frozen/static repository URL or the “latest” repository URL.

Open sites with access to the package manager should be able to run any study built off of the “latest” repository URL. Closed sites can only run studies built off of one of the previous “frozen” URLs.

When a closed site “installs” OHDSI, it creates an execution environment that could be a docker container or linux VM that contains, among other things, a package cellar. The cellar contains the tarballs of all versions of all R packages in the package manager up to the latest distribution and thus acts as a local copy of the entire package manager.

My suggestion would be to make use of ExecutionEngine, which provides a REST endpoint for R or SQL code to be executed inside the execution environment. Regardless, the study execution consists of just three steps:

Create a new R session and call renv::restore with the study lock file
Run the R code (Could be any R code)
Shut down the R session
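The three steps above can be sketched as a small Python wrapper around Rscript. This is only an illustration under stated assumptions: the function names are hypothetical, R and renv are assumed to be installed in the execution environment, and the working directory is assumed to be the study project. How the restored library gets activated for the study code is deliberately left out.

```python
import subprocess


def build_study_command(lockfile, study_script):
    """Build an Rscript invocation that restores the lock file, then runs the study."""
    for path in (lockfile, study_script):
        # Keep the generated R code trivially safe: reject characters
        # that would need escaping inside an R string literal.
        if '"' in path or "\\" in path:
            raise ValueError(f"unsupported character in path: {path!r}")
    r_code = (
        f'renv::restore(lockfile = "{lockfile}", prompt = FALSE); '
        f'source("{study_script}")'
    )
    return ["Rscript", "-e", r_code]


def run_study(lockfile, study_script):
    """Step 1: a fresh R process starts. Step 2: restore, then run the study code.
    Step 3: the R session ends when the process exits."""
    return subprocess.run(build_study_command(lockfile, study_script), check=True)


command = build_study_command("renv.lock", "study.R")
print(command)
```

Calling `run_study("renv.lock", "study.R")` would execute the command. It is not called here because that requires a working R installation.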

An outstanding question in my mind is whether you are planning to integrate Strategus into Atlas/WebAPI or to create a new UI.

I think the most important contribution of Strategus will be the standardization of outputs, which everyone who wants to implement a study can use as guideposts, analogous to the role of the CDM spec in ETL.

More generally, maybe there is an opportunity to define the minimum requirements of an OHDSI study that are agnostic to implementation and allow us to compute on study specifications and results. Harold Lehmann had some work on this that he presented at a previous Hades call.

Chris_Knoll · August 10, 2022, 2:29pm

I’d recommend against this: it leads to what is called ‘leaky abstraction’. I don’t think we’d want to see json properties that are associated with calls to getDbCohortMethodData() or createStudyPopulation(). These are implementation-specific details, and just by standardizing on the inputs/outputs of an overall unit of work, you shield users from the lower-level details, allowing you to change the implementation without breaking the standard (that is, you can refactor code or clean up your methods without exposing changes in the json specification).

The reason why Strategus Modules are so closely mapped to the json specifications we’ve built so far is that the responsibility of the module is to map the json payload to the set of function calls that are needed from the underlying implementation. Circe is a good example of this in action, and while the only public implementation is the spec-to-sql implementation, there’s no reason why a spec-to-document_store couldn’t be implemented to translate the cohort expression specs into document-store queries. These Strategus modules are currently spec-to-HADES_packages but there’s no reason why they couldn’t be implemented as spec-to-SAS_packages. I don’t see much value in HADES trying to find alternative implementations for these specifications, but it does introduce the possibility for other communities to engage (compete?) with HADES implementations.

I don’t think we can be that ignorant about how this is functioning: being a time-based value (the rpackage manager devs even describe it as a ‘snapshot’), the progression between releases is linear. How would you support the following:

                 o -- Hades 1.1--- Hades 1.2             o - Hades 3.1
                /                                       /
Hades 1.0 ---- o ----- Hades 2.0 -- o ---- Hades 3.0 - o --- Hades LATEST
                                     \
                                      o --- Hades 2.1 --- Hades 2.2

In case you’re wondering, this is based off the release branching of WebAPI. Hades 1.2 may have been released after 2.0 was complete, for those who needed to stick to the 1.x line because they were unable to shift to the 2.x line for some reason (hint: andromeda, hint: study packages changing to renv). renv lock files solve this problem because an renv lock file contains the specific versions of things, regardless of when they were released. The primary reason I dislike CRAN’s timestamp-based approach is that we have no control over what’s in CRAN at a given date, so we’re forced to pick up all other version updates when we just want to grab a new one that was released on date X. If OHDSI owned its own RPackageManager instance, we could control all the versions of the libraries, but I still don’t see how we could do ‘side-releases’ (like HADES 1.2 alongside HADES 2.1).
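A toy model may help show the difference between the two resolution strategies. This Python sketch makes a simplifying assumption that is mine, not RStudio Package Manager’s documented behavior: a date-based snapshot serves the highest version published on or before the snapshot date. The release history is hypothetical and mirrors the branching diagram above, with 1.2 published after 2.0.

```python
from datetime import date

# Hypothetical release history with a side release (1.2) after 2.0.
RELEASES = [
    ("1.0", date(2022, 1, 1)),
    ("1.1", date(2022, 3, 1)),
    ("2.0", date(2022, 6, 1)),
    ("1.2", date(2022, 8, 1)),
]


def version_key(version):
    """Turn '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve_by_snapshot(releases, snapshot):
    """Date-based model: highest version published on or before the snapshot date."""
    available = [version for version, published in releases if published <= snapshot]
    return max(available, key=version_key) if available else None


def resolve_by_lock(releases, pinned):
    """Lock-file model: exactly the pinned version, whenever it was published."""
    if pinned not in {version for version, _ in releases}:
        raise LookupError(f"version {pinned} was never released")
    return pinned


snapshot_results = {published: resolve_by_snapshot(RELEASES, published)
                    for _, published in RELEASES}
locked = resolve_by_lock(RELEASES, "1.2")
print(snapshot_results, locked)
```

Under this model no snapshot date ever yields 1.2, because 2.0 already exists by the time 1.2 is published, while a lock file can pin 1.2 directly.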

Thanks again for the great conversation, and I hope we can come to a decision that everyone will appreciate.

Adam_Black · August 10, 2022, 6:39pm

Chris_Knoll:

How would you support the following:

                 o -- Hades 1.1--- Hades 1.2             o - Hades 3.1
                /                                       /
Hades 1.0 ---- o ----- Hades 2.0 -- o ---- Hades 3.0 - o --- Hades LATEST
                                     \
                                      o --- Hades 2.1 --- Hades 2.2

What I’m proposing would not support this (as a feature not a bug). It would only support monotonic versioning for Hades R packages. I feel like this is trying to make R work like Java (renv.lock ~ pom.xml). Non-monotonic versioning is a complexity we don’t need to introduce. In your example there are 3 different versions of Hades (~24+ R packages) being actively maintained so the current maintenance burden is multiplied by 3. Do we really need it?

jmethot · August 10, 2022, 7:02pm

I’ve been following this discussion closely. Thank you for having it “out in the open”.

I’m not an R user or expert. I am a leader in an Informatics department at a healthcare institution that is just embarking on the OHDSI journey. I hope in a year or two our department is being called on to support numerous investigators participating in OHDSI network studies in oncology.

From that perspective this discussion seems to assume network participants will manage the byzantine R ecosystem including R versions, CRAN libraries and access, etc. I realize one goal of this discussion is to simplify that.

@Adam_Black wrote:

An execution environment is a virtual machine or docker container with R, Rtools, python, (possibly anaconda), Java installed and an R package cellar that contains the tar.gz files for all Hades R package dependencies.

I’m inferring a comma before the “and” and that the expectation is the local site will configure the proper R environment. If the intent is that the docker image contains the study-specific R environment then I’m happy and you can stop reading.

If not, I propose it should. The ultimate simplification is to package both the execution environment and the specific R environment into a single docker image per network study (which obviously might be versioned). The docker image would be parameterized with the CDM connection details and whatever else is necessary to execute against the local CDM and store/return results. The Strategus project would create packaging tools (or at least instructions) for study designers to produce the docker image, and a docker repository to manage and serve the study images.

This would permit unlimited network studies to proceed in parallel without anyone but the study designer(s) thinking about R versions and libraries, including HADES. It has the added benefit of permitting non-R objects in the network study docker image, such as shell or Python scripts, reference data, etc.
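As a rough illustration of what “parameterized with the CDM connection details” could look like, here is a minimal Python sketch of an entrypoint helper that reads site-specific settings from environment variables, which is a common way to pass parameters to a container. The variable names are hypothetical, not an agreed OHDSI convention.

```python
import os

# Hypothetical setting names; a real study image would define its own.
REQUIRED_SETTINGS = ("CDM_DBMS", "CDM_CONNECTION_STRING", "CDM_SCHEMA", "RESULTS_DIR")


def load_site_config(env=None):
    """Collect the site-specific settings, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return {name.lower(): env[name] for name in REQUIRED_SETTINGS}


# Example with an explicit mapping instead of the real process environment.
example_config = load_site_config({
    "CDM_DBMS": "postgresql",
    "CDM_CONNECTION_STRING": "jdbc:postgresql://localhost:5432/cdm",
    "CDM_SCHEMA": "cdm",
    "RESULTS_DIR": "/results",
})
print(example_config)
```

A site would then supply these values when starting the container, for example with `docker run -e CDM_SCHEMA=cdm ...`, and nothing inside the image would need to change between sites.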

One potential obstacle might be R licensing: network sites might be executing R from within an image but without a local license. Is this moot since R itself, and most R libraries, are open source? RStudio is separately licensed but doesn’t seem necessary for scripted study package execution.

Adam_Black · August 10, 2022, 7:20pm

Thanks for participating in the discussion @jmethot!

There are different ideas in the community about how to use Docker. Odysseus maintains an R execution environment Docker image freely available to anyone with all the components I described except for the package cellar (because we are still trying to get consensus on what should be in a Hades Distribution). https://hub.docker.com/r/odysseusinc/r-env

Some people have experimented with study specific Docker images. Here is an example.

Some organizations can’t use Docker at all for security reasons. Some organizations don’t want to be downloading a new Docker image for every study but would install and update docker images once per release cycle. Other organizations and people prefer the entire study to be Dockerized. That is my understanding of the current state of affairs around Docker use. If we get consensus on a standard for the execution environment I think the community would support multiple implementations using docker, or cloud formation templates, or VMs that would make it easy for sites to install and use.