
Strategus design discussion

Thanks for the feedback, Christian.

you may show real-life examples of result schemas and their versions and how they differ and break.

This is a good point and will be helpful for developers implementing the package.

For example, you could change the result schema for an incidence report from generic to specifying the time interval, e.g. monthly. That breaks the format.

In this context it is up to the developer to make sure conversions are compatible and design any data transformations. This will always be a limitation and requires careful consideration when making design choices.

Or, the incidence calculator changes the way it handles the numerator or denominator

If there are fundamental changes to the output of the source package, this does create problems that may not be possible to overcome. I fully expect there will still be situations where results in a given format are no longer compatible. The only way to mitigate this is to make careful design choices such that calculations are performed from data that is never removed.

However, it is assumed that this package will never have access to the original source patient-level data. In this context I will try to think of ways to declare that a given results model is deprecated and can never be upgraded. This would still be a big improvement: in CohortDiagnostics, for example, it is simply assumed that previous versions are no longer compatible, even though they may still work.
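To sketch what "deprecated and can never be upgraded" could look like in practice, here is a minimal, hypothetical R helper (the function name and the deprecation list are invented for illustration; this is not part of ResultModelManager):

```r
# Hypothetical helper: a results model version is upgradable only if it
# is not on an explicit deprecation list and does not come after the
# target version. Version strings are compared with base R's
# package_version().
isUpgradable <- function(fromVersion, toVersion,
                         deprecated = c("0.1.0", "0.2.0")) {
  if (fromVersion %in% deprecated) {
    return(FALSE)  # schema predates migration support; never upgradable
  }
  package_version(fromVersion) <= package_version(toVersion)
}

isUpgradable("0.1.0", "1.0.0")  # FALSE: deprecated, can never be upgraded
isUpgradable("0.5.0", "1.0.0")  # TRUE: a migration path can be attempted
```

The point of the explicit list is that a consumer of old results gets a clear "this can never be upgraded" answer instead of a silent failure.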

Who is “azimov”, or why is this not part of the Strategus overall application?

Azimov is my personal GitHub… I didn’t want to create an OHDSI package until the initial specifications had been created.

(really Christian’s odious grammar pickiness: plural of composite nouns is formed at the end. So, it should really be called “ResultModelManager”, even if there are many results)

This is correct; the final OHDSI package will fix this (if we keep the same name).

Thank you all for the synchronous discussion today and for the posts on this thread. I found it helpful to have a chance to discuss the design, debate some of the tradeoffs that are being made, and listen to the perspectives of others in the community. I’ll paraphrase one of my favorite exchanges.

“OHDSI should define the interfaces and be supportive of any implementation” (i.e. Just as it does not matter what software you used to map source data to the CDM it shouldn’t really matter what software you used to map CDM data to results as long as it conforms to some set of standards.)

“Good interfaces are developed in tandem with their implementation so OHDSI should be creating the official implementation.”

I agree with both and I think those perspectives highlight an important tension in OHDSI.

I don’t want to stand in the way of Strategus development and will wholeheartedly support the design decisions of the other developers if there is general consensus. However, I do feel strongly that Strategus should be one way of implementing an OHDSI network study and not the only way to implement an OHDSI network study.

There were two definitions that I tried to draw attention to today:

An execution environment is a virtual machine or docker container with R, Rtools, python, (possibly anaconda), Java installed and an R package cellar that contains the tar.gz files for all Hades R package dependencies.

An OHDSI distribution is a complete versioned set of OHDSI software needed to run OHDSI studies.

Thanks for a great discussion. I’ll keep working on a spec and try to capture the requirements discussed today.

Thanks @Adam_Black ! I too found it a good discussion, with lots of food for thought. To be clear: I do understand your concerns about the extra layer of complexity we’re introducing. I just don’t understand what the alternative is.

Something I think is important to mention is that the current approach to modularization isn’t just about standardizing the input. I agree we could have a more generic approach to creating inputs, maybe even having a 1-to-1 relationship between the JSON and R function calls (which is what we already have in many modules). But it is also about standardizing the output, and specifically about generating output that fits in a relational database. The output of most R functions does not fit in a relational database. So Strategus isn’t just about orchestrating the execution of various R functions in HADES packages, it is also about building a coherent and consistent evidence base from the outputs.
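As a toy illustration of that last point (all column names are invented, not the actual Strategus result model): the nested output of an R analysis gets flattened into a plain table that can live in a relational database and travel as a CSV file.

```r
# Toy example: flatten a nested R result into a relational table.
# Column names are illustrative only, not the real Strategus schema.
fit <- list(estimate = 1.8, ci = list(lb = 1.2, ub = 2.7))

results <- data.frame(
  analysis_id = 1L,
  estimate    = fit$estimate,
  ci_95_lb    = fit$ci$lb,
  ci_95_ub    = fit$ci$ub
)

# CSV as the exchange format keeps results human-reviewable before sharing
path <- file.path(tempdir(), "cm_result.csv")
write.csv(results, path, row.names = FALSE)
```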


I think output standardization will be an important contribution. Could this be thought of as an OHDSI results common data model? I’m somewhat aware of the common evidence model but don’t know much of the details.

My primary objection to the Strategus design is that it allows for conflicting dependencies within a single study, which is an unnecessary complexity. The alternative to one renv lock file per module (Option 1) is to use both Options 2 and 3. I’ll try to explain.

Suppose today we deem the current versions of all Hades packages to be Hades v1.0. We set up RStudio Package Manager and load it with all the packages needed to use all of Hades. We create a static/frozen url that will forever point to the complete set of packages included in Hades v1.0 and all the required dependent packages.

If all our packages and dependencies were in CRAN (which would be ideal in my opinion) we could use this URL https://packagemanager.rstudio.com/all/2022-08-09+Y3JhbiwyOjQ1MjYyMTU7RERFQkQ0Qjg
(@Chris_Knoll Just treat the date in the URL as an arbitrary character string :slight_smile: ) If we can’t get them on CRAN I’d suggest setting up an OHDSI instance of RStudio Package Manager.

When you create a study you have the option of building your study off of the frozen/static repository URL or the “latest” repository URL.

Open sites with access to the package manager should be able to run any study built off of the “latest” repository URL. Closed sites can only run studies built off of one of the previous “frozen” URLs.

When a closed site “installs” OHDSI, it creates an execution environment, which could be a Docker container or Linux VM, containing, among other things, a package cellar. The cellar contains the tarballs of all versions of all R packages in the package manager up to the latest distribution and thus acts as a local copy of the entire package manager.
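For what it’s worth, renv already has a hook for exactly this kind of cellar; a minimal sketch, assuming a hypothetical cellar path:

```r
# Sketch: pointing renv at a local package cellar (the path is
# hypothetical). When RENV_PATHS_CELLAR is set, renv::restore() looks
# for tarballs such as <cellar>/CohortMethod_4.1.0.tar.gz before trying
# any remote repository, so a closed site can restore a study's lock
# file entirely offline.
Sys.setenv(RENV_PATHS_CELLAR = "/opt/ohdsi/cellar")
```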

My suggestion would be to make use of ExecutionEngine, which provides a REST endpoint for R or SQL code to be executed inside the execution environment. Regardless, the study execution consists of just three steps:

  1. Create a new R session and call renv::restore with the study lock file
  2. Run the R code (Could be any R code)
  3. Shutdown the R session
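Those three steps could be sketched roughly as follows, using `callr` to get a fresh R session (the directory layout and script name are hypothetical):

```r
# Sketch of the three-step execution, assuming the study folder contains
# an renv.lock file and a Study.R script (names are hypothetical).
runStudy <- function(studyDir) {
  callr::r(                      # step 1: spawn a brand-new R session
    function(dir) {
      # restore the exact package versions pinned in the study lock file
      renv::restore(lockfile = file.path(dir, "renv.lock"), prompt = FALSE)
      # step 2: run the study code (could be any R code)
      source(file.path(dir, "Study.R"))
    },
    args = list(dir = studyDir)
  )
  # step 3: the child session shuts down when callr::r() returns
}
```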

An outstanding question in my mind is whether you are planning to integrate Strategus into Atlas/WebAPI or to create a new UI.

I think the most important contribution of Strategus will be the standardization of outputs, which everyone who wants to implement a study can use as guideposts, analogous to the role of the CDM spec in ETL.

More generally, maybe there is an opportunity to define the minimum requirements of an OHDSI study that are agnostic to implementation and allow us to compute on study specifications and results. Harold Lehmann presented some work on this at a previous Hades call.

I’d recommend against this: it leads to what is called a ‘leaky abstraction’. I don’t think we’d want to see JSON properties that are associated with calls to getDbCohortMethodData() or createStudyPopulation(). These are implementation-specific details, and by standardizing only on the inputs/outputs of an overall unit of work, you shield users from the lower-level details, allowing you to change the implementation without breaking the standard (change as in: you can refactor code and clean up your methods without exposing those changes to the JSON specification).

The reason why Strategus modules are so closely mapped to the JSON specifications we’ve built so far is that the responsibility of the module is to map the JSON payload to the set of function calls that are needed from the underlying implementation. Circe is a good example of this in action: while the only public implementation is a spec-to-SQL implementation, there’s no reason why a spec-to-document-store implementation couldn’t translate the cohort expression specs into document-store queries. These Strategus modules are currently spec-to-HADES-packages, but there’s no reason why they couldn’t be implemented as spec-to-SAS-packages. I don’t see much value in HADES trying to find alternative implementations for these specifications, but it does introduce the possibility for other communities to engage with (compete with?) HADES implementations.

I don’t think we can be that ignorant about how this is functioning: being a time-based value (the RStudio Package Manager devs even describe it as a ‘snapshot’), the progression between releases is linear. How would you support the following:

                 o -- Hades 1.1--- Hades 1.2             o - Hades 3.1
                /                                       /
Hades 1.0 ---- o ---- .Hades 2.0 -- o ---- Hades 3.0 - o --- Hades LATEST
                                     \
                                      o --- Hades 2.1 --- Hades 2.2

In case you’re wondering, this is based off the release branching of WebAPI. Hades 1.2 may have been released after 2.0: there may be some need to stick to the 1.x line, and it was released after 2.0 was complete for those who were unable to shift to the 2.x line for some reason (hint: andromeda, hint: study packages changing to renv). renv lock files solve this problem because an renv lock file contains the specific versions of things, regardless of the time interval in which they were released. The primary reason I dislike CRAN’s timestamp-based approach is that we have no control over what’s in CRAN at a given date, so we’re forced to pick up all other version updates when we just want to grab a new one that was released on date X. If OHDSI owned its own RStudio Package Manager instance, we could control all the versions of the libraries, but I still don’t see how we could do ‘side-releases’ (like HADES 1.2 alongside HADES 2.1).

Thanks again for the great conversation, and I hope we can come to a decision that everyone will appreciate.


What I’m proposing would not support this (as a feature, not a bug). It would only support monotonic versioning for Hades R packages. I feel like this is trying to make R work like Java (renv.lock ~ pom.xml). Non-monotonic versioning is a complexity we don’t need to introduce. In your example there are three different versions of Hades (~24+ R packages) being actively maintained, so the current maintenance burden is multiplied by three. Do we really need that?

I’ve been following this discussion closely. Thank you for having it “out in the open”.

I’m not an R user or expert. I am a leader in an Informatics department at a healthcare institution that is just embarking on the OHDSI journey. I hope in a year or two our department is being called on to support numerous investigators participating in OHDSI network studies in oncology.

From that perspective this discussion seems to assume network participants will manage the byzantine R ecosystem including R versions, CRAN libraries and access, etc. I realize one goal of this discussion is to simplify that.

@Adam_Black wrote:

An execution environment is a virtual machine or docker container with R, Rtools, python, (possibly anaconda), Java installed and an R package cellar that contains the tar.gz files for all Hades R package dependencies.

I’m inferring a comma before the “and” and that the expectation is the local site will configure the proper R environment. If the intent is that the docker image contains the study-specific R environment then I’m happy and you can stop reading.

If not, I propose it should. The ultimate simplification is to package both the execution environment and the specific R environment into a single docker image per network study (which obviously might be versioned). The docker image would be parameterized with the CDM connection details and whatever else is necessary to execute against the local CDM and store/return results. The Strategus project would create packaging tools (or at least instructions) for study designers to produce the docker image, and a docker repository to manage and serve the study images.
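A rough sketch of what that parameterization could look like from inside the image (the environment variable names are invented, not any existing standard); the resulting list would ultimately feed something like `DatabaseConnector::createConnectionDetails()`:

```r
# Sketch: read CDM connection details passed into the container as
# environment variables at `docker run` time. Variable names and
# defaults are hypothetical, for illustration only.
cdmSettings <- list(
  dbms     = Sys.getenv("CDM_DBMS",     "postgresql"),
  server   = Sys.getenv("CDM_SERVER",   "localhost/cdm"),
  user     = Sys.getenv("CDM_USER",     "ohdsi"),
  password = Sys.getenv("CDM_PASSWORD", "")
)
# These values would then be handed to DatabaseConnector to open the
# connection and to direct where results are stored/returned.
```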

This would permit unlimited network studies to proceed in parallel without anyone but the study designer(s) thinking about R versions and libraries, including HADES. It has the added benefit of permitting non-R objects in the network study docker image, such as shell or Python scripts, reference data, etc.

One potential obstacle might be R licensing: network sites might be executing R from within an image but without a local license. Is this moot since R itself, and most R libraries, are open source? RStudio is separately licensed but doesn’t seem necessary for scripted study package execution.


Thanks for participating in the discussion @jmethot!

There are different ideas in the community about how to use Docker. Odysseus maintains an R execution environment Docker image freely available to anyone with all the components I described except for the package cellar (because we are still trying to get consensus on what should be in a Hades Distribution). Docker Hub

Some people have experimented with study specific Docker images. Here is an example.

Some organizations can’t use Docker at all for security reasons. Some organizations don’t want to be downloading a new Docker image for every study but would install and update docker images once per release cycle. Other organizations and people prefer the entire study to be Dockerized. That is my understanding of the current state of affairs around Docker use. If we get consensus on a standard for the execution environment I think the community would support multiple implementations using docker, or cloud formation templates, or VMs that would make it easy for sites to install and use.

Although I don’t know how relevant it is to the current debate, please let me share the recent challenges I’ve had.

The HIRA (sort of the CMS of Korea) recently announced that it would let researchers use HIRA’s data for COVID-19 research. For this project, we had to use HIRA’s infrastructure based on Cloudera Data Science Workbench (CDSW) without internet access.

We had challenges below:

  1. The current CDSW in HIRA provides only R 3.5.1 without internet access, and it’s not possible to upgrade R (current HADES basically does not support R version 3; later, HIRA’s CDSW will be upgraded to support R version 4).
  2. We can build a custom Docker image for this project in the CDSW environment (Create a Dockerfile for the New Custom Image), but HIRA doesn’t want to use a new Docker image for every study. HIRA will install only one Docker image for the whole project.

So Chungsoo Kim made the Docker image like below:

We had to find the versions compatible with R 3.5.1 in the OHDSI ecosystem. On top of this Docker image, each study package will be installed and executed.

I think that Strategus is not just about versioning. That’s why I mentioned:

Although I don’t know how relevant it is to the current debate

And I feel like Strategus can be complementary with Docker. I hope the recent challenges I’ve shared can be helpful for further debate.


Thanks all for a great discussion so far!

To hopefully help the discussion, I’d like to distinguish between 3 topics:

  1. Module inputs and outputs
  2. Execution environment specification level: module, study, HADES snapshot
  3. Nature of execution environment: renv lock file vs. Docker

1. Module inputs and outputs

I think inputs should be fully defined, so not something as generic as

  • Function name (e.g. `matchOnPs`)
  • Arguments (e.g. `maxRatio = 100`)

But fully detailing the valid keys and values, so whoever develops the editor doesn’t need deep knowledge of the HADES packages. The input can be a complex tree structure. I created an example JSON Schema here, which I think we can extend to all modules.
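To make that concrete, here is a hypothetical fragment of such a fully specified input (the keys are invented for illustration and are not the actual schema), parsed with `jsonlite`:

```r
# Hypothetical module input: every valid key and value is spelled out in
# the specification, rather than exposing raw function/argument pairs.
# Key names below are illustrative only, not the real Strategus schema.
spec <- jsonlite::fromJSON('{
  "module": "CohortMethodModule",
  "settings": {
    "psMatching": { "maxRatio": 100, "caliper": 0.2 }
  }
}', simplifyVector = FALSE)

spec$settings$psMatching$maxRatio  # 100
```

An editor (or a JSON Schema validator) can then check every key and value against the schema without any knowledge of the underlying HADES functions.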

The outputs should fit in a relational database, but the exchange format is CSV files, because those are human-reviewable to make sure no sensitive data is being shared. An example data output format is here.

One thing we haven’t discussed much is dependencies between modules. For example, most modules require cohorts to be generated first. We currently capture this in the module meta-data, e.g. here, which I prefer over capturing it in the input specifications (where users can make errors).

2. Execution environment specification level

We currently have this implemented at the module level: each module carries an renv.lock file. This provides a lot of flexibility while also providing isolation (I think someone used the term ‘separation of concerns’, which I like). The downside is increased complexity in another way, as @adam_black pointed out, because we introduce a new type of module into the R world.

I’m starting to warm up to @adam_black’s idea of a hybrid between study level and HADES snapshots: Most studies would use specific HADES snapshots (e.g. created twice per year), but if someone needed the latest version of a module you could include a study-specific renv lock file, and sites could decide whether they want to run this or not. This has the advantage that we know for sure the modules work together nicely. It has the disadvantage that it does not have the aforementioned isolation. It also provides less flexibility: if you want to use a new version of one module, you are forced to also use the latest version of other modules in your study. I’m not sure that is necessarily a bad thing.

3. Nature of the execution environment

We’re currently focused on renv lock files to specify the execution environment. These are not ideal, as they cannot enforce the R version or capture other dependencies such as Java and Python. They do have the advantage that they work inside R, so they are much easier to deploy than Docker images. My non-representative survey showed 11% of OHDSI sites cannot use Docker. (My site is one of those, so I’ve never had the pleasure of toying with Docker, which may bias my perception.) A Docker image per study seems problematic because of security reasons and the sheer size of these containers (I think), but @adam_black’s hybrid approach may alleviate those concerns.


Regarding the execution environment specification level:

I set up RStudio Package Manager and loaded Hades and all Hades dependencies (~175 packages total) into it.

This URL describes how to use it.
http://159.223.131.237:4242/client/#/repos/1/overview

It’s a free trial for about a month to see if people in the community find it useful. The Hades packages are automatically updated. The CRAN packages are frozen and updated manually (or could be updated on a schedule using cron).

The goal here is to create a Hades distribution and make it easy to install the entire set of Hades packages and dependencies included in a distribution. It is possible to add pre-compiled binary Hades packages to the repo to speed up installation, but I have not done that yet.


Hi Adam, is it like MRAN (the CRAN time machine) for HADES packages? That’s very cool.

I usually used Docker to install Hades packages across various R versions (3.x/4.x) and internet environments (online/offline), but due to the flexible R ecosystem and numerous dependencies, it was difficult to freeze past packages. This will definitely help.

Cool!

After yet another bad experience with CRAN (I’m unable to get EvidenceSynthesis back on CRAN; it was removed because Cyclops was removed, because a new version of Rcpp has a memory leak on some esoteric operating system), I’m wondering if we should switch to this solution instead.

What would be the cost if we wanted to keep running this?

Hi @Adam_Black I sent you login credentials for the existing OHDSI repo.ohdsi.org Nexus repo, which supports caching R packages, as we previously discussed. Did you have a chance to try it out?

How does the paid RStudio Package Manager solution compare in terms of functionality with the free OHDSI Nexus repo functionality? Are there significant advantages?

Here is a link to the Nexus repo docs on support for R packages

If this mechanism of hosting R packages can support multiple versions in the same R package manager, then that would be a huge benefit…

The cost for a single package repository is $5,000/yr (euro and dollar are about equal), plus the server hosting, which is about $12/month on Digital Ocean and includes 2 TB of data transfer per month, which seems like plenty. I think it would simplify the builds of Odysseus’ execution engine Docker container and @lee_evans’s Broadsea-hades container, and would be helpful for the use of Hades in Darwin. That’s to say, I see many stakeholders who would benefit. There could be a significant licensing discount if this were used for teaching (Roux) or academic research, or purchased by a non-profit org.

I think the OHDSI R packages need to be in some repo, and it would be nice to stabilize the full set so that if we are “installing OHDSI/Hades” at a new offline site we know exactly what to install. `devtools::install_github("ohdsi/Hades")` has never worked very well for me and will install the latest version of all packages, so each person gets a different install depending on when they run it.

Yes it is like a time machine where we can create static snapshots of all Hades packages and their dependencies just like MRAN.

Yea, it seems like Hades will never be on CRAN at this point. The number of packages is growing. More packages depend on CirceR, which contains ~44 MB of compiled Java code, way too big for CRAN but fine for this package manager.

Thanks Lee! I have not had a chance to try out nexus. That would be one free alternative to consider. The advantages of the RStudio package manager are:

  • Easy updates of subsets of CRAN and GitHub. GitHub is 100% automated, and CRAN updates can be scheduled or done manually.
  • Create static URLs that will always point to the same set of packages (a distribution = a static URL)
  • Host pre-compiled binary versions of packages, making installs significantly faster on Mac, Windows, and Linux. (I have not set this up for GitHub packages yet.)
  • Download statistics (Nexus might provide this too, but I’m not sure)
  • Automatic discovery of all the system requirements for the Hades stack (see Install System Prerequisites at http://159.223.131.237:4242/client/#/repos/1/overview)

Yes, it definitely supports multiple versions of each package. You can request a specific version of a single package:

options(repos = c(ohdsi = "http://159.223.131.237:4242/ohdsi/__linux__/jammy/latest"))
remotes::install_version("CohortMethod", "4.1.0")

I’m still experimenting with it and I encourage anyone in the community to try it as well.

Sooooo… not to sidetrack this convo, I’d just like to touch on that example (so that it might not be used against me in the future): the 44 MB comes from a JavaScript emulator that’s referenced in order to do some kind of npm package execution inside the JVM. I’d like to completely remove this behavior from CirceR so that we can rid ourselves of that dependency.

I return you now to your regularly scheduled Strategus design discussion…


Hey @Adam_Black ,

I believe that HADES:PLP uses some Python modules.
RStudio Package Manager also keeps track of Python modules.
Can this feature be used to also keep track of the Python modules used in HADES:PLP?
Or is this not necessary?

(We are also considering RStudio Package Manager for our organisation, and are trying to understand how it handles Python modules.)

I have made some large steps with a ResultModelManager package here and would appreciate more eyes to review this:

For more information on this package I wrote a short project spec here Project specifications · OHDSI/ResultModelManager Wiki · GitHub

There is also some documentation, but I’m updating that today to provide more concrete examples. I have also made this PR (Database migration utilty by azimov · Pull Request #853 · OHDSI/CohortDiagnostics · GitHub) in CohortDiagnostics that gives a fully working example of handling migrations on SQLite and PostgreSQL (with some platform-specific SQLite migrations).

One thing to note is that, for Strategus modules, support for flat files is included. I am somewhat hesitant about this approach, however, as I don’t feel it will provide a great solution for integration testing on different platforms.

I don’t want to derail this thread too much, so please try to post any issues or design changes on the ResultModelManager tracker on GitHub.

Hi Javier,

The package manager can host Python packages in a repo separate from the R packages (so multiple repos are required). This would be helpful if you wanted an isolated package manager instance inside your organization, but I don’t think it is necessary for a public Hades repo because I think all the Python packages are on pip. The issue is that not all Hades R packages are on CRAN or in any other publicly available package manager.

I think the demo I set up is expiring soon. If anyone wants to fund it, I’d provide the admin support if needed. On the other hand, there might be free alternatives to try out, like Nexus.
