
Component re-usability

I would like to propose that the OHDSI community discuss and establish some basic coding principles and best practices. It is very important that the R code produced today is written and built with re-use in mind. This is especially true for the many R-based analytical methods that are built to be distributed across multiple organizations.

I am not an expert in R, but based on some quick reading I have done - all of the below is applicable in this language as well. Here are some ideas - please share your thoughts:

  1. PARAMETERIZED MODULES: Today, in many instances, instead of modules having a general entry point function with input/output parameters, it is assumed that someone will edit the “main” function and adjust the actual code in the module itself. Best practice should be to wrap the module in a “run” function that has a number of input and output parameters. Then write a separate, environment-specific module that references the re-usable component and passes instance-specific arguments to that function.

  2. SELF-CONTAINED: If the module is expected to be distributed, use packrat to package the re-usable part into a single distributable. This ensures that the proper dependencies are bundled rather than dynamically resolved (which should be used in development environments). Then use #1 to invoke it.

  3. VERSIONED: Version packaged modules using versioning best practices, e.g. a release.major.minor stamp.

  4. ATOMIC: In code that is being distributed, treat execution as a single atomic transaction and do not assume or rely on caching that might not exist in another environment. Clean up temp objects after execution is completed to ensure the code can be run multiple times.

  5. EXCEPTION HANDLING: Do proper exception handling with tryCatch. Check business conditions and raise business-meaningful exceptions instead of letting system-generated exceptions surface when the code accesses objects that failed to be created. If an exception occurs, terminate gracefully and clean up any temp objects that might have been created.
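
Points 1, 4, and 5 above can be sketched together in a few lines of R. This is only a minimal illustration under assumptions, not code from any OHDSI package: `runAnalysis`, its arguments, and the `analysisError` condition class are hypothetical names chosen for the example, and the "analysis" itself is just a row count.

```r
# Hypothetical re-usable entry point: everything site-specific arrives as an
# argument, so the module itself never needs to be edited.
runAnalysis <- function(inputData, outputFolder, minRows = 1) {
  # Scratch space for this run only. on.exit() removes it whether the function
  # returns normally or fails, so repeated runs always start clean.
  tempFolder <- tempfile(pattern = "analysis_")
  dir.create(tempFolder)
  on.exit(unlink(tempFolder, recursive = TRUE), add = TRUE)

  # Check business conditions up front and raise a meaningful, typed error.
  if (!is.data.frame(inputData) || nrow(inputData) < minRows) {
    msg <- sprintf("Input must be a data frame with at least %d row(s).", minRows)
    stop(structure(
      list(message = msg, call = NULL),
      class = c("analysisError", "error", "condition")
    ))
  }

  tryCatch({
    # Intermediate objects live in the scratch folder, not in a shared cache.
    intermediateFile <- file.path(tempFolder, "intermediate.rds")
    saveRDS(inputData, intermediateFile)
    result <- data.frame(rowCount = nrow(readRDS(intermediateFile)))

    if (!dir.exists(outputFolder)) dir.create(outputFolder, recursive = TRUE)
    resultFile <- file.path(outputFolder, "result.csv")
    write.csv(result, resultFile, row.names = FALSE)
    resultFile
  }, error = function(e) {
    # Translate a low-level failure into a message the user can act on.
    stop("Analysis failed while producing results: ", conditionMessage(e),
         call. = FALSE)
  })
}

# Environment-specific script: only this part changes between sites.
# The input data and folder name are illustrative values.
resultFile <- runAnalysis(
  inputData = data.frame(personId = 1:3),
  outputFolder = file.path(tempdir(), "site_a_output")
)
```

A caller that wants to react to the business error specifically can use `tryCatch(..., analysisError = function(e) ...)`, since the condition carries its own class.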


@gregk:

Isn’t this the same discussion as here?

Thinking about it: Probably would make sense to start a Workgroup on this subject. Maybe this is already the Architecture Workgroup. If I went there I would know, but I really am running out of time participating in all the OHDSI activities.

Can you guys figure it out? @Frank, @Chris_Knoll, @Gowtham_Rao, @Vojtech_Huser, @Ajit_Londhe? Who else?

@Christian_Reich

It is related but definitely not the same. Both relate to best practices and standards. However, the link you provided discusses how we organize design and code into packages during exchanges between sites. This thread is about how to code in such a way that the result is actually re-usable and works with minimal changes, or works at all without lengthy debugging and modification.

My thoughts:

  1. PARAMETERIZED MODULES:

The Methods Library is a set of R packages that contain functions. Each function has the properties you mention, with a clear in- and output. Researchers can re-use these functions to perform a study with a few R commands.

More and more people are now calling for some sort of meta-function that takes the specifications of a study as input, executes the whole study, and spits out all the study artifacts. I’m happy for people to create such meta-functions, but I believe there are some large limitations with this approach. The original choice to create the Methods Library as a set of separate functions was deliberate. In my experience, most studies are unique in some way, and if we had a single meta-function, we would still need to modify it for almost every study, defeating its purpose. You simply cannot avoid the need to learn how to write R to implement OHDSI estimation studies.

  2. SELF-CONTAINED.

I have played around with packrat, and I agree it has many nice properties. However, I do see one very practical problem: for a typical OHDSI study package, the packrat library will be around 500MB. GitHub may not appreciate us having many repos with many GBs of data. For now, I’m using the less-than-optimal approach of simply recording the versions of all packages needed to run the study, as for example done in this study.
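
The version-recording approach could look roughly like the sketch below. It only illustrates the idea and is not the code used in the study mentioned above: `recordPackageVersions` and `findVersionMismatches` are hypothetical helpers, and the base packages `stats` and `utils` stand in for a study's real dependencies.

```r
# Hypothetical helper: write the installed version of each required package
# to a small CSV that can be committed instead of a full packrat library.
recordPackageVersions <- function(packages, file) {
  versions <- data.frame(
    package = packages,
    version = vapply(packages,
                     function(p) as.character(utils::packageVersion(p)),
                     character(1)),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
  write.csv(versions, file, row.names = FALSE)
  invisible(versions)
}

# Hypothetical check at another site: return the names of packages whose
# installed version differs from the recorded one.
findVersionMismatches <- function(file) {
  # Read versions as text so that e.g. "1.10" is not parsed as the number 1.1.
  recorded <- read.csv(file, colClasses = "character")
  installed <- vapply(recorded$package,
                      function(p) as.character(utils::packageVersion(p)),
                      character(1))
  recorded$package[installed != recorded$version]
}

versionFile <- file.path(tempdir(), "packageVersions.csv")
recordPackageVersions(c("stats", "utils"), versionFile)
mismatches <- findVersionMismatches(versionFile)
```

The file stays a few hundred bytes, at the cost that the recorded versions must still be installed by hand at each site.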

  3. VERSIONED:

The OHDSI Methods Library packages already follow versioning best practices in my humble opinion.

  4. ATOMIC:

I’m not sure what the issue is here.

  5. EXCEPTION HANDLING

I think this relates to issue 1, and the desire to have meta-functions.
