Standardized analysis folder structure

pavgra · January 18, 2018, 12:51pm

Today we have multiple analysis types available in Atlas: Cohort Characterization (Heracles), Incidence Rates, Population Level Effect Estimation, Patient Level Prediction. One of the core features of Atlas is that it can generate code for the created analyses and allow users to execute them externally.

But the first problem is that code export tabs produce not self-sufficient code (packs). E.g. PLE generates just one R code file, which relies on the assumption, that in DB, which I am going to use, should exist three pre-generated cohorts (target, comparator, outcome). Otherwise it will not work. So, I have to export those cohorts first, execute them manually and only then I will be able to run the PLE code successfully. Which, for me, seems to be not the right way of how export feature should work. If I export something, I should be able to run it anywhere without (or at least with minimum amount of) any pre-conditions fullfilled. That’s why in the case, I would suggest to export three cohorts together with the main R code file and spin them off from the R. (Note one: self-sufficient exported code pack. Note two: many source files)

Next, after we exported several code files, executed them, following happens: hundreds of RDSes, couple of images, may be something else appeared in my analysis root folder. So absolute mess. Which are my code files, what was generated as result? Hard to answer. (Note three: many results files)

Ok, let’s go further: assume that I am personally not a data scientist or I just want to check summary of results without exploring whole results pack and spending a lot of time. Seems like having a kind of summary file, which could have brief results of executed analysis, would be useful. Moreover, it would be good, if it was more-or-less standardized, in format of json or csv. Then it would be possible to automatically parse it for further more UI friendly representation in web / pdf / etc. (Note four: lack of machine-readable results summary)

The last for now, but not the least, it would be also helpful to have some description attached to an exported analysis: what is the code doing, what are expected results, etc. (Note five: lack of docs)

So, having said these all, I would like to propose and discuss with OHDSI participants conventions for analyses organization and their folder structure. After experiencing the problems above, I’ve researched some existing approaches in data science field (few links as examples:
https://drivendata.github.io/cookiecutter-data-science/#directory-structure , http://projecttemplate.net/getting_started.html, https://medium.com/human-in-a-machine-world/folder-structure-for-data-analysis-62a84949a6ce), looked through analysis types, which OHDSI currently have and checked how they would fit the structure, and feel that following folder structure could be used as starting point:

/docs
Description of what these all is about, expected inputs and outputs
/libs
Libs or Packrat bundles used in the analysis and snapshotted to provide reproducibility
/src
R / SQL / Pyhton files
/data
Input JSONs / CSVs with params (e.g. JSONS for PLE and PLP, or Cohort exported from Atlas)
/results
All files generated during analysis execution go here
/results/summary
Machine-readable JSONs and CSVs with main results
/results/reports
E.g. reports generated based on Heracles stats

Such pack should be self-sufficient and do not depend on state of used DB. Also it should be reproducible and be as pure (in terms of functional programming) as possible.

Would be grateful for your opinion and criticisms.

Patrick_Ryan · January 18, 2018, 1:13pm

Thanks @pavgra for raising this topic. I agree that establishing
community conventions for the organization of an analysis package to enable
appropriate management of documentation, source code, and related objects
would great advance design and execution activities, and an organization of
the results artifacts would streamline review and interpretation. I think
this specifically represents the point where we need to more closely
integrate the ‘back-end’ methods library best practices, that @schuemie,
@msuchard, @Rijnbeek, and @jennareps have established for population-level
estimation and patient-level prediction, with the ‘front end’ UI experience
in ATLAS that @anthonysena and @Chris_Knoll have developed. I know @Frank
and @gregk have introduced this topic on Architecture workgroup calls
recently, and think it’s a really good idea to make this a key 2018
objective. At a 100,000-ft view, it seems we want an analysis object to be
fully-self contained and completely portable and independently executable,
which means we’d like to efficiently create something that can manifest as
a stand-alone R package without any external dependencies. Likely separate
from that, we also want some results object specification that allows for a
portable transfer of aggregate results (in the form of data objects and
images of graphs), but with some explicit reference (or embedding) of the
analysis object that gave rise to the results object. If these objects had
complete specifications, I can imagine that conducting network analyses
across institutions, hopefully through an open-source platform like
ARACHNE, would be much more manageable and efficient. I’m happy to defer
to the other experts tagged on this note and others, but I share the desire
to see something move forward.

schuemie · January 18, 2018, 1:42pm

I agree, an important topic!

We already have some structure we have used in the past, mostly dictated by R and the structure of R packages. For example, recent studies such as this one used the following structure:

/man for reference documentation
README for introduction and quick start guide
/inst for necessary input data, such as cohort definitions and analyses specifications
/R for the R code

I think we should move to packrat, so that would add a packrat folder:

/packrat

The results are written to a user-specified folder (can’t have them in the R package), and contains only summary results, so not the hundreds of intermediary files:

[user folder]/export

YuriK · January 19, 2018, 8:30pm

The idea to create some structure looks very attractive as for me. At least it will simplify code and results distribution for OHDSI network studies. For now, there is a strong overlap between the ideas of @pavgra with existing AlendronateVsRaloxifene study.

Similar folders are:
/docs corresponds to /man
/src corresponds /R and /inst
/data corresponds to /inst
As for me we definitely should reserve some folder for third-party libraries and packages. So having folder called /libs is a good idea.
Also, it makes a lot of sense to move packrat to separate folder. It will be very intuitive and easier to maintain.
Speaking about a folder with results, I am not sure whether it is possible to structure it for all possible studies in some common way. So I will better leave some flexibility to the user at this point.

BR,

Yuriy

gregk · February 2, 2018, 11:05am

here are some thoughts out loud, probably a bit broader than the original post intended, for the purpose to describe the big picture a bit more.

GOAL: We need to create a standard folder/package structure for the purpose of exchange analysis designs, code and receive results of the execution back, as package. The idea is to enable a pluggable architecture and automatic and consistent analysis processing and execution.
With that said, I would start from a high level package structure:

/meta
/design
/code
/results
/doc
META: this folder could contain key metadata information about package content
DESIGN: The analyses could be of many types - including cohorts (counts, characterization), patient level prediction, population effect estimation, incidence rates, TxPathways etc… I would propose to agree a common, platform-independent standard definition on how we describe the design of those. JSON sounds like a good format. Each JSON definition, in addition to describing the actual design (self-contained), must also include some additional metadata fields e.g. analysis type, author, GUID, etc…

This design can be then placed /design folder. It can be used in multiple ways:

use it to import it into another tool
or use it to populate arguments in the code that must be executed

again, this is a much broader discussion and could warrant its own WG

CODE : the code can also be of many different types - SQL, R, Python, Java. Thus, the sub-structure of the code below “code” folder would depend on what type that is. Regardless, we should think about how to adopt a consistent structure within those types.
RESULTS : will contain the output results of the code execution. Will need to agree on what sub-structure of this folder will look like for each analysis type.
DOC : contains a human readable information on what needs to be done if the package needs to be executed manually

There are many more details we need to discuss for each one of “folders”, but we should start here

Christian_Reich · February 2, 2018, 2:32pm

Should we call it ENVELOPE? So it’s clear that it is visible to the outside? The inside is also visible, but we may have encrypted versions for private network traffic in the future.

I like this idea.

gregk · February 2, 2018, 2:43pm

I like this idea of using it to specify access controls to different folders and data. ENVELOPE could be an interesting idea.

And I think the content of that will change depending on “SENDING IT FOR EXECUTION” or “RECEIVING RESPONSE”

Also, we also need to be conscious this structure should support the manual execution use case e.g. someone who has not enabling automation yet - who will download it, unzip it, run it and code willl produce results in correct locations, update the envelope, zip it up and send it back.

mkwong · February 2, 2018, 4:43pm

It is very common in software development tools to use “/SRC” instead of “/CODE”

MK

pavgra · February 5, 2018, 3:32pm

Good point. Also, in the same way, more abstract and common term for “DESIGN”, as for me, is “data” / “input”

YuriK · May 29, 2018, 9:22am

I have spent some time investigating the typical structure of existing OHDSI packages. As for me if we want standardization to be actively accepted within the community it makes a lot of sense to analyze what already is being used and why.

Most of the packages have typical files like

.Rbuildignore
.gitignore
DESCRIPTION
xxxx.Rproj
NAMESPACE
README.md

Most common folders are

R (source R code)
man (.Rmd files for manual generation)
inst/sql (sql templates and cohorts)
inst/java (jar files; default path for rJava::.jpackage)
inst/settings (csv, txt with cohort settings, environment snapshot, feature extractor settings, etc.)
documents (pdf files with study descriptions and related literature)
extras

There are some other folders that look very logical for me, but they are rarer than those mentioned above.
tests (R code with unit tests)
.settings (jar, .pers, jvm and java settings)
java (.java files, javadesc)
scr (.cpp code)
vignettes (.Rmd)
demo (R script with demo code)
data (.rda files with simulated data, typically required for demo code )
logo
packrat
docker
man-roxygen (.Rmd files)

Some custom folder that occurs within one or few packages

inst/cohorts (json, txt with cohort definition)
inst/markdown (.Rd)
inst/circe (json)
inst/doc

Within the mentioned folders, I do not clearly understand the role of “extras” folder as its content is different in various packages. Sometimes it contains R scripts, sometimes pdf files. Another question is related to “man-roxygen”, when it should be used instead of “man”.
@schuemie,@msuchard, @Rijnbeek, @jennareps maybe you can explain more details about the role of this folder?

BR,

Yuriy

schuemie · May 29, 2018, 9:35am

Hi @YuriK

I use the extras folder to store anything that I need as the package maintainer, but is not part of what the user needs to execute. I also store the PDF of the package manual there, so I can link to it from the main GitHub page. Normally the PDF is not considered part of the package.

The man-roxygen folder is the required name of the folder where documentation templates should be stored. For example, in the DatabaseConnector repo the parameters for the various database platforms are described only once in man-roxygen, and I then invoke that template here and here.