Today we have multiple analysis types available in Atlas: Cohort Characterization (Heracles), Incidence Rates, Population Level Effect Estimation, Patient Level Prediction. One of the core features of Atlas is that it can generate code for the created analyses and allow users to execute them externally.
The first problem is that the code export tabs do not produce self-sufficient code packs. E.g. PLE generates just one R code file, which assumes that three pre-generated cohorts (target, comparator, outcome) already exist in the DB I am going to use. Otherwise it will not work. So I have to export those cohorts first, execute them manually, and only then can I run the PLE code successfully. To me, that does not seem to be how an export feature should work. If I export something, I should be able to run it anywhere with no (or at least a minimum of) pre-conditions fulfilled. That’s why, in this case, I would suggest exporting the three cohorts together with the main R code file and creating them from the R code. (Note one: self-sufficient exported code pack. Note two: many source files)
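To make the idea concrete, here is a minimal sketch of a pack that creates its own cohorts before running the main analysis. Everything in it is an assumption for illustration only: it is written in Python rather than R, an in-memory SQLite database stands in for the real DB, the `cohort` table is reduced to two columns, and the names `run_pack`, `count_subjects` and `demo` are hypothetical, not anything Atlas generates.

```python
import sqlite3
import tempfile
from pathlib import Path


def run_pack(conn, cohort_sql_files, main_step):
    """Run each bundled cohort script, then the main analysis step."""
    for sql_file in cohort_sql_files:
        conn.executescript(Path(sql_file).read_text())
    return main_step(conn)


def count_subjects(conn):
    """Stand-in for the main analysis: count subjects per cohort."""
    rows = conn.execute(
        "SELECT cohort_definition_id, COUNT(*) FROM cohort "
        "GROUP BY cohort_definition_id ORDER BY cohort_definition_id"
    )
    return dict(rows.fetchall())


def demo():
    """Build a toy pack with three cohort scripts and run it end to end."""
    conn = sqlite3.connect(":memory:")
    # Simplified, hypothetical cohort table (two columns only).
    conn.execute(
        "CREATE TABLE cohort (cohort_definition_id INTEGER, subject_id INTEGER)"
    )
    with tempfile.TemporaryDirectory() as tmp:
        files = []
        names = ["target", "comparator", "outcome"]
        for cohort_id, name in enumerate(names, start=1):
            path = Path(tmp) / f"{name}.sql"
            # Placeholder script: a real pack would ship the exported cohort SQL.
            path.write_text(f"INSERT INTO cohort VALUES ({cohort_id}, 100);\n")
            files.append(path)
        return run_pack(conn, files, count_subjects)
```

The point of the design is only the ordering: the cohort scripts travel with the pack and are executed by the pack itself, so the main step never depends on what was already in the DB.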
Next, after we have exported several code files and executed them, the following happens: hundreds of RDS files, a couple of images, and maybe something else appear in my analysis root folder. An absolute mess. Which files are my code, and which were generated as results? Hard to answer. (Note three: many results files)
OK, let’s go further: assume that I am not a data scientist, or that I just want to check a summary of the results without exploring the whole results pack and spending a lot of time. A summary file with brief results of the executed analysis would be useful. Moreover, it would be good if it were more or less standardized, in a format such as JSON or CSV. Then it could be parsed automatically for a more user-friendly representation in web, PDF, etc. (Note four: lack of machine-readable results summary)
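As a rough illustration of such a summary file, the sketch below writes a small JSON document and reads it back. The file name `summary.json`, the keys `analysis_type` and `metrics`, and the helper names are all hypothetical; no standard schema exists yet, which is exactly what would need to be agreed on.

```python
import json
import tempfile
from pathlib import Path


def write_summary(results_dir, analysis_type, metrics):
    """Write a small machine-readable summary into the results folder."""
    summary = {"analysis_type": analysis_type, "metrics": metrics}
    path = Path(results_dir) / "summary.json"  # hypothetical file name
    path.write_text(json.dumps(summary, indent=2, sort_keys=True))
    return path


def read_summary(path):
    """Load the summary back, e.g. for rendering in a web page or a PDF."""
    return json.loads(Path(path).read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The metric below is a made-up placeholder, not a real result.
        written = write_summary(tmp, "incidence_rate", {"subjects": 100})
        print(read_summary(written))
```

Keeping the summary to plain keys and values means any front end can consume it without knowing how the analysis itself was run.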
Last but not least, it would also be helpful to have a description attached to an exported analysis: what the code does, what the expected results are, etc. (Note five: lack of docs)
So, having said all this, I would like to propose and discuss with OHDSI participants conventions for organizing analyses and their folder structure. After experiencing the problems above, I researched some existing approaches in the data science field (a few links as examples: https://drivendata.github.io/cookiecutter-data-science/#directory-structure , http://projecttemplate.net/getting_started.html , https://medium.com/human-in-a-machine-world/folder-structure-for-data-analysis-62a84949a6ce), looked through the analysis types OHDSI currently has, checked how they would fit the structure, and feel that the following folder structure could be used as a starting point:
Description of what this is all about, expected inputs and outputs
Libs or Packrat bundles used in the analysis and snapshotted to provide reproducibility
R / SQL / Python files
Input JSONs / CSVs with params (e.g. JSONs for PLE and PLP, or Cohort exported from Atlas)
All files generated during analysis execution go here
Machine-readable JSONs and CSVs with main results
E.g. reports generated based on Heracles stats
Such a pack should be self-sufficient and should not depend on the state of the DB used. It should also be reproducible and as pure (in the functional programming sense) as possible.
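To show how the structure above could be set up in practice, here is a minimal scaffolding sketch. Note that the folder and file names in it (`README.md`, `libs`, `code`, `input`, `results`, `results/summary`, `results/reports`) are hypothetical placeholders chosen for this example, and placing the summary and reports under the results folder is likewise an assumption; only the descriptions come from the list above.

```python
import tempfile
from pathlib import Path

# Hypothetical names: only the descriptions come from the proposal.
PACK_LAYOUT = {
    "libs": "Libs or Packrat bundles snapshotted for reproducibility",
    "code": "R / SQL / Python files",
    "input": "Input JSONs / CSVs with params, exported cohorts",
    "results": "All files generated during analysis execution",
    "results/summary": "Machine-readable JSONs and CSVs with main results",
    "results/reports": "Reports, e.g. generated from Heracles stats",
}


def scaffold_pack(root, description):
    """Create an empty analysis pack and return its entries, sorted."""
    root = Path(root)
    for folder in PACK_LAYOUT:
        (root / folder).mkdir(parents=True, exist_ok=True)
    # The description file explains what the pack does and what it expects.
    (root / "README.md").write_text(description + "\n")
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for entry in scaffold_pack(Path(tmp) / "my-analysis", "Demo pack"):
            print(entry)
```

Because generated files only ever land under one folder, cleaning up or re-running the analysis comes down to emptying that folder and leaving everything else untouched.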
I would be grateful for your opinions and criticism.