Dissemination of OHDSI results

schuemie · January 25, 2017, 9:34am

In OHDSI we perform studies across our network of databases, generating results. The way we currently disseminate these results is through papers such as our recent paper on treatment pathways for chronic diseases. However, the idea of a static PDF of eight pages as our end product does not fit OHDSI’s tendency toward doing large-scale analyses and building informatics solutions rather than performing studies.

How should we disseminate OHDSI results? Some ideas to put on the table:

1. We could have a database server that just houses all results from each study in its own database, without trying to harmonize across studies. We briefly tried that in the past.
2. We could create a single database of all OHDSI results, and define a unified data model that could hold all of that. FYI: I proposed something like that several years back, and called it FLOC (Formalized Latent Observational Characteristics). It was a nightmare we abandoned, since the model became more complex than the process for generating the data.
3. We could build interactive apps per study for navigating the result set. I’ve recently converted to shiny-ism, and we could easily make interactive apps like this example using CDC data. What I like about this particular example is that it is continually updated. There is no publication date after which the data collection stops.

These options aren’t mutually exclusive. I think all options still require a paper to accompany the result set or app, for one because peer review ‘validates’ the science behind the data. (Insert your snarky comment on peer review here.)

I already discussed these ideas with @Rijnbeek. I’m looking forward to what other people have to say!

rkboyce · January 25, 2017, 1:50pm

I think that topic is interesting for OHDSI folks to discuss. I wonder what the community’s use cases would be. That would help clarify what kind of solution would actually be used.

Although the NIH requires investigators to have some plan for dissemination, it has historically been an activity that tends to have little or no benefit to the researcher but real potential benefits to the research community such as transparency, reproducibility, and greater collaboration. An exemplar of the state of the art in translational medicine data dissemination is tranSMART. There are several demo tranSMART instances that one can look at. My impression is that it is a pretty basic, though generic, data sharing platform for biomedicine, but that folks seem to be adding capabilities to expand its usefulness.

Certainly, within the OHDSI research network, we would like to achieve the goals of transparency, reproducibility, and greater collaboration. To make the solution we build “sticky”, I think a more direct connection has to be made with publishing, which automatically begs the question of how a data repository and tooling would be maintained over the long haul. Another issue is how to incentivize data dissemination.

What I mean by a connection with publishing is that the published papers need to provide some manner of directing readers to the data and tools for working with that data. A lot of digital infrastructure exists that can be used for this. For example, full text articles deposited into PubMed Central can refer to supplemental data tables that link from scientific claims or data presented in the manuscript to a database that holds the original research data and some way to explore it. A very interesting example of this is Microattribution and Nanopublication as Means to Incentivize the Placement of Human Genome Variation Data into the Public Domain. An actual example publication was published in Nature Genetics where the supplemental table refers to genetic phenotype claims, the data repository where the source data can be found, publications (using PubMed identifiers) that report more details on the claims, and investigator identifiers that provide “micro-attribution”. The databases with the data can have tooling for exploring the data and acquiring it. It was reported that this simple architecture greatly increased the data contributed to a public database. I would observe that if the links to the data were established as persistent URLs (PURLs), data access should be very robust over time, since PubMed Central is likely to be permanent and the PURLs can be edited to point to new locations should the data repository host change. Another advantage, noted by the authors, is that the model includes a couple of levels of peer review (the published manuscripts and another possible review of the micro-attributions) and the potential to reward scientists in a way similar to journal citations.
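To make the PURL point concrete, here is a minimal Python sketch of the indirection involved. It is only an illustration, not how any real PURL service is implemented: the `PurlResolver` class, the PURL path, and the example.org hosts are all made up for this example.

```python
class PurlResolver:
    """Toy resolver: maps a stable PURL path to the data's current location."""

    def __init__(self):
        self._targets = {}

    def register(self, purl_path, target_url):
        """Create or update the target of a PURL; the PURL path never changes."""
        self._targets[purl_path] = target_url

    def resolve(self, purl_path):
        """Return the current location, as an HTTP redirect would."""
        return self._targets[purl_path]


resolver = PurlResolver()
# Hypothetical PURL path and hosts, used only for illustration.
resolver.register("/ohdsi/study-x/results", "https://old-host.example.org/study-x")
# The repository moves; the link printed in the paper stays the same.
resolver.register("/ohdsi/study-x/results", "https://new-host.example.org/study-x")
print(resolver.resolve("/ohdsi/study-x/results"))
```

The paper only ever cites the PURL path, so a change of repository host needs one update in the resolver rather than a correction to the published article.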

Just some thoughts…

Chris_Knoll · January 25, 2017, 3:49pm

Hey, @schuemie.

One comment about your 3 options there: it seems like #1 and #2 are describing how the data’s persisted (data structure per study or universal structure that all studies must write to). #3 is around viewing the results, but could either be a shiny app per #1, or 1 shiny app that can read any study results in #2’s format. So my question is: was there an implied data structure for #3 that distinguishes it from #1 and #2?

Re: the idea of sharing results: a long time ago I had the thought of hosting an OHDSI web service which would be able to accept a ZIP of results data and a unique “network node id” (ex: Columbia would have an ID, Janssen would have an ID, etc.), and by submitting the results it gets unzipped and imported into a results repository that is keyed by the network-id of the submitter. I was thinking of using client certificates for signing the data (for encryption as well as identity authorization), but with the recent work we’ve done with OAUTH, people could just authenticate with their own provider and we’d accept OAUTH credentials to identify the party. How the results are serialized could be either by-study (#1) or in a universal structure (#2), and we could create a viewer to view the results across network-nodes (a shiny app sounds fine to me).

Attached to the results is the version of the study, so that results are viewable side-by-side only if they came from the same version of the study. If a network-node updates their data and re-executes the same version of the study, they can always upload new results and overwrite their results in the repository, or maybe we need the idea of ‘revisions of results’ so one network could upload different results for the same study…but this could be getting into ‘overengineering’ territory.
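A rough sketch of that keying scheme, in Python and purely in memory: results are stored per (study, study version, network node), each upload becomes a new revision, and the side-by-side view only ever pulls from one study version. The class and method names are hypothetical, and a real service would sit on a database rather than a dict.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ResultsRepository:
    """In-memory stand-in for a results repository (hypothetical design)."""

    # (study_id, study_version, node_id) -> uploaded revisions, oldest first
    _store: dict = field(default_factory=lambda: defaultdict(list))

    def submit(self, study_id, study_version, node_id, results):
        """Store a new revision and return its 1-based revision number."""
        revisions = self._store[(study_id, study_version, node_id)]
        revisions.append(results)
        return len(revisions)

    def side_by_side(self, study_id, study_version):
        """Latest revision per node, restricted to a single study version."""
        return {
            node: revisions[-1]
            for (sid, version, node), revisions in self._store.items()
            if sid == study_id and version == study_version
        }


repo = ResultsRepository()
repo.submit("pathways", "1.0", "columbia", {"n": 100})
repo.submit("pathways", "1.0", "columbia", {"n": 120})  # re-run, revision 2
repo.submit("pathways", "1.1", "janssen", {"n": 90})    # different study version
print(repo.side_by_side("pathways", "1.0"))  # {'columbia': {'n': 120}}
```

Keeping every revision costs little in this form; whether the viewer should expose older revisions is the part that may be overengineering.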

How I see the workflow is in 2 parts: a server/admin side and a network participant side:

On the server/admin side:
The study is defined, wiki pages are produced, etc., and in case of #1 the data model for results is defined and the R package is created to perform the study. This R package could be based off of a template where you have to provide the study-specific serialization of the results and also implement a de-serialization on the results repository service to receive the results and persist them. If we adopt #2 (universal results structure), this doesn’t need to happen.

On the network participant side:
The study is pulled from the OHDSI repo (via drat or some other way to install the R package). The study is executed and local results are generated. Then, another function of the study package (or, if we can standardize this, a generic results submission R package) will be used to zip the results up and post them to the OHDSI results repository via HTTPS using some sort of authentication mechanism, such that when the results are received, the study version is verified and the results are persisted into the system and are now viewable to the community. If a party decides that they do want to run the study but not share the results, they are free to do so.
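The zip-and-verify step could look roughly like the Python sketch below (Python here only for illustration; the study package itself would be in R). The manifest layout, file names, and function names are assumptions, and the HTTPS upload and authentication are left out entirely.

```python
import io
import json
import zipfile


def package_results(files, study_id, study_version, node_id):
    """Zip result files together with a manifest identifying the submission."""
    manifest = {
        "study_id": study_id,
        "study_version": study_version,
        "node_id": node_id,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.json", json.dumps(manifest))
        for name, content in files.items():
            archive.writestr(f"results/{name}", content)
    return buffer.getvalue()


def receive_results(payload, expected_study_id, expected_version):
    """Server side: verify the study version before accepting the results."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        manifest = json.loads(archive.read("manifest.json"))
        submitted = (manifest["study_id"], manifest["study_version"])
        if submitted != (expected_study_id, expected_version):
            raise ValueError("results come from a different study or version")
        results = {
            name[len("results/"):]: archive.read(name).decode("utf-8")
            for name in archive.namelist()
            if name.startswith("results/")
        }
    return manifest["node_id"], results


payload = package_results({"counts.csv": "a,b\n1,2\n"}, "pathways", "1.0", "columbia")
node_id, results = receive_results(payload, "pathways", "1.0")
print(node_id, sorted(results))
```

In a real deployment the node id would come from the authenticated identity rather than from the manifest, so that a submitter cannot claim to be another node.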

So, that’s what I was envisioning as a ‘study dissemination’ life-cycle.

-Chris

jon_duke · January 25, 2017, 5:30pm

Thanks @schuemie for bringing this up and continuing to push us forward. (Nice shiny!) It’s clear that our strategy for dissemination will necessitate a requirements gathering phase where we define and prioritize our goals for dissemination. For example, how would we rank incentivizing researchers vs informing the public? Guiding clinicians vs communicating with patients? Spelling out the rigor vs conveying the bottom line?

Of course, we will be going for all of these things in one way or another. But the path forward will depend on the answers to these kinds of questions.

As much as I hate to say it, I think we need another Working Group.

gregk · January 25, 2017, 6:51pm

Hi Martijn / All,

That is a very interesting subject to discuss. You are raising a few very interesting points, but first - I would step back and look at the Study from a life cycle perspective and make sure that we have various components to support that:

a. Initiate study (set objectives)
b. Build a team
c. Design study
d. Find appropriate data sets
e. Execute study - execute code (R, SQL etc…), collect results, iterate
f. Analyze and build insights
g. Publish insights (“paper”)

I think Atlas is quite strong in #c and you are looking at improving #e.

I think that you are right saying that these are not mutually exclusive. #2 is very important - it brings consistency to data collection and results interpretation. I think this is more about making sure that the results will preserve and contain core CDM elements. #3 is equally important - associating results with a certain study. Visualization is a great feature to have but could be difficult to implement, not from a technology perspective but from a business point of view, as there could be so many variations of result data sets and resulting reports. I would add that we also need to preserve the data lineage to make sure it is clear what original data source the data is coming from - I like Chris’s idea to have a network node id.

I also like the idea of publishing (#7) papers that would have a unique PURL associated with them. Again, each paper will need to be linked to a study as well as have a link to the data sets used to produce it.

Basically, it is very important to maintain a link across ALL artifacts produced as a part of the study work and probably link them via a unique study id. It is very interesting to point out that current OHDSI tools do not view a study as a top level concept to which all work needs to be linked.

Would be great to discuss those in more detail. I am happy to share some more of the designs and ideas we have developed in this space.

Christian_Reich · January 25, 2017, 7:18pm

@gregk:

I like your idea of linking the dissemination to the creation. Without it, we’ll drown. Particularly, since we don’t really have a “we”. There are all sorts of folks who want and should want to do research, and they want that to be disseminated. I also know that you are cooking something you want to commit to the community. You should start talking about it, so we have something tangible to discuss.

In particular:

That is a huge problem. We have no way of actually pointing to the data, even though we ourselves talk about full transparency all the time. “Truven CCAE” or “PharMetrics Plus” doesn’t cut it, because everybody has a different cut and a slightly different ETL.

schuemie · January 26, 2017, 8:01am

First I’d like to push back against #2. I’m extremely skeptical of trying to create a unified repository of results (e.g. formatted as ‘nanopublications’) for two reasons: (1) modeling scientific results is really hard, and two researchers trying to model the same result will come up with different models, so results will remain disconnected, and (2) I don’t see a realistic use case where this will add value. Of course, people are more than welcome to prove me wrong.

@gregk, I like your idea of a study life cycle, so I’m repeating it here, but slightly altered:

a. Initiate study (set objectives)
b. Build a team
c. Design study
d. Find appropriate data sets
e. Execute study - execute code (R, SQL etc…), collect results, iterate
f. Analyze and build insights
g. Publish insights (e.g. paper)

What I was hoping to discuss here is g (not e). I think most of @Chris_Knoll’s comments were also about e.

I wholeheartedly agree with @jon_duke that we need to understand the requirements a lot better before we can talk about solutions. The end goal is not to write papers, the end goal (IMHO) is Informing medical decision making through observational research.

Currently, we hope that our papers are read by doctors. Either a doctor treating a patient reads our paper to make a specific clinical decision (unlikely), or other doctors may derive guidelines from our published results, and these guidelines have an impact on medical care. Another target audience may be regulators, with pretty much the same mechanism. Yet another audience is other scientists, who want to build on our work.

As you may remember, Penelope was a brave departure from this model. The idea was to enrich product labels with results from observational data, so patients and their doctors could directly glean some understanding about what was said in the labels. If the label said ‘May cause horrible things to happen’, a patient could learn whether the horrible thing happened roughly one in two, or one in a billion, which makes all the difference. (BTW: where’s the Penelope paper!?)

Any ideas on how we should go about requirement gathering?

gregk · January 26, 2017, 11:13am

I will be happy to capture and organize requirements. As a start, we could look at those life cycle phases as major components and start by organizing requirements around that, then we can continue to refine those and evolve.

rkboyce · January 26, 2017, 2:11pm

+1 for organizing requirements and thanks @gregk for volunteering to coordinate

Just wanted to make clear that microattribution is not the same as #2 and does not necessarily have anything to do with a unified repository, nanopubs or any single data model. Rather, it is an approach to taking advantage of an existing publishing framework (e.g., supplemental tables in PubMed Central) to create a different way of incentivizing data sharing and providing a simple but persistent approach to track from a published paper to one or more source data repositories that possibly provide enhanced toolsets for working with the data.

Mary_Regina_Boland · January 26, 2017, 4:02pm

I have a couple of thoughts about this.

Perhaps researchers should think about creating a resource for their particular study’s data and then publishing a data descriptor about that resource in a journal such as Scientific Data. This could be published in parallel with the main publication, which contains the analysis, etc. of the study. This would be done at the study level because certain studies or study collaborators may have different opinions about what data they want to share with the world (sharing with a researcher is a bit different than publishing results online). In this way, the decisions could be made at the study level to allow as much transparency as possible while retaining privacy when needed.

Just my two cents.

Vojtech_Huser · January 26, 2017, 5:41pm

I want to respond to what Martijn wrote.

g. Publish insights (e.g. **not** a paper, but a different medium)

What I was hoping to discuss here is g (not e).
The end goal is not to write papers, the end goal (IMHO) is Informing medical decision making through observational research.

I see a big dilemma in using large claims databases and publishing insights (and pursuing wide bigQuestions and publishing bigAnswers with our Big Data).

The data vendor is in the business of licensing data. If we go too far in publishing insights (such as the recent “all by all” project), it conflicts with this business model. Future customers can just go to our bigInsights and no longer need the underlying data.

This does not apply so much to a group of local academic medical centers. However, there are issues there too. Posting truly wide insights publicly is breaking new ground. We almost need a lawyer on our team to do this right.

schuemie · January 27, 2017, 8:16am

Thanks all for a terrific discussion!

Let me try and summarize what I’ve heard so far. I think we are saying that, in parallel to publishing papers, we’d also like to publish data sets, either with or without a visualization tool on top. These data sets:

Should be linked to a paper. The reference from the paper to the data should be long-lived, so PURLs (persistent URLs) are an obvious option.
Should link back to the process that generated the data. For example, it could reference the study package in Github. An unsolved problem is how to point to the individual databases that contributed to the result set.
Should be one-on-one linked to studies, because at the study level is where partners make data sharing agreements.
Should somehow count towards one’s academic accomplishments. One path forward here (if I understand correctly) could be publishing the data in Scientific Data in parallel to the main paper.
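One way to picture these links is as a small metadata record that travels with each published result set. The Python sketch below is only an assumption about what such a record might hold: every field name and every value is hypothetical, and describing a source by its data cut and ETL version is just one possible answer to the unsolved problem above, not an agreed one.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class DataSource:
    """One database that contributed results (all fields hypothetical)."""
    node_id: str
    database_name: str
    data_cut: str      # which extract of the licensed data was used
    etl_version: str   # which ETL produced the CDM instance


@dataclass
class ResultSetRecord:
    """Links a published result set to its paper, its code, and its sources."""
    study_id: str
    paper_purl: str
    study_package_url: str
    study_package_version: str
    sources: list = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = ResultSetRecord(
    study_id="study-x",
    paper_purl="https://purl.example.org/ohdsi/study-x/paper",
    study_package_url="https://github.com/example/study-x",
    study_package_version="1.0.0",
)
record.sources.append(DataSource("node-a", "Claims DB A", "2016-Q4", "etl-2.3"))
print(record.to_json())
```

Because the record is created per study, it fits the point that data sharing agreements are made at the study level.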

But, to @jon_duke’s point, what is the need that we’re trying to fulfill? Our target audience / users / customers are (I think) patients, doctors, regulators, pharma, and maybe others. A patient certainly has no use for a large set of results in zipped csv format, so who exactly has? Do we think other scientists will continue working on the data we’ve generated? Well, what are their needs? These are the requirements that I think we need to get clear.

lee_evans · January 27, 2017, 2:13pm

CKAN is a popular open source solution:

Take a look at the feature set here, it could provide some useful ideas for requirements:

jon_duke · January 28, 2017, 3:55pm

Great discussion. Here are thoughts on action steps:

I would be happy to work with @gregk and others (volunteers?) to lead the Dissemination WG with an initial focus on requirements. Like @schuemie, my focus is on g from the lifecycle list. (I can appreciate that other stages will also shape the dissemination story.)
Thanks to the visionaries at the F2F planning committee we have an Information Dissemination track at the March meeting. Of course, many on this thread will be in other tracks so we will want to be sure to spread the conversation outside of just the hackathon window. But please sign up at http://www.ohdsi.org/events/2017-ohdsi-collaborator-face-to-face/

rkboyce · January 28, 2017, 10:47pm

I am interested too … please keep me in the loop on a WG. Also, plan to attend the session at the F2F.

jon_duke · January 29, 2017, 7:46pm

I have set up a Wiki page for the [Dissemination WG][1] and invite those interested to edit the Objectives and add themselves to the Leads section as appropriate. For those who would like to participate whether as a WG lead or participant, please add your name to the [Dissemination Doodle Poll][2].

Hopefully we can make some initial progress prior to the F2F in March to make that time together most productive.

Thanks,

Jon
[1]: http://www.ohdsi.org/web/wiki/doku.php?id=projects:workgroups:dissemination-wg
[2]: http://doodle.com/poll/r4d93zvtcq5g7nbq

Patrick_Ryan · January 29, 2017, 9:04pm

This is a very valuable thread, and glad that many people are engaging in
it. Our shared mission is: “To improve health, by empowering a community
to collaboratively generate the evidence that promotes better health
decisions and better care.” To achieve this mission, we have to
effectively disseminate the evidence we generate to the people who can use
that evidence to inform their decision-making.

As we continue to promote large-scale analysis as the means of objectively,
reliably, and reproducibly generating evidence for clinical
characterization, population-level effect estimation, and patient-level
prediction, it is quite clear that the static 8-page black-and-white pdf
that we call a publication is not going to cut it. If we collectively
generate large resultsets of evidence, then the effective navigation,
exploration, and synthesis of the big results becomes a scientific and
engineering problem unto itself. Just as groups are still grappling with
how to explore and analyze ‘big data’ at the atomic level (in our context,
repositories of individual timestamped clinical observations for a set of
patients), we need to begin grappling with how to explore and analyze at
the molecular level (in our context, repositories of summary statistics
derived from the patient-level repositories).

It’s clear to me that, in most circumstances, there won’t be just 1 user
for each of these ‘results repositories’, nor will the results repositories
support only 1 intended use case. So, I think that we do need to clearly
delineate a few aspects:

The process that gave rise to the results repository: I strongly
support the requirement that we need the process to be fully specified and
completely reproducible. I also think, once a process is defined and
documented, that this becomes the appropriate opportunity for ‘peer
review’, so that there is general scientific agreement that execution of
the process will yield valid results. I could even imagine publications
on the protocol (pre-results) could be a reasonable target, so long as we
could find an agreeable home for such publication.
The results repository itself: it makes sense to me that, for each
study, we want to coordinate the capture of aggregate summary statistics
from across a network of data sources. I know there are many different
technology options available for persisting resultsets, but I’m partial to
the idea that all results are structured and could be stored in a
relational database in a ‘results’ schema side-by-side with the CDM, which
would then allow for subsequent analyses. To a large extent, we already
have this infrastructure in place with the way that standardized analysis
results generated in ATLAS are persisted: each source has its own ‘results’
schema, but results across sources could be further consolidated as
needed. I agree with Martijn that we should not impose a requirement of a
‘common results model’ on each resultset: it sounds smart in theory, but
in practice, it was painful when we tried it. Instead, I think the only
hard requirement is that each resultset has a defined and documented
structure that can be accommodated across the suite of supported
environments in OHDSI. Perhaps we could establish some basic conventions
for ‘best practices’ in designing a resultset structure (ex: how to support
results from multiple sources, how to support results that integrate with
the OHDSI vocabularies, etc.). Two principles I’d like to target for all
our resultsets: 1) all researchers are welcome and encouraged to submit
results into a resultset at any point (that is, results are living entities
that don’t ‘close’ with a publication or any other milestone), and 2) all
resultsets should be made publicly and freely available to all interested
users, and any restrictions to data access are fully documented (e.g. small
cell counts [n<5] are removed to protect patient privacy from some sites).
The mechanism(s) to access results from a results repository: here, I
can see an interest and potential need in exploring many different options,
from downloads of resultsets in text files to static tables/graphs in
publications to APIs that allow programmatic access within other
applications to patient-facing graphical user interfaces that allow users
to navigate and explore results. I do quite like the notion that every
large-scale research project that we tackle should have an explicit target
of how to deliver the results, and any publication which highlights the
process that gave rise to the results and provides exposure to the results
repository should also be accompanied by a mechanism for results access. I
think building stand-alone web apps, like we’ve done with ACHILLES or
PENELOPE or the Treatment Pathway study makes good sense where
appropriate. I haven’t yet built a Shiny app, but I’ve been very impressed
with what Martijn, Jenna, Peter and others have been able to do, so I’m
encouraged by that direction as well. Whatever it is we develop, I’d
recommend we involve the intended end user in the design and testing of the
solution, so that we know our evidence is having the impact we desire.
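As one concrete example of a documented access restriction, small-cell suppression before results leave a site could be sketched in Python as below. The function name and row layout are hypothetical, and the threshold of 5 simply mirrors the example above; each site would apply whatever rule its own governance requires. This variant masks the count rather than dropping the row, so the existence of the stratum stays visible.

```python
def suppress_small_cells(rows, count_field="count", min_cell_count=5):
    """Return copies of rows with counts below min_cell_count masked as None."""
    suppressed = []
    for row in rows:
        clean = dict(row)  # copy, so the caller's data is left untouched
        value = clean[count_field]
        if value is not None and value < min_cell_count:
            clean[count_field] = None
        suppressed.append(clean)
    return suppressed


site_results = [
    {"stratum": "treatment A", "count": 1250},
    {"stratum": "treatment B", "count": 3},
]
shareable = suppress_small_cells(site_results)
print(shareable)
```

Documenting the threshold alongside the result set lets users tell a masked cell apart from a true zero.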

gregk · January 29, 2017, 11:21pm

@jon_duke [quote=“jon_duke, post:16, topic:2196”]
have set up a Wiki page for the Dissemination WG and invite those interested to edit the Objectives and add themselves to the Leads section as appropriate.
[/quote]
excellent, Jon. I will create a sub-page where I will start capturing requirements. Then we can groom and prioritize them as a group. I might reach out to you offline to discuss the best way to organize and manage it.

@Patrick_Ryan [quote=“Patrick_Ryan, post:17, topic:2196”]
think building stand-alone web apps, like we’ve done with ACHILLES or
PENELOPE or the Treatment Pathway study makes good sense where
appropriate.
[/quote] Completely agree with you, Patrick. Not only does it make sense, but I think we need to look into a more modular design where we can break up this platform into functionally meaningful modules that can be integrated together - when and if it makes sense. And having a good architecture vision and a simple functional blueprint would be a good first step. I will be helping out with capturing the requirements, so I would be happy to help create a draft and facilitate this work as well.

but it may make sense to agree to exchange and store them in the same semantic model - as opposed to a physical model. All of which is already defined by OMOP CDM + vocabs. This way we can at least describe results, and that can enable us to process them - visualization, integration or publishing - a lot easier.

schuemie · January 30, 2017, 9:21am

For full transparency I’ve uploaded our prior attempt at creating a unified result format here.

Just to be clear: we’re advocating against this approach, since it is way too complicated and doesn’t seem to have any significant added value.

Christian_Reich · February 12, 2017, 3:00pm

@Vojtech_Huser: Don’t worry too much. They know that. They know that the value is in the insights, and the data is only a means. More use cases means more insights means more service business.