Requirements for a data sharing solution

schuemie · March 8, 2016, 12:45pm

This is a reformulation of the discussion started here.

In our research studies, we share the analysis code and execute it at each data site. Afterwards, results need to be communicated back to the study coordinator. In recent studies such as our treatment pathway study, drugs in peds study, and birth month - disease risk study we used e-mail to send data files back and forth, but this is far from ideal. For example, sometimes the data files were too big to fit into someone’s inbox. We need to develop a solution for sharing (study) files.

Please feel free to propose requirements for that solution. To start the list, here are some requirements I can think of:

Should be able to handle large files (which rules out e-mail)
Security: even though OHDSI is all about transparency and openness, most sites would prefer not to make intermediate study files public, at least not before a paper is published. Often access to files should therefore be limited to those involved in the study.
Versioning: Often studies go through several iterations before the final results are produced. Ideally the solution should help keep track of the different versions submitted by sites.

Let me know if you agree or disagree with these, or would like to add additional requirements!

Rijnbeek · March 8, 2016, 5:14pm

See my reply in your other discussion.

Yes
Yes, my experience is that even if the paper is published, not all DCs are willing to make the data publicly available. We are for example also sharing patient level data (for example case control sets) that should remain in the Remote Research Environment. This might be different in OHDSI. As you know we encrypt the data before upload, but that might not be necessary with the current solutions (although it would be an extra level of security for the DCs).
Yes, but this should cover versions of both the data and the tools. We have even been experimenting with checksums on the data files to see if the data files have changed over the different runs. Ideally, the tools should add their version information to the output. Even better, as we do in Jerboa, the script that was used is added to the upload automatically.
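
As an illustration of the checksum idea mentioned above, here is a minimal sketch in Python. It is not how Jerboa does it; the function names `file_sha256`, `build_manifest`, and `changed_files` are hypothetical, and SHA-256 is an assumed choice of hash. It records one checksum per result file and reports which files are new or differ between two runs.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file name directly inside `folder` to its checksum (hypothetical helper)."""
    return {p.name: file_sha256(p) for p in sorted(Path(folder).iterdir()) if p.is_file()}


def changed_files(previous, current):
    """Return the names in `current` that are new or whose checksum differs from `previous`."""
    return sorted(name for name, checksum in current.items()
                  if previous.get(name) != checksum)
```

A site could store the manifest next to each upload, so the study lead can compare the manifests of two runs without opening the data files themselves.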

Some extra points based on recent discussions we are having in EMIF, where we are developing a new file sharing solution as well (I am only advising in that team, not developing):

Notifications would be great to add: a DC uploads and the study lead is informed automatically. Even nicer if this is coupled to a workflow system like TASKA (a tool being developed in EMIF) where the study lead can log in and see the progress made by all participants. This is being explored in EMIF.
You could think of a handshake: (1) the DC uploads, (2) the PI gets a notification, (3) the PI accepts the data after review. We do this more or less manually in the data sharing procedure via SFTP with email templates.
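
The handshake described above could be modelled as a small state machine. The following Python sketch is only an illustration under assumptions: the class and status names are hypothetical, and the `REJECTED` outcome is an addition not mentioned in the post (the post only describes acceptance after review).

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    UPLOADED = "uploaded"   # step 1: the DC has uploaded the file
    NOTIFIED = "notified"   # step 2: the PI has been notified
    ACCEPTED = "accepted"   # step 3: the PI accepted the data after review
    REJECTED = "rejected"   # assumption: a review can also fail


# Allowed transitions in this hypothetical handshake.
TRANSITIONS = {
    Status.UPLOADED: {Status.NOTIFIED},
    Status.NOTIFIED: {Status.ACCEPTED, Status.REJECTED},
    Status.ACCEPTED: set(),
    Status.REJECTED: set(),
}


@dataclass
class Submission:
    site: str
    filename: str
    status: Status = Status.UPLOADED
    history: list = field(default_factory=list)

    def advance(self, new_status):
        """Move to `new_status`, refusing any step the handshake does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status.value} to {new_status.value}")
        self.history.append(self.status)
        self.status = new_status
```

Keeping the allowed transitions in one table makes it easy to see, per site, how far a submission has progressed and which step is still pending.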

Happy to help out if you need me Martijn.

Vojtech_Huser · March 10, 2016, 8:29pm

This is only a partial response.

For studies that export non-patient level data and for small studies (under 15MB when 7-zipped or zipped), email can still be useful.
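
A site could check this threshold before deciding how to send results. The sketch below is a rough illustration in Python: `zip_folder` and `fits_in_email` are hypothetical names, the 15 MB limit is simply the figure quoted above (real mail servers differ), and standard zip compression stands in for 7-zip.

```python
import zipfile
from pathlib import Path

# The 15 MB figure from the post; an assumption, since actual limits vary by mail server.
EMAIL_LIMIT_BYTES = 15 * 1024 * 1024


def zip_folder(folder, archive_path):
    """Zip all files under `folder` into `archive_path` and return the archive size in bytes.

    `archive_path` should be outside `folder`, otherwise the archive would include itself.
    """
    folder = Path(folder)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(folder).as_posix())
    return Path(archive_path).stat().st_size


def fits_in_email(size_bytes, limit=EMAIL_LIMIT_BYTES):
    """Return True if an archive of this size is under the assumed e-mail limit."""
    return size_bytes < limit
```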

Versioning can be addressed by appending v01 or v02 to the file name. Typically a PI can only handle a finite number of revisions of results (e.g., under 5 or 10?).
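
The suffix convention above is easy to automate. This is a minimal Python sketch, assuming the suffix takes the form `_vNN` before the file extension; the function name `next_version_name` and the exact pattern are illustrative choices, not an agreed OHDSI convention.

```python
import re
from pathlib import Path

# Assumed naming pattern: "<name>_vNN" before the extension, e.g. results_v01.zip
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<num>\d{2,})$")


def next_version_name(filename, existing):
    """Return `filename` with the next free _vNN suffix, given the names already submitted."""
    path = Path(filename)
    base = path.stem
    match = VERSION_PATTERN.match(base)
    if match:
        # Strip an existing suffix so "results_v02.zip" and "results.zip" share one base name.
        base = match.group("stem")
    highest = 0
    for name in existing:
        other = Path(name)
        m = VERSION_PATTERN.match(other.stem)
        if m and m.group("stem") == base and other.suffix == path.suffix:
            highest = max(highest, int(m.group("num")))
    return f"{base}_v{highest + 1:02d}{path.suffix}"
```

For example, `next_version_name("results.zip", ["results_v01.zip"])` returns `"results_v02.zip"`.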

Encrypting the emailed zip file and using a non-email mechanism to disclose a very strong password is an alternative worth considering.

schuemie · March 15, 2016, 7:32am

Thanks for the input so far!

I just thought of another one:

Data upload should be separate from the study script itself (some sites have their CDM data on computers not connected to the internet)

npuntikov · March 28, 2016, 10:22pm

Dear colleagues,

I’d like to add two topics to this discussion:

Given the complexity of the problem, I believe the creation of a dedicated working group within the OHDSI Wiki would be justified. Clearly, the need is there, given the constant stream of debate items where people are discussing requirements and use cases related to the technical execution of studies across a network of distributed databases.
Arachne: an intelligent automated workflow facilitating secure access to patient data and enabling efficient scientific research based upon statistical evidence. We presented a poster about it at the OHDSI symposium last October (http://www.ohdsi.org/web/wiki/lib/exe/fetch.php?media=resources:arachneposterabstract.pdf). Arachne is supposed to do all the things being discussed: various mechanisms for distributed query execution, error handling, result aggregation and distribution, reuse of query and data, and version control, as well as data protection, network security and corporate governance. It would be great to have a working demo of Arachne, which we could roll out to collaborators at the next Symposium. We are prepared to put some skin in the game and allocate resources to implementation, but we cannot be successful in a vacuum, without interaction with you.

Please share your thoughts. Does this initiative make sense? I know I am dropping a little bit out of nowhere into this… My weak excuse is that we’ve been discussing this with Christian and Patrick for a while. And now seems to be the right time to start doing something.

Thank you,
Nick

schuemie · March 29, 2016, 7:23am

Thanks Nick for sharing that, and sorry I missed Arachne earlier.

I’m worried that we’re starting to mix many different things in this discussion. I started with saying some files are too big for e-mail and we need a solution for that, and now we’re talking about a full research infrastructure.

In my humble opinion, the OHDSI research infrastructure should have roughly these components:

Database Wiki (building on our current single page)

One Wiki page per database in OHDSI
Each site maintains own Wiki page
Need template for these pages, which should include:
Country?
Nature of database (claims? Hospital EHR?)
Contact details
With possibility to link to Achilles or Iris results

(This Wiki should be brief in my opinion. Several initiatives like this already exist, relying on huge questionnaires. I’m skeptical about the value of those.)

Workflow management system

Ability to register new study with lead investigator and participating sites
Keeps track of requests and files sent
Supports web forms
Will send reminders

Analysis sharing technology

Study R packages (like those in our StudyProtocols repo)
SQL
Circe / Calypso definitions + Heracles results?

Data sharing technology

E-mail
FTP?
Amazon S3?

Common quality framework

Share at least Achilles Heel results on database Wiki?
Needs much work!

Common Data Model

OMOP CDM

Common software stack

WebAPI
Atlas
R

Common technology stack (for the database sites)

Windows, macOS, or Linux
PostgreSQL, Oracle, SQL Server, Redshift, or Microsoft APS
Possible tiers of technology stack (e.g. able to run in the Amazon cloud, GPUs available, beefy machines for advanced analytics)

For now, I’d just like to think about a data sharing solution, but I agree we should keep in mind how this will fit together with all the other pieces of the puzzle.

@Frank: How does this fit with the ideas of the Architecture workgroup?

Christian_Reich · April 6, 2016, 4:50pm

@schuemie:

Not sure I understand you. You say we don’t need a working group, but then you list 10 items that need to be solved, and you dump it on another working group who certainly don’t think it’s their job. What’s wrong?

schuemie · April 7, 2016, 6:52am

@Christian_Reich, just for the record: I never said we don’t need another working group. I merely listed all the things I think we need for a fully functional research network. (And I remembered another thing to add to the list: Governance, which deals with things like Data Use Agreements).

How we get this work done is another question. We could create a ‘Research Infrastructure Workgroup’, but I have some concerns about that:

The scope of this workgroup would be hard to define. Is it ‘everything in OHDSI that doesn’t fall into any other workgroup’? Certainly this workgroup wouldn’t occupy itself with the CDM, because we have a workgroup for that, or with the common technology stack, because that is already specified by the architecture and methods workgroups. Nor should it have to solve the problem of a data quality framework; that is a whole workgroup in its own right.
Having a workgroup is not the same as solving the problem.
I must admit to workgroup fatigue, but maybe I’m the only one.

As a side note, I personally like things to grow organically. We were doing fine with e-mail, but now we need more so we should create a better file sharing solution. Currently, nobody has run into the obstacle of workflow management, so let’s not divert all our energy into solving that problem yet. Instead, I would argue our current biggest hurdle is getting everyone in OHDSI capable of leading and performing network studies, or phrased differently, ‘how to translate vague research interests into executable R packages’.