OHDSI Home | Forums | Wiki | Github

How should we share data files in a network study?

Hi all,

In our research studies, we share the analysis code and execute it at each data site. Afterwards, results need to be communicated back to the study coordinator. In recent studies such as our treatment pathway study, drugs in peds study, and birth month - disease risk study we used e-mail to send data files back and forth, but this is far from ideal. For example, sometimes the data files were too big to fit into someone's inbox.

Does anybody have suggestions on how better to share data files? Should we set up a central FTP server? Create a web site where people can upload their data? Create the ability to push files directly from R to Amazon cloud instances?

Have you looked at Amazon S3 (AWS S3)?

Very fast, highly available and reliable (99.999999999% durability, 99.99% availability), and it supports file versioning, lifecycle rules, and fine-grained access permissions. More importantly, they have both ISO and HIPAA security certifications, so patient-level data from EHRs or claims is protected at rest. It has a web-based client, broad API support (Java, Python, even JavaScript), a command-line interface, etc., and works on every platform.
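As an illustration of the fine-grained access permissions, a bucket policy could restrict each site to write-only access on its own prefix. This is a sketch only; the bucket name, account ID, and user are hypothetical:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SiteAUploadOnly",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:user/site-a" },
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::ohdsi-study-results/site-a/*"
    }
  ]
}
```

With a policy like this, site A can upload results under its own prefix but cannot read or modify other sites' files.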

We use it for backups of all of our source data and to share results with clients securely. You can also create completely public buckets.

# linux cli example to sync a hypothetical remote bucket to a local dir.
aws s3 sync s3://ohdsi/public/studydata /home/herrcerd/studydata
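Whatever channel is chosen, sites could also ship a checksum alongside each results file so the coordinator can verify the transfer. A minimal sketch (the file names are hypothetical stand-ins for real results files):

```shell
# Create a checksum next to the results file before sending it
# (results.csv is a stand-in for a real results file).
echo "site-A results" > results.csv
md5sum results.csv | awk '{print $1}' > results.csv.md5

# On the receiving end, recompute and compare.
computed=$(md5sum results.csv | awk '{print $1}')
expected=$(cat results.csv.md5)
[ "$computed" = "$expected" ] && echo "checksum OK"
```

As a bonus, for non-multipart S3 uploads the object's ETag is the file's MD5, so the same check can be done against S3 metadata without re-downloading the file.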


+1 for S3. But, what are the advantages / disadvantages over a hosted sftp server? HIPAA certification, going forward, seems like a major selling point.

jets3t is pretty lightweight compared to the AWS SDK for Java, so we could write an R wrapper around these Java libraries, as we did in DatabaseConnector.

@msuchard IMHO:

Disadvantages vs. hosted SFTP

  • S3 costs money: we pay about $25.00 a month to store 500 GB
    (this includes roughly 150 GB per month in data transfer fees out of AWS; transfers within AWS AZs are free).
  • It is one more thing to build into a workflow already based on SFTP.

Advantages vs. hosted SFTP

  • 99.999999999% durability / 99.99% availability SLAs (redundant across geographic regions/datacenters).
  • If you are doing batch, ETL, or other DW/pipeline work in AWS already, you get 10 Gbit/s transfers from S3 to your instances, with no data bandwidth fees (within the same AZ, assuming xl or 2xl instance types).
  • Full audit logging, permissions control, an easy-to-use web interface, and APIs.
  • Centralized ACLs/credentials: no need to manage user credentials/keys on an SFTP box.

AZ = Geographic Availability zone
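For context on the cost point above, the quoted ~$25/month is consistent with list prices of roughly $0.023 per GB-month for storage and $0.09 per GB for data transfer out; these are illustrative rates, and actual S3 pricing varies by region and over time:

```shell
# Back-of-the-envelope: 500 GB stored plus 150 GB egress at illustrative rates.
awk 'BEGIN { printf "%.2f\n", 500 * 0.023 + 150 * 0.09 }'
# prints 25.00
```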

Hi Martijn,

As you know, we have been applying this approach at Erasmus for many years now in our Remote Research Environment called OCTOPUS, which is in principle a Windows Server accessible via Remote Desktop.

This server also provides an FTP upload facility (SFTP).

All our European partners feel comfortable with this approach and, as I mentioned in another post on the forum, would have problems with these email uploads, so I am happy with this initiative!

Automatic push to a central server… I am not sure about this. It might be a nice added feature, but I think we should also encourage approval and review of the results before submission. In some of our projects, DCs really have to sign this off and need to specify with whom these results can be shared. In our case this will not be possible anyway, because our database is not connected to the internet, but this might be different at other locations.